Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Research] [BUG] Extension requests show "false" timeout for first few requests to extension due to unknown bug #8652

Closed
DarshitChanpura opened this issue Jul 12, 2023 · 9 comments
Assignees
Labels
bug Something isn't working extensions

Comments

@DarshitChanpura
Copy link
Member

DarshitChanpura commented Jul 12, 2023

Describe the bug
Latest changes introduced via #7957 causing request to false "timeout" even though the request was sent to and processed by extension.
Update: These changes didn't cause the issue. It seems to have existed before this PR was merged.

To Reproduce
Steps to reproduce the behavior:

  1. Checkout main and start helloWorld extension from main on SDK
  2. Start OpenSearch
  3. Initialize the extension following DEVELOPER_GUIDE
  4. Hitting curl -XGET http://localhost:9200/_extensions/_hello-world/hello should produce No response from extension to request.%
  5. You should also see this in OpenSearch logs:
[2023-07-11T19:11:12,938][INFO ][o.o.e.r.RestSendToExtensionAction] [runTask-0] Forwarding the request GET /hello to {hello-world}{hello-world}{UB8MdMJMRoGqXkceAEFC_g}{127.0.0.1}{127.0.0.1:4532}{dimrs}
[2023-07-11T19:21:22,969][INFO ][o.o.e.r.RestSendToExtensionAction] [runTask-0] Received response from extension: OK
  1. After first 5-6 requests that timeout, all the requests that follow should prrocess as normal.

Default timeout is 10 seconds for Extension requests.

Expected behavior
Request should not timeout.

@dbwiddis
Copy link
Member

Personally I've been seeing a lot of (apparently random) timeout issues like this that I believe are associated with Netty Channels and mixing between REST streams and Transport streams. Digging into the logs, the commonality seems to be that the Transport listener doesn't process its onResponse method (which completes the Future) until after the HTTP channel finishes its work.

@cwperks
Copy link
Member

cwperks commented Jul 12, 2023

I have encountered this issue too

@DarshitChanpura
Copy link
Member Author

DarshitChanpura commented Aug 10, 2023

Update: Sought inputs from reta to understand why CompletableFuture is causing the issue. @reta Was unable to re-produce this issue and he had the fresh setup with security plugin installed and main branch of SDK, which brings forth the possibility of stale dependency JARs. Will look into it further.

Update: I was able to re-produce it with a fresh setup on Windows machine. (without security plugin installed)

@DarshitChanpura
Copy link
Member Author

Here are the logs for timeouts:

False timeouts

3.0: (OpenSearch main SDK main) (with-security)

First 3 requests timeout after 10s (default timeout) even though response sent is OK from extension

OpenSearch logs

[2023-08-09T15:35:43,224][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{PjFNOCcSQu2CsJlRMtDSTQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T15:35:53,241][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T15:37:57,205][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{PjFNOCcSQu2CsJlRMtDSTQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T15:38:07,221][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T15:38:18,550][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{PjFNOCcSQu2CsJlRMtDSTQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T15:38:28,563][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK

Hello-world extension logs

15:35:43.243 [opensearch[hello-world][generic][T#3]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
15:37:57.220 [opensearch[hello-world][generic][T#4]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
15:38:18.560 [opensearch[hello-world][generic][T#2]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK

Console:

➜  ~ curl -XGET https://admin:admin@localhost:9200/_extensions/_hello-world/hello --insecure
No response from extension to request.%
➜  ~ curl -XGET https://admin:admin@localhost:9200/_extensions/_hello-world/hello --insecure
No response from extension to request.%
➜  ~ curl -XGET https://admin:admin@localhost:9200/_extensions/_hello-world/hello --insecure
No response from extension to request.%

Now the 4th request is successful:

OpenSearch logs:

[2023-08-09T15:41:19,513][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{PjFNOCcSQu2CsJlRMtDSTQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T15:41:19,561][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK

Hello-world extension logs:

15:38:18.560 [opensearch[hello-world][generic][T#2]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK

Console:

➜  ~ curl -XGET https://admin:admin@localhost:9200/_extensions/_hello-world/hello --insecure
Hello, World!%

Saw the same result without-security too.

2.9: (OpenSearch 2.9.1 SDK 0.1) (Without-security)

First 3 requests timeout after 10s:

OpenSearch logs: (2.9 branch)

[2023-08-09T16:01:25,778][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{vReRvV3KT8-lIo4AODPJGg}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:01:35,800][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:02:01,142][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{vReRvV3KT8-lIo4AODPJGg}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:02:11,156][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:02:19,879][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{vReRvV3KT8-lIo4AODPJGg}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:02:29,890][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK

Hello-world extension : (0.1 branch)

16:01:25.794 [opensearch[hello-world][generic][T#2]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:02:01.156 [opensearch[hello-world][generic][T#4]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:02:19.892 [opensearch[hello-world][generic][T#2]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK

Console:

➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%

4th request onwards it is successful

Opensearch logs:

[2023-08-09T16:02:34,135][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{vReRvV3KT8-lIo4AODPJGg}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:02:34,139][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:04:34,100][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{vReRvV3KT8-lIo4AODPJGg}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:04:34,116][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK

Extension logs:

16:02:34.138 [opensearch[hello-world][generic][T#3]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:04:34.113 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK

Console:

➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
Hello, World!%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
Hello, World!%

Same results when waiting for at-least a min between each request:

Opensearch logs:

[2023-08-09T16:12:50,886][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{gQ8QXqmNSLC5w_iM7FYXMQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:13:00,906][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:16:18,542][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{gQ8QXqmNSLC5w_iM7FYXMQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:16:28,562][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:17:58,031][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{gQ8QXqmNSLC5w_iM7FYXMQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:18:08,045][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:19:35,936][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{gQ8QXqmNSLC5w_iM7FYXMQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:19:35,949][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK
[2023-08-09T16:19:40,110][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Forwarding the request GET /hello to {hello-world}{hello-world}{gQ8QXqmNSLC5w_iM7FYXMQ}{127.0.0.1}{127.0.0.1:4500}{dimrs}
[2023-08-09T16:19:40,120][INFO ][o.o.e.r.RestSendToExtensionAction] [3c06300b34da.ant.amazon.com] Received response from extension: OK

Extension logs:

16:12:50.902 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:16:18.553 [opensearch[hello-world][generic][T#4]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:17:58.061 [opensearch[hello-world][generic][T#3]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:19:35.945 [opensearch[hello-world][generic][T#2]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK
16:19:40.118 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsRestRequestHandler - Sending extension response to OpenSearch: OK

Console:

➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
No response from extension to request.%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
Hello, World!%
➜  ~ curl -XGET http://localhost:9200/_extensions/_hello-world/hello
Hello, World!%

@reta
Copy link
Collaborator

reta commented Aug 15, 2023

What we figured out with @DarshitChanpura :

  • RestSendToExtensionAction (as all other REST actions) is executed in the context of Netty transport_worker
  • RestSendToExtensionAction internally attempts to send another transport request (to extension)
  • but, what is most important here, RestSendToExtensionAction blocks the transport_worker thread, which basically blocks Netty NioEventLoop :(

The conclusion seems to be that RestSendToExtensionAction is not playing well by OpenSearch rules. It should be using final NodeClient client instead of directly accessing the transport service.

@cwperks
Copy link
Member

cwperks commented Aug 15, 2023

@dbwiddis @saratvemulapalli ^ You may be interested in @reta's findings here. Thank you @reta!

@DarshitChanpura
Copy link
Member Author

From our research, we discovered that join() blocks a Netty transport_worker thread waiting for response from extension. Even though response is received from extension, this block is not released due to some unpredictable behavior causing first few requests to timeout.

As a follow-up item from this research:

  • Identify and implement alternative approach to using transportService to send requests

@DarshitChanpura
Copy link
Member Author

DarshitChanpura commented Aug 16, 2023

Here is debugger flow chart for happy path and unhappy path:

Happy path:

flowchart TB
    subgraph Origin
        User 
    end
    
    subgraph RestSendToExtensionAction
        User --> |1| B[transportService.sendRequest]
        B --> C[inProgressFuture.join]
    end
    
    subgraph SDK
        C -->|handleRestExecuteOnExtensionRequest| D[Request Handled]
        D -->|Response Sent| E[Response Sent to OS Cluster]
    end

    subgraph RestSendToExtensionAction
        E --> F[restExecuteOnExtensionResponseHandler.handleResponse]
        F --> G[inProgressFuture.complete]
    end
    
    G --> User
Loading

Unhappy path:

flowchart TB
    subgraph Origin
        User 
    end
    
    subgraph RestSendToExtensionAction
        User --> |1| B[transportService.sendRequest]
        B --> C[inProgressFuture.join]
    end
    
    subgraph SDK
        B -->|handleRestExecuteOnExtensionRequest| D[Request Handled]
        D -->|Response Sent| E[Response Sent to OS Cluster]
    end

    subgraph RestSendToExtensionAction
        C --> H[TimeoutException after 10 sec]
        E --> F[restExecuteOnExtensionResponseHandler.handleResponse]
        F --> G[inProgressFuture.complete]
        G --> X[response dies somewhere]
    end
    
    H --> |No Response received from Extension| User
Loading

We can see different behaviors of inProgressFuture in both cases which is likely the root cause of the issue. I was able to verify that inProgressFuture is handled on same transport_worker thread, Response is handled on same generic worker thread.

1 -> First step in the flow

Setup:

  1. ./gradlew run --debug--server-jvm and then connect your Remote JVM debug to port 5005 (in my case it was Intelli J)
  2. ./gradlew helloWorld in debug mode from Intelli J (Add a debug configuration)

@DarshitChanpura DarshitChanpura changed the title [BUG] Extension requests show "false" timeout for first few requests to extension due to unknown bug [Research] [BUG] Extension requests show "false" timeout for first few requests to extension due to unknown bug Aug 18, 2023
@DarshitChanpura
Copy link
Member Author

Closing this to move all further discussions about the approach here: #9435

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working extensions
Projects
None yet
Development

No branches or pull requests

4 participants