InputStreamResponseListener occasionally hangs waiting for response body chunk #7192
Comments
What OpenJDK vendor are you using? This looks suspicious:
It's the first time I have seen a "waiting to re-lock" and a "waiting on", as I don't think we have such a case in the Jetty code where you could actually see a "waiting to re-lock", unless you are really lucky with the timing of the thread dump. I found this, not sure it's related:
If you can reproduce it, can you please enable DEBUG logging and attach the logs? Can you try a more recent Java version (e.g. Java 17)?
Thank you for your help! 😊
java.vendor = AdoptOpenJDK openjdk version "11.0.7" 2020-04-14
We have tried to enable debug on
We might be able to try; the issue is that our project has a lot of dependencies, and some of them might not like a newer Java version 😓
As a shot in the dark, we recently fixed #7157. Can you try the latest HEAD code for the branch? I know the logs are huge, but if you can reproduce, they might be the only way to understand what's going on.
For sure, we will try.
@patsonluk there are snapshot builds available via the official snapshot repository at https://oss.sonatype.org/content/repositories/jetty-snapshots/
You'll want to use
@joakime thanks! I read this late and built it myself 😓 Unfortunately, that did not seem to fix our issue:
The stack trace you report is different from the one in the initial description of this issue, no? At this point we need to be able to reproduce it ourselves, get DEBUG logs from you, or get a reproducible standalone case. Does it fail on non-Mac? Does it fail with different Java versions?
Sorry about the confusion. The stack trace quoted in the initial comment is from when the thread hangs, while the previous stack trace is from when the hang is interrupted, as eventually there's a "go away" event which terminates the connection. There are 2 layers to this problem:
Thanks! We will see if we can isolate and reproduce the problem using jetty only.
As for Mac, we tried with
As for other OSes, our production environments are on
@patsonluk we're interested in getting to the bottom of this issue. Let us know if you can isolate it; otherwise we'll need instructions on how to run this test with your Solr fork ourselves.
@sbordet We've listed some instructions on how to reproduce this from a Solr load-testing suite: https://issues.apache.org/jira/browse/SOLR-15840. We'll also attempt to isolate this.
@sbordet just to add on top of what @chatman offered (which is more complete: it builds/creates Solr nodes, indexes fresh docs, and then executes queries, and the querying is the part that hangs), here's a self-contained archive which already has a built Solr with nodes pre-filled with data, plus a simple Java program inside, so you can trigger the hang as well. The biggest issue with this program, though, is that it might take more than an hour before the issue is triggered. https://drive.google.com/file/d/1vPjD2oQvdpuk6YXXHLzwOZpULEo-tG7A/view?usp=sharing
Steps:
We have also tried to take apart logic from Solr and build a standalone app to reproduce the issue, but so far no luck. Please let us know if you have any questions! And many thanks for the assistance!
I'm on
Is there any way to execute Solr so that it's a normal Java process (so it can be seen with jmc/jconsole/jps/jstack/etc.)? After about 10 minutes I got ...
Yup, the JVM crashed under solr...
They eventually run as normal Java processes, so you should be able to see them in jconsole/jstack. I usually just
You shouldn't be getting
You can see the log file location by
@chatman @patsonluk we tried to reproduce the problem with your instructions. We were able to get a failure, so we instrumented the client to understand why it was waiting, and found that it was waiting for a request whose headers arrived fine (with a 200 status code), but then the client was waiting for the response body (no errors). We tried to track this request in the server logs, but we could not find any information. From our point of view, the client received the response headers and is waiting for the content, but it never arrives. At this point I'm throwing the ball into your court: are you sure that you are sending the response content?
Can you enable JMX on the node server and take a
If it's easy for you to reproduce, can you take a
The tactic would be as follows:
Any suggestion welcome.
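For reference, a minimal sketch of exposing an embedded Jetty server over JMX so that its internal state can be inspected and dumped from JConsole/JMC. This assumes a plain embedded `Server` with the `jetty-jmx` module on the classpath; it is not how Solr itself wires up Jetty or JMX, and the port is a placeholder.

```java
import java.lang.management.ManagementFactory;

import org.eclipse.jetty.jmx.MBeanContainer;
import org.eclipse.jetty.server.Server;

public class JmxDumpSketch
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server(8080); // placeholder port; configure connectors/handlers as needed

        // Export Jetty components as JMX MBeans, so tools such as JConsole/JMC
        // can browse them and invoke the "dump" operation on the Server.
        MBeanContainer mbeanContainer = new MBeanContainer(ManagementFactory.getPlatformMBeanServer());
        server.addBean(mbeanContainer);

        server.start();

        // The same detailed state is also available programmatically.
        System.err.println(server.dump());

        server.join();
    }
}
```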
As a side note, in your setup there are many Jetty jars from 9.4.34, some from 9.4.44-SNAPSHOT and some from 9.4.45-SNAPSHOT. |
On the side, we have filed #7259. It does not solve this issue, but it will avoid the infinite wait on the client.
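For context, a caller can also bound the wait on its side with a request timeout and a timed `get()`, so a missing body chunk surfaces as a failure instead of an indefinite block. A minimal sketch, assuming a plain `HttpClient` and a placeholder URL and timeout values (this is not the change made in #7259):

```java
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Response;
import org.eclipse.jetty.client.util.InputStreamResponseListener;

public class BoundedWaitSketch
{
    public static void main(String[] args) throws Exception
    {
        HttpClient httpClient = new HttpClient();
        httpClient.start();
        try
        {
            InputStreamResponseListener listener = new InputStreamResponseListener();

            // A total timeout fails the exchange if the response does not
            // complete in time, which in turn unblocks the listener.
            httpClient.newRequest("http://localhost:8080/test") // placeholder URL
                .timeout(30, TimeUnit.SECONDS)
                .send(listener);

            // Wait for the response headers, but not forever.
            Response response = listener.get(10, TimeUnit.SECONDS);
            System.err.println("status: " + response.getStatus());

            // Reads are unblocked (with an exception) when the request
            // timeout above fires, instead of waiting indefinitely.
            try (InputStream body = listener.getInputStream())
            {
                while (body.read() != -1)
                {
                    // Consume and discard the response content.
                }
            }
        }
        finally
        {
            httpClient.stop();
        }
    }
}
```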
@sbordet Thanks for the updates! Here is the thread dump on the data node when my single-thread generator hangs (on a MacBook):
And also another thread dump from when the data node is serving the single-thread generator normally:
As for
Additionally, we added some debug logging on the Solr data node to ensure that the hanging thread on the query aggregator node side (i.e. the client, which makes 1000 calls to a Solr "shard") has corresponding handling done on the data shard side. Basically, we added some extra debug in this https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/response/BinaryResponseWriter.java#L57 call, which writes using https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L172, which eventually calls the Jetty
Are there any other ways that we can debug with some context? I also tried enabling Jetty DEBUG logging on the data node, but it generated too much I/O load, so it's hard to reproduce the issue. Are there any particular classes we can turn DEBUG on for that might help (i.e. reduce the number of logs generated)? Many thanks for your help again!!
Adding a
This is the dump (I only sent the last 5000 lines, otherwise the file is way too big, 1 GB+).
The hang starts happening at around
@sbordet any updates please? 😊
@patsonluk the traffic dump is not good. Please use:
Jetty version(s)
jetty-client-9.4.34.v20201102 and jetty-client-9.4.44.v20210927
Java version/vendor
(use: java -version)
openjdk version "11.0.7" 2020-04-14
OS type/version
ProductName: macOS
ProductVersion: 11.6.1
BuildVersion: 20G224
Description
We work with a fork of lucene solr which uses jetty http client/server for communication between hosts (solr nodes).
In particular, there's a node that does query aggregation by concurrently sending thousands of HTTP/2 requests (with the Jetty HTTP/2 client) to each data node (served by Jetty as well). Each node sends back one HTTP response to the query aggregator node. All the operations/communications usually complete in way under a second.
We noticed that, once in a while, all the data nodes indicate the response has been sent (all nodes within the same second), but on the query aggregator side one thread can be left waiting for hours, eventually completing after that long wait.
By inspecting the thread dump, we see that such a hanging thread has already read the response headers, but has been waiting for a response body chunk to come in for a long time (i.e. InputStreamResponseListener#onHeaders has been invoked, and the thread is waiting on the lock in InputStreamResponseListener$Input#read). Those responses are usually very small, <1 kB.
All of them are waiting at (jetty-client-9.4.34.v20201102) https://github.com/eclipse/jetty.project/blob/jetty-9.4.34.v20201102/jetty-client/src/main/java/org/eclipse/jetty/client/util/InputStreamResponseListener.java#L318, while such a lock, under normal execution flow, should have been released by onContent: https://github.com/eclipse/jetty.project/blob/jetty-9.4.34.v20201102/jetty-client/src/main/java/org/eclipse/jetty/client/util/InputStreamResponseListener.java#L124
Take note that if we use the Jetty HTTP/1 client instead, the issue cannot be reproduced anymore.
How to reproduce?
We can consistently reproduce this by triggering the query logic in the Solr query aggregator node, which sends 1000 concurrent requests to a single data node. Take note that in this case both the Solr query aggregator node and the data node are deployed on the same machine (my MacBook); they are just bound to different ports.
This is the thread dump when we use a load generator with 1 thread (i.e. allowing the query aggregator node to finish one query before sending in another one):
solr-jetty-1-thread-dump.txt
And the thread dump when we use the same load generator but with 3 threads (all Jetty HTTP client threads on the query aggregator node eventually hang; they start hanging at different times, around 10–15 mins apart):
solr-jetty-3-thread-dump.txt
We are trying to see if we can reproduce it by removing Solr from the equation and using only Jetty; a rough sketch of what such a standalone reproducer might look like is below.
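The sketch below is only an outline under assumptions (placeholder URL, request count, and timeouts; plain `http://` with no TLS), not the actual Solr code path: an HTTP/2 `HttpClient` firing many concurrent requests and reading each response through an `InputStreamResponseListener`, mirroring the aggregator-to-data-node pattern described above.

```java
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.util.InputStreamResponseListener;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

public class Http2HangReproducerSketch
{
    public static void main(String[] args) throws Exception
    {
        // HTTP/2 transport, matching the aggregator-to-data-node setup.
        HTTP2Client http2Client = new HTTP2Client();
        HttpClient httpClient = new HttpClient(new HttpClientTransportOverHTTP2(http2Client), null);
        httpClient.start();

        int requests = 1000; // placeholder, mirroring the 1000 concurrent shard requests
        CountDownLatch latch = new CountDownLatch(requests);

        for (int i = 0; i < requests; i++)
        {
            new Thread(() ->
            {
                try
                {
                    InputStreamResponseListener listener = new InputStreamResponseListener();
                    httpClient.newRequest("http://localhost:8080/test") // placeholder URL
                        .send(listener);

                    // Wait for the response headers.
                    listener.get(30, TimeUnit.SECONDS);

                    // This is where the hanging threads block in the thread dumps:
                    // reading the response body via InputStreamResponseListener$Input.read().
                    try (InputStream body = listener.getInputStream())
                    {
                        while (body.read() != -1)
                        {
                            // Consume and discard the small response body.
                        }
                    }
                }
                catch (Exception x)
                {
                    x.printStackTrace();
                }
                finally
                {
                    latch.countDown();
                }
            }).start();
        }

        // If the bug reproduces, this await never completes because at least
        // one thread never sees the end of its response body.
        latch.await();
        httpClient.stop();
    }
}
```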