
InputStreamResponseListener occasionally hangs waiting for response body chunk #7192

Closed
patsonluk opened this issue Dec 1, 2021 · 23 comments
Labels: Bug (For general bugs on Jetty side)

Comments

@patsonluk

patsonluk commented Dec 1, 2021

Jetty version(s)
jetty-client-9.4.34.v20201102 and jetty-client-9.4.44.v20210927

Java version/vendor (use: java -version)
openjdk version "11.0.7" 2020-04-14

OS type/version
ProductName: macOS
ProductVersion: 11.6.1
BuildVersion: 20G224

Description
We work with a fork of Lucene/Solr which uses the Jetty HTTP client/server for communication between hosts (Solr nodes).

In particular, there's a node that does query aggregation by concurrently sending thousands of HTTP/2 requests (with the Jetty HTTP/2 client) to the data nodes (each served by Jetty as well). Each data node sends back one HTTP response to the query aggregator node. All of these operations/communications usually complete well under a second.

We noticed that once in a while, all the data nodes indicate the response has been sent (all nodes within the same second), but on the query aggregator side one thread can be left waiting for hours before it eventually completes.

Inspecting the thread dump, the hanging thread has already read the response headers but has been waiting a long time for a response body chunk to come in (i.e. InputStreamResponseListener#onHeaders has been invoked, and the thread is waiting on the lock in InputStreamResponseListener$Input#read). Those responses are usually very small, under 1 kB.

All of them are waiting at (jetty-client-9.4.34.v20201102) https://github.com/eclipse/jetty.project/blob/jetty-9.4.34.v20201102/jetty-client/src/main/java/org/eclipse/jetty/client/util/InputStreamResponseListener.java#L318 , while under the normal execution flow such a wait should have been ended by a signal from onContent: https://github.com/eclipse/jetty.project/blob/jetty-9.4.34.v20201102/jetty-client/src/main/java/org/eclipse/jetty/client/util/InputStreamResponseListener.java#L124
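For context, the read pattern on our side looks roughly like the following (a minimal sketch of the standard InputStreamResponseListener usage over the HTTP/2 client transport; the URL is a placeholder, not our actual Solr request):

```java
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Response;
import org.eclipse.jetty.client.util.InputStreamResponseListener;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

public class ShardRequestSketch
{
    public static void main(String[] args) throws Exception
    {
        // HTTP/2 transport, as used between the query aggregator and the data nodes (plain-text h2c here).
        HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()), null);
        client.start();

        // Asynchronous send; the listener bridges the async response events to a blocking InputStream.
        InputStreamResponseListener listener = new InputStreamResponseListener();
        client.newRequest("http://localhost:8983/solr/some_collection/select?q=*:*")
            .send(listener);

        // Headers arrive first (onHeaders fires and get() returns) ...
        Response response = listener.get(15, TimeUnit.SECONDS);
        System.out.println("status: " + response.getStatus());

        // ... then read() blocks until onContent delivers body chunks (or onSuccess/onFailure ends the stream).
        // This read() is where the hanging threads in our thread dumps are sitting.
        try (InputStream body = listener.getInputStream())
        {
            byte[] buffer = new byte[4096];
            int total = 0;
            int read;
            while ((read = body.read(buffer)) >= 0)
                total += read;
            System.out.println("body bytes: " + total);
        }

        client.stop();
    }
}
```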

Note that if we use the Jetty HTTP/1 client instead, the issue can no longer be reproduced.

How to reproduce?
We can consistently reproduce this by triggering the query logic in the Solr query aggregator node, which sends 1000 concurrent requests to a single data node. Note that in this case both the Solr query aggregator node and the data node are deployed on the same machine (my MacBook); they just bind to different ports.

This is the thread dump when we use a load generator with 1 thread (i.e. the query aggregator node is allowed to finish one query before another one is sent in):
solr-jetty-1-thread-dump.txt

And the thread dump when we use the same load generator but with 3 threads (all Jetty HTTP client threads on the query aggregator node eventually hang; they start hanging at different times, around 10-15 mins apart):
solr-jetty-3-thread-dump.txt

We are trying to see if we can reproduce it with Solr removed from the equation, using only Jetty.

patsonluk added the Bug (For general bugs on Jetty side) label on Dec 1, 2021
@sbordet
Contributor

sbordet commented Dec 1, 2021

What OpenJDK vendor are you using?

This looks suspicious:

"httpShardExecutor-120-thread-7996-processing-x:fsloadtest r:127.0.0.1 http:////127.0.0.1:8983//solr//fsloadtest_shard326_replica_n638// n:127.0.0.1:8984_solr c:fsloadtest rid:1842 [http:////127.0.0.1:8983//solr//fsloadtest_shard326_replica_n638//]" #8091 prio=5 os_prio=31 cpu=361.76ms elapsed=2036.91s tid=0x00007fb3a7d8e000 nid=0x107e03 in Object.wait()  [0x000070001a7cc000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at java.lang.Object.wait([email protected]/Object.java:328)
	at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)
	- waiting to re-lock in wait() <0x00000007087175d0> (a org.eclipse.jetty.client.util.InputStreamResponseListener)

It's the first time I see a "waiting to re-lock" and "waiting on <no object reference available>", as I don't think we have a case in the Jetty code where you could actually see a "waiting to re-lock", unless you are really lucky with the timing of the thread dump.

I found this, not sure it's related:
https://bugs.openjdk.java.net/browse/JDK-8150689

If you can reproduce it, can you please enable DEBUG logging and attach the logs?

Can you try a more recent Java version (e.g. Java 17)?

@patsonluk
Author

patsonluk commented Dec 2, 2021

Thank you for your help! 😊

What OpenJDK vendor are you using?

java.vendor = AdoptOpenJDK

openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.7+10, mixed mode)

This looks suspicious: [stack trace quoted above]

I found this, not sure it's related: https://bugs.openjdk.java.net/browse/JDK-8150689

If you can reproduce it, can you please enable DEBUG logging and attach the logs?

We have tried enabling DEBUG on org.eclipse.jetty.http2, but it produces huge log files. Do you have any suggestions on which classes/packages we should enable DEBUG for?

Can you try a more recent Java version (e.g. Java 17)?

We might be able to try; the issue is that our project has a lot of dependencies and some of them might not like a newer Java version 😓

@sbordet
Contributor

sbordet commented Dec 2, 2021

As a shot in the dark, we recently fixed #7157.

Can you try the latest HEAD code for branch jetty-9.4.x?

I know the logs are huge, but if you can reproduce they might be the only way to understand what's going on.

@patsonluk
Author

For sure we will try jetty-9.4.x. Just a newbie question: I assume we would have to build that ourselves? Or is there an existing repo that publishes builds of the latest HEAD of that branch?

@joakime
Contributor

joakime commented Dec 2, 2021

@patsonluk there are snapshot builds available via the official snapshot repository at https://oss.sonatype.org/content/repositories/jetty-snapshots/

You'll want to use 9.4.45-SNAPSHOT

@patsonluk
Author

@joakime thanks! I read this late and built it myself 😓

Unfortunately, that did not seem to fix our issue:

Caused by: java.nio.channels.AsynchronousCloseException
	at org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.close(HttpConnectionOverHTTP2.java:150) ~[http2-http-client-transport-9.4.45-SNAPSHOT.jar:9.4.45-SNAPSHOT]
	at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2.onClose(HttpClientTransportOverHTTP2.java:175) ~[http2-http-client-transport-9.4.45-SNAPSHOT.jar:9.4.45-SNAPSHOT]
	at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onClose(HttpClientTransportOverHTTP2.java:234) ~[http2-http-client-transport-9.4.45-SNAPSHOT.jar:9.4.45-SNAPSHOT]

@sbordet
Contributor

sbordet commented Dec 2, 2021

The stack trace you report is different from the initial description of this issue, no?

At this point we need to be able to reproduce it ourselves, or get DEBUG logs from you, or a reproducible standalone case.

Does it fail on non-Mac? Does it fail with different Java versions?

@patsonluk
Author

patsonluk commented Dec 2, 2021

The stack trace you report is different from the initial description of this issue, no?

Sorry about the confusion. The stack trace quoted in the initial comment is from when the thread hangs, while the previous stack trace is from when the hang is interrupted, as eventually there's a "go away" event which terminates the connection.

There are 2 layers of this problem:

  1. Usually the requests should all be done within a second; it's unexpected that the client would wait for anything longer than that.
  2. If point 1 happens, and the load generator is using 1 thread, then the hang always seems to get interrupted at 120 seconds, because the connection is marked as "go away" and InputStreamResponseListener#onFailure is invoked. However, when we use 3 threads in the load generator, we start to observe weird behavior: hanging clients no longer time out at 120 seconds; in my tests it could be 10 or even 20 minutes. The strange part is that, out of the 3 threads, one thread starts hanging, then the 2nd one (possibly minutes apart), and then once the last one starts hanging as well, ALL 3 of them have onFailure invoked at almost the same time, when the last hanging thread reaches 120 seconds (while the 1st and 2nd threads have already been hanging for longer than that).

At this point we need to be able to reproduce it ourselves, or get from you DEBUG logs, or a reproducible standalone case.

Thanks! We will see if we can isolate and reproduce the problem using jetty only.

Does it fail on non-Mac? Does it fail with different Java versions?

As for Mac, we tried with OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10) and today tried with Oracle JDK 17 (/Library/Java/JavaVirtualMachines/jdk-17.0.1.jdk), and it's happening on both.

As for other OSes, our production environments are on Linux 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux, and we observe similar issues there (quick server response, but the client just hangs, sometimes for hours).

@sbordet
Contributor

sbordet commented Dec 2, 2021

@patsonluk we're interested in getting to the bottom of this issue. Let us know if you can isolate it, otherwise we'll need instructions on how to run this test with your solr fork ourselves.

@chatman

chatman commented Dec 9, 2021

@sbordet We've listed some instructions on how to reproduce this from a Solr load testing suite. https://issues.apache.org/jira/browse/SOLR-15840. We'll also attempt to isolate this.

@patsonluk
Author

patsonluk commented Dec 9, 2021

@sbordet just to add on top of what @chatman offered (which is more complete: it builds/creates Solr nodes, indexes fresh docs, and then executes queries; the querying is the part that hangs), here's a self-contained archive which already has a built Solr with data-filled nodes, plus a simple Java program inside, with which you can trigger the hang as well. The biggest issue with this program, though, is that it might take more than an hour before the issue is triggered.

https://drive.google.com/file/d/1vPjD2oQvdpuk6YXXHLzwOZpULEo-tG7A/view?usp=sharing

steps:

  1. Extract the archive into a folder and CD into it; it should contain 2 items, a solr-8.8.0 directory and qa-test.jar
  2. CD into solr-8.8.0
  3. ./bin/solr start -c -p 8983 -s example/cloud/node1/solr <- might take several minutes to start; needs Java 8+; by default it uses 4G of memory (to tweak, use -m 2g for example)
  4. ./bin/solr start -c -p 8984 -a "-DSolrQueryAggregator=true" -z localhost:9983
  5. Check whether the endpoint is ready with: curl "http://localhost:8984/solr/1000_shard/select?q=*%3A*" . Wait several minutes until it gives no error (i.e. it returns a list of 10 results). You can tail the log in <extract folder>/solr-8.8.0/example/cloud/node1/logs/solr.log to check startup progress
  6. Run the load generator by CDing back to the folder with qa-test.jar and running java -jar qa-test.jar 3 "http://localhost:8984/solr/1000_shard/select?q=*%3A*" . The number 3 means "3 concurrent request generators" (a simplified sketch of what such a generator does is shown after this list)
  7. The load generator runs indefinitely. The log should print the query count after every 10 successful queries, like pool-1-thread-1 count : 20
  8. If any threads are hanging, it should start printing lines like pool-1-thread-1 Long running one rid 360, elapsed 20008 every 10 seconds. Note that it can sometimes take more than an hour before any thread starts hanging
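For reference, a simplified, hypothetical stand-in for what qa-test.jar does (the real jar uses Apache HttpClient and also reports progress while a request is still in flight; this sketch only reports after each request completes, and the class/variable names are made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// N threads fire the same query in a loop, count successes, and flag any request that ran suspiciously long.
public class LoadGeneratorSketch
{
    public static void main(String[] args)
    {
        int threads = Integer.parseInt(args[0]);  // e.g. 3
        String url = args[1];                     // e.g. the /select?q=*%3A* URL above
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger count = new AtomicInteger();

        for (int i = 0; i < threads; i++)
        {
            pool.submit(() ->
            {
                while (true)
                {
                    long start = System.currentTimeMillis();
                    try
                    {
                        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                        if (response.statusCode() != 200)
                        {
                            System.out.println("Invalid status code " + response.statusCode());
                        }
                        else
                        {
                            int c = count.incrementAndGet();
                            if (c % 10 == 0)
                                System.out.println(Thread.currentThread().getName() + " count : " + c);
                        }
                    }
                    catch (Exception e)
                    {
                        e.printStackTrace();
                    }
                    long elapsed = System.currentTimeMillis() - start;
                    if (elapsed > 10_000)
                        System.out.println(Thread.currentThread().getName() + " long running one, elapsed " + elapsed);
                }
            });
        }
    }
}
```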

We have also tried to extract the logic from Solr and build a standalone app to reproduce the issue, but no luck with that so far.

Please let us know if you have any questions! And many thanks for the assistance!

@joakime
Contributor

joakime commented Dec 9, 2021

I'm on qa-test.jar now, letting it run on my machine. (64 cores + 128GB ram, linux, java 11)

@joakime
Contributor

joakime commented Dec 9, 2021

Is there any way to execute solr so that it's a normal Java process? (so it can be seen with jmc/jconsole/jps/jstack/etc)

After about 10 minutes I got ...

2021-12-09 15:20:21 INFO  QueryAggregatorTest:94 - pool-1-thread-1 count : 10200
2021-12-09 15:20:21 INFO  QueryAggregatorTest:94 - pool-1-thread-1 count : 10210
2021-12-09 15:20:22 INFO  QueryAggregatorTest:94 - pool-1-thread-1 count : 10220
2021-12-09 15:20:22 INFO  QueryAggregatorTest:94 - pool-1-thread-1 count : 10230
Invalid status code 500
Invalid status code 500
2021-12-09 15:20:23 INFO  QueryAggregatorTest:78 - pool-1-thread-2 Long running one rid 5799 finally finished: elapsed 119107
2021-12-09 15:20:23 INFO  QueryAggregatorTest:80 - {
  "responseHeader":{
    "zkConnected":true,
    "status":200,
    "QTime":119106,
    "params":{
      "q":"*:*",
      "rid":"5799"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException",
      "root-error-class","java.io.IOException"],
    "msg":"Error from server at null: parsing error",
    "code":200}}

Invalid status code 500
Invalid status code 500
2021-12-09 15:20:23 INFO  QueryAggregatorTest:94 - pool-1-thread-3 count : 10240
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
Invalid status code 500
...(snip thousands of lines)...
2021-12-09 15:20:39 INFO  QueryAggregatorTest:94 - pool-1-thread-3 count : 10480
[pool-1-thread-2] INFO org.apache.http.impl.execchain.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8984: The target server failed to respond
[pool-1-thread-3] INFO org.apache.http.impl.execchain.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8984: The target server failed to respond
[pool-1-thread-1] INFO org.apache.http.impl.execchain.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8984: The target server failed to respond
[pool-1-thread-3] INFO org.apache.http.impl.execchain.RetryExec - Retrying request to {}->http://localhost:8984
[pool-1-thread-1] INFO org.apache.http.impl.execchain.RetryExec - Retrying request to {}->http://localhost:8984
[pool-1-thread-2] INFO org.apache.http.impl.execchain.RetryExec - Retrying request to {}->http://localhost:8984
org.apache.http.conn.HttpHostConnectException: Connect to localhost:8984 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:394)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at org.apache.solr.benchmarks.solrcloud.QueryAggregatorTest$QueryExecutor.run(QueryAggregatorTest.java:71)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	... 14 more
org.apache.http.conn.HttpHostConnectException: Connect to localhost:8984 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:394)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at org.apache.solr.benchmarks.solrcloud.QueryAggregatorTest$QueryExecutor.run(QueryAggregatorTest.java:71)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	... 14 more
org.apache.http.conn.HttpHostConnectException: Connect to localhost:8984 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:394)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at org.apache.solr.benchmarks.solrcloud.QueryAggregatorTest$QueryExecutor.run(QueryAggregatorTest.java:71)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	... 14 more

@joakime
Contributor

joakime commented Dec 9, 2021

Yup, the JVM crashed under solr...

$ cat solr-8983-console.log 
2021-12-09 21:08:22.182 INFO  (main) [   ] o.e.j.u.log Logging initialized @836ms to org.eclipse.jetty.util.log.Slf4jLog
2021-12-09 21:08:22.388 INFO  (main) [   ] o.e.j.s.Server jetty-9.4.45-SNAPSHOT; built: 2021-12-02T15:52:11.060Z; git: d55a9173d5ec9741320ac90d3a26f5eea0e93fa2; jvm 11.0.12+7
2021-12-09 21:08:22.407 INFO  (main) [   ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:///home/joakim/code/opensource/solr-issue-7192/solr-8.8.0/server/contexts/] at interval 0
2021-12-09 21:08:22.592 INFO  (main) [   ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
2021-12-09 21:08:22.600 INFO  (main) [   ] o.e.j.s.session DefaultSessionIdManager workerName=node0
2021-12-09 21:08:22.601 INFO  (main) [   ] o.e.j.s.session No SessionScavenger set, using defaults
2021-12-09 21:08:22.602 INFO  (main) [   ] o.e.j.s.session node0 Scavenging every 600000ms
2021-12-09 21:08:22.649 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory
2021-12-09 21:08:22.653 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 8.8.1-SNAPSHOT 681baac6f489a57c3fe64bf35dcfd513ab39b4e6 - patson - 2021-12-06 10:05:16
2021-12-09 21:08:22.653 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in cloud mode on port 8983
2021-12-09 21:08:22.653 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /home/joakim/code/opensource/solr-issue-7192/solr-8.8.0
2021-12-09 21:08:22.653 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2021-12-09T21:08:22.653452Z
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f40acdb86ec, pid=948630, tid=948717
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.12+7 (11.0.12+7) (build 11.0.12+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.12+7 (11.0.12+7, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x65e6ec]  oopDesc::size_given_klass(Klass*)+0x2c
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/joakim/code/opensource/solr-issue-7192/solr-8.8.0/server/core.948630)
#
# An error report file with more information is saved as:
# /home/joakim/code/opensource/solr-issue-7192/solr-8.8.0/server/hs_err_pid948630.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

@patsonluk
Author

patsonluk commented Dec 10, 2021

They eventually run as normal Java processes, so you should be able to see them in jconsole/jstack?

I usually just ps ax | grep 8983 or ps ax | grep 8984 to find the 2 Solr Java processes (8983/8984 are the ports they listen on; 8983 is the "data node", 8984 is the "query aggregator node" where the hang happens).

You shouldn't be getting Invalid status code 500 unless something went wrong on the Solr side (queue size exceeded, out of memory, etc.). I know on my local stack, if I go > 3 threads, it will hit some queue size limit. Can you see what's in the log file?

You can find the log file location with ps ax | grep 8984 and look for the phrase -Dsolr.log.dir=...; then look at solr.log under that directory.

@sbordet
Contributor

sbordet commented Dec 10, 2021

@chatman @patsonluk we tried to reproduce the problem with your instructions.

We were able to get a failure, so we instrumented the client to understand why it was waiting, and found that it was waiting on a request whose response headers arrived fine (with a 200 status code), but whose response body never arrived (and no errors were reported).

We tried to track this request on the server logs, but we could not find any information.
Any attempt to increase the log level to DEBUG makes the failure non-reproducible, so we are a bit stuck trying to understand what's going on.

From our point of view, the client received response headers and it's waiting for the content, but it never arrives.
We cannot explain how the response headers arrived but not the content, and we don't have information about the server side (whether there was an error on the server or similar), because the logs don't show any error.

At this point I'm throwing the ball in your court: are you sure that you are sending the response content?
I know you said that for HTTP/1 you did not see the error, but maybe the error is there too and it's not dependent on the protocol?

Can you enable JMX on the node server and take a Server.dump() when the issue happens?
(On a side note, how do I turn on JMX support on the node and aggregator servers?)
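(In plain embedded Jetty, the wiring for that is roughly the sketch below; I don't know where the equivalent hook lives in the Solr start scripts, hence the question. The port and setup are placeholders.)

```java
import java.lang.management.ManagementFactory;

import org.eclipse.jetty.jmx.MBeanContainer;
import org.eclipse.jetty.server.Server;

public class JmxDumpSketch
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server(8983);

        // Expose Jetty's internals (connectors, sessions, thread pools) as MBeans so that
        // Server.dump() can be invoked from jconsole/jmc while the hang is happening.
        MBeanContainer mbeanContainer = new MBeanContainer(ManagementFactory.getPlatformMBeanServer());
        server.addBean(mbeanContainer);

        server.start();

        // The same detailed state can also be printed directly from code:
        System.err.println(server.dump());
    }
}
```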

If it's easy for you to reproduce, can you take a tcpdump capture? In this way we would know if the server actually sent the response content or not. I assume there is a way to correlate the request that failed with the HTTP/2 frames (in particular, make the request URI unique).

The tactic would be as follows:
If we can track the tcpdump and see that the HTTP/2 frames reach the client, then it's a Jetty client problem;
otherwise the server did not send the HTTP/2 frames, and this could be either a Jetty server problem or the application not sending any data.
In this way we're trying to narrow down the problem.

Any suggestion welcome.

@sbordet
Contributor

sbordet commented Dec 10, 2021

As a side note, in your setup there are many Jetty jars from 9.4.34, some from 9.4.44-SNAPSHOT and some from 9.4.45-SNAPSHOT.
I know it's for testing and all, but please make sure that in an official Solr distro they are all aligned to a single Jetty version.

@sbordet
Contributor

sbordet commented Dec 10, 2021

As a side note, we have filed #7259.

It does not solve this issue, but it will avoid the infinite wait on the client.

@patsonluk
Author

patsonluk commented Dec 11, 2021

@sbordet Thanks for the updates!

Here is the thread dump on data node when my single thread generator hangs (on macbook):
data-node-hang-dump.txt

And also another thread dump when the data node is serving the single thread generator normally:
data-node-normal-dump.txt

As for tcpdump, it's very tricky since there is a huge number of packets going back and forth. In this load test, even though the load generator is single threaded, once a query reaches the query aggregator it is split into 1000 requests (HTTP/2 calls) simultaneously submitted by the QA node to the data node. Therefore even if we try tcpdump, it might be hard to sift through the data (I tried briefly but was overwhelmed by the amount of data).

Additionally, we added some debug logging on the Solr data node to ensure that the hanging thread on the query aggregator node side (i.e. the client, which makes 1000 calls to a Solr "shard") has its corresponding handling completed on the data shard side. Basically, we added some extra debug in this https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/response/BinaryResponseWriter.java#L57 call, which writes using https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L172, which eventually calls Jetty's org.eclipse.jetty.server.HttpOutput#write. It's confirmed that, at least at a high level, the shard handling on the server side has completed the BinaryResponseWriter#marshal call, which writes (and flushes) the response. But of course this is only a high-level call; things could still be "stuck" at a lower level, which would prevent some data frames from returning to the client.
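The instrumentation is roughly of the following shape (a hypothetical sketch rather than our exact patch; the class and field names are made up): the response OutputStream is wrapped so that the writes and the final flush performed during marshal are logged together with the request id.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the instrumentation: every write/flush that JavaBinCodec pushes toward
// Jetty's HttpOutput is counted and logged with the rid, so a "completed marshal" on the server
// can later be matched against a hanging read on the client.
public class LoggingOutputStream extends FilterOutputStream
{
    private final String rid;
    private long bytesWritten;

    public LoggingOutputStream(OutputStream out, String rid)
    {
        super(out);
        this.rid = rid;
    }

    @Override
    public void write(int b) throws IOException
    {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException
    {
        out.write(b, off, len);
        bytesWritten += len;
    }

    @Override
    public void flush() throws IOException
    {
        out.flush();
        System.err.println("rid=" + rid + " flushed after " + bytesWritten + " bytes");
    }
}
```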

Are there any other ways that we can debug with some context? I also tried enabling Jetty DEBUG on the data node, but it generated so much I/O load that it became hard to reproduce the issue. Are there any particular classes we could turn DEBUG on for that might help (i.e. reduce the number of log lines generated)?
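To make the question concrete, the kind of narrowing we have in mind would look roughly like this (a sketch using Log4j2's programmatic API, since Solr ships with Log4j2; the specific logger names are only our guesses, not a recommendation):

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

// Sketch: instead of DEBUG on all of org.eclipse.jetty.http2, enable it only for a few classes,
// to keep the log volume (and I/O load) manageable. The chosen loggers below are guesses.
public class NarrowDebugSketch
{
    public static void main(String[] args)
    {
        Configurator.setLevel("org.eclipse.jetty.http2.HTTP2Session", Level.DEBUG);
        Configurator.setLevel("org.eclipse.jetty.http2.BufferingFlowControlStrategy", Level.DEBUG);
        Configurator.setLevel("org.eclipse.jetty.client.util.InputStreamResponseListener", Level.DEBUG);
    }
}
```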

Many thanks for your help again!!

@patsonluk
Author

Adding a tcpdump from when the hang happens. I used sudo tcpdump -i lo0 | grep localhost.8983 to capture traffic to port 8983 (which the data node listens on). Are there any extra flags we could use to add more diagnostic info? Sorry, I am very new to tcpdump 😓

This is the dump (I only sent the last 5000 lines, otherwise the file is way too big, 1G+)

tcp-dump-tail.txt

The hang starts happening at around 17:47:33, after which we can see significantly less traffic to the data node (there could still be other health check/metrics requests etc. going in). After 120 secs, at 17:49:33, the connection times out and the load generator was stopped (and we stopped the tcpdump too).

@patsonluk
Author

@sbordet any updates please? 😊

@sbordet
Contributor

sbordet commented Jan 6, 2022

@patsonluk the traffic dump is not good.

Please use:

tcpdump -i lo0 -w /tmp/output.pcap tcp port 8983

@joakime
Contributor

joakime commented Sep 14, 2022

The new issue #8558 shows the details we were missing.
I'm closing this specific issue; see #8558 for the effort to fix it.

@joakime joakime closed this as completed Sep 14, 2022