Response writing fails to complete with WebFlux on Tomcat #26434
Thanks for the detailed report. We're currently not handling … I've made a change in …
Thanks for the explanation! I ran the demo scenario again, and I no longer see response timeouts associated with the flushingFailed case - but I have noticed a new timeout case. The trace logs look like this:
In this case, what I note is that this line occurs:
Apparently while in the …
Thanks for testing. I'll have another look.
Indeed, and that call should be made when the "current" … From the output of the current write processor, it looks like it writes its single item and requests another, then receives onComplete and switches back to RECEIVED, but gets "false" from …
So at this point it's waiting on a … Any tips on how to reproduce this one?
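For context, the container contract under discussion here is the Servlet 3.1 non-blocking write API: once `ServletOutputStream.isReady()` returns `false`, nothing may be written until the container calls `WriteListener.onWritePossible()` again. Below is a minimal sketch of that contract (illustrative only; the class and the chunk queue are made-up names, and this is not Spring's actual write processor):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Queue;
import javax.servlet.ServletOutputStream;
import javax.servlet.WriteListener;

// Illustrative WriteListener: write queued chunks only while isReady() is true;
// when isReady() returns false, stop and wait for the container's next onWritePossible().
public class ChunkWriteListener implements WriteListener {

    private final ServletOutputStream out;
    private final Queue<String> chunks;

    public ChunkWriteListener(ServletOutputStream out, Queue<String> chunks) {
        this.out = out;
        this.chunks = chunks;
    }

    @Override
    public void onWritePossible() throws IOException {
        while (out.isReady()) {
            String chunk = chunks.poll();
            if (chunk == null) {
                return; // nothing left to write right now
            }
            out.write(chunk.getBytes(StandardCharsets.UTF_8));
        }
        // isReady() returned false: if the container never calls onWritePossible()
        // again (the behaviour suspected in this issue), writing stalls here.
    }

    @Override
    public void onError(Throwable t) {
        // The container reports write failures (e.g. client disconnects) here.
    }
}
```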
I seem to be able to reliably get reproductions of this on my machine when running the steps described in my original report on this issue. The one difference is that I have the following includeBuild clause in the settings.gradle of the demo project:
… which is a directory with a local clone of spring-framework checked out at tag …
Assuming that it is not reproducing as easily for you, is there anywhere in particular where you think it would be interesting for me to add additional trace logging to gather more information? I should have a bit of time tomorrow to look at this again.
After the last fix I wasn't seeing timeouts any more, and in your latest log I saw a 3-second delay for the handling, so I thought there might be some different instructions. Today I could get several timeouts waiting on a container callback; sometimes it's the "current processor" waiting for …
If the flush had completed we should see … It occurred to me it might be related to the load, so I increased the timeout to 30 seconds but still got the timeouts. The other thing is that I changed the controller to use fixed values:

```java
// final int delayMillis = ThreadLocalRandom.current().nextInt(10) + latencyBase;
// final int intListSize = ThreadLocalRandom.current().nextInt(MAX_INT_LIST_SIZE);
final int delayMillis = 10 + latencyBase;
final int intListSize = MAX_INT_LIST_SIZE;
```

Now I can no longer reproduce it.
Yeah, I too was starting to wonder about the possibility of just unlucky scheduling causing occasional high-latency responses under load. However, I've done a bunch more experimenting and I do still think there is some issue in play here. I similarly hard-coded the delay and response size (though the intent of the …
This trace was obtained running the demo as in this feature branch (but also specifying the local spring-framework project with your original fix applied in …). With the …
I did see some errors like the following in the logs as well, but they are apparently not time correlated with the response timeouts - so I don't think they are related:
I've changed to the exact same settings from your latest commit. I can see timeouts, but all are waiting on a write callback from Tomcat or a flush that's stuck. wrk is called with … If I remove the … So perhaps connections getting closed from the client side lead to this under load. From what I can see, though, this is in Tomcat and I'm not sure what we can do other than perhaps set a timeout.

@markt-asf, I'm wondering if you have any comments on this one? Basically WebFlux on Tomcat as in #26407, under load, with some connections getting closed from the client side due to a timeout. We're seeing that either onWritePossible never comes or a flush never completes.
Possibly an instance of https://bz.apache.org/bugzilla/show_bug.cgi?id=65001 ? Updating to Spring Boot 2.4.3 should pick up Tomcat 9.0.43 and the fix for that issue. Better would be to pick up the latest Tomcat 9.0.x from the ASF snapshot repo and test that, as there is a potential fix for #26407 in the current 9.0.x dev code.
I have tried with 9.0-SNAPSHOT and it didn't help. It's probably something else. The case here is also related to clients disconnecting. When the …
Do you have a WAR file I can use to reproduce this on Tomcat?
This is the sample. You can add … For the load:
and the myPost.lua script:
The use of ThreadLocalRandom in the controller seems to make matters worse. Arguably, code that can block shouldn't be used in non-blocking handling. Here is a branch where this is replaced with non-random, fixed values.
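On the point that code which can block shouldn't run in non-blocking handling: if a handler genuinely must call something blocking, the usual Reactor approach is to shift that call off the event-loop threads. A small, hypothetical sketch (not the demo's code; the class name and `latencyBase` parameter are just for illustration):

```java
import java.util.concurrent.ThreadLocalRandom;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class DelaySource {

    // Wrap the call in a Callable and subscribe on boundedElastic so that
    // Tomcat's and Reactor's non-blocking threads are never tied up by it.
    public Mono<Integer> nextDelayMillis(int latencyBase) {
        return Mono.fromCallable(() -> ThreadLocalRandom.current().nextInt(10) + latencyBase)
                .subscribeOn(Schedulers.boundedElastic());
    }
}
```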
All I get if I do that is 404 responses. Looking in JMX, it does not appear that any application Servlets or Filters are configured.
Sorry, my instructions were incomplete. I don't often run Boot apps as a war and I'm having trouble with it. Let me ask, would it work for you to run it in embedded mode, via …
I need to be able to run the test case using my local Tomcat build and to be able to connect my IDE to perform remote debugging. Running as a Java class in the IDE should allow me to do both of those (with a few tweaks to the project configuration) if generating a WAR is proving tricky.
@rstoyanchev Sorry for misleading you earlier today - Spring Boot doesn't support WAR packaging for WebFlux applications. So this is not a typo or misconfiguration on your side; it's just something we don't support.
I have this running locally now. Not much luck reproducing any errors though. Some idea of how often I should expect to see errors for a given load command would be helpful.
FWIW, on my machine I was typically able to reproduce the issue multiple times within a minute or so on the … for which I posted the trace above.
In NioEndpoint a condition was added. The following IOException is thrown to detect this issue:
My issue, or rather my worries, are the following: I receive this message ~20 times a day in my production environment (CentOS 7, Java 17). I have been seeing this since an update one week ago. Currently there is only a light workload on the system, but I am worried that next week (after Christmas) this bug will become a big issue for me. I do not understand exactly what the problem is. I have a Vaadin web application (version 21) and the following questions:
Thanks in advance for the information!
The error occurs when the operating system reports the same client connection being accepted twice. This typically happens under load but, since the root cause is not known, we have to assume it could happen with any accept. With the new error handling in place, the front-end should not see any side-effects of this. The first accept is handled as normal and the second is simply ignored. This avoids the sort of problems described previously in this issue. Evidence collected so far indicates that the bug dates back at least 3 years and has probably been around longer than that.

If you see this issue on Ubuntu then, as the log message indicates, please update the Ubuntu bug. The more people that report they are affected, the more likely the bug is to get some attention. If you see this issue on a different OS then:
Thanks for the explanation! Bug report for CentOS: https://bugs.centos.org/view.php?id=18383
This is probably not the best place to put my ideas, but I'm hesitant to register on the JDK bug system and Launchpad, sorry. ☺
As I see it, the … Now we have … I don't see it as surprising that, when the kernel has 6000 of 28232 options consumed, it sometimes decides to reuse a recently freed port number. If it chooses a free port number uniformly at random, the phenomenon should occur roughly once per 22232 requests (28232 − 6000 = 22232 free ports, so any particular recently freed port is picked with probability 1/22232), which isn't too rare.
P.S. @markt-asf, I came to this thread after my colleague found that, when configured to listen on UDS, Tomcat rejects all but the first … Shrinking the …
That might be a way to trigger a false positive but it misses the point. When the real issue occurs you get:
This isn't a clash with a recently freed port.
Getting this error constantly on Alpine; it started right after we migrated from Spring Boot 2.3.7 to 2.6.2.
We found that this issue occurred for our projects when we upgraded from Spring Boot version 1.5.9.RELEASE to 2.6.2. Does anyone know what the most recent version of Spring Boot is that is not affected by this?
The latest Spring Boot version (2.6.3) is still using Tomcat 9.0.56. If you upgrade the Tomcat version externally, the problem seems to be fixed. Update: the Tomcat upgrade is not working; after a while the same error occurred again.
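When checking whether an external Tomcat upgrade actually took effect, it can help to log the version Tomcat itself reports at runtime rather than trusting the dependency declaration. A quick sketch (the class name here is made up; it assumes the embedded Tomcat jars are on the classpath):

```java
import org.apache.catalina.util.ServerInfo;

public class TomcatVersionCheck {

    public static void main(String[] args) {
        // Prints e.g. "Apache Tomcat/9.0.56" for the Tomcat actually resolved on the classpath.
        System.out.println(ServerInfo.getServerInfo());
    }
}
```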
I just tried Spring Boot 2.3.7.RELEASE and the same issue arose during load testing. Does anyone actually know 100% what the issue is? If the above comment by nesrinyesilkaya is correct and upgrading to the latest version of Tomcat fixes the issue, it doesn't sound like an OS issue (unless the latest Tomcat code has made changes to not use certain OS APIs).
To summarise the current status... The original issue had multiple root causes. Bugs were fixed in Spring and Tomcat. There was also an OS (Linux) bug identified that remains unfixed. If you see this issue reported in the Tomcat logs then the best thing you can do is:
As of Tomcat 10.1.0-M8, 10.0.14, 9.0.56 and 8.5.74, Tomcat will both log a warning if the issue is detected and take steps to ensure that the bug does not impact the running of the application.

The Tomcat team is aware of some scenarios that will falsely trigger the detection of this bug. The current Tomcat release round (10.1.0-M11, 10.0.17, 9.0.59 and 8.5.76) will include improved detection that avoids those false positives. That is why it is important to confirm you are seeing this issue with the test case attached to the Ubuntu bug report before reporting it to your OS vendor.

It is possible that other issues will have similar symptoms. As usual, test with the latest stable releases of Spring and Tomcat and, if you still see the problem, open an issue and provide a test case that demonstrates it and it will get investigated.
This commit ensures handling is cancelled in case of an onError/onTimeout callback from the Servlet container. Separately we detect the same in ServletServerHttpRequest and ServletServerHttpResponse, which signal onError to the read publisher and cancel writing, but if the onError/onTimeout arrives after reading is done and before writing has started (e.g. longer handling), then neither will reach handling. See gh-26434, gh-26407.
Previously we registered three AsyncListeners: from the request, from the response, and from the Servlet adapter. After this change, only the Servlet adapter registers a listener and the others are delegated to. This consolidates the handling of AsyncListener events so that it's easier to discover, trace, and enforce the order of handling. See gh-26434.
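The consolidation described in that commit message can be pictured as a single listener registered with the AsyncContext that fans events out to delegates in a fixed order. A rough illustration only (the class name is made up; this is not the actual Spring Framework code):

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.servlet.AsyncEvent;
import javax.servlet.AsyncListener;

// One listener registered with the container; request- and response-level listeners
// are added as delegates so the order of event handling lives in one place.
public class DelegatingAsyncListener implements AsyncListener {

    private final List<AsyncListener> delegates = new CopyOnWriteArrayList<>();

    public void addDelegate(AsyncListener listener) {
        this.delegates.add(listener);
    }

    @Override
    public void onStartAsync(AsyncEvent event) throws IOException {
        for (AsyncListener delegate : this.delegates) {
            delegate.onStartAsync(event);
        }
    }

    @Override
    public void onTimeout(AsyncEvent event) throws IOException {
        for (AsyncListener delegate : this.delegates) {
            delegate.onTimeout(event);
        }
    }

    @Override
    public void onError(AsyncEvent event) throws IOException {
        for (AsyncListener delegate : this.delegates) {
            delegate.onError(event);
        }
    }

    @Override
    public void onComplete(AsyncEvent event) throws IOException {
        for (AsyncListener delegate : this.delegates) {
            delegate.onComplete(event);
        }
    }
}
```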
With Tomcat 9.0.58 I still saw this error.
I believe the OS bug is fixed starting in kernel 5.10 (specifically, starting in 5.10-rc6). The bug reproduces in kernel 5.10-rc4 but not in 5.10-rc6. I'm fairly sure that the fix comes from this commit, which describes a race condition in the Linux TCP stack where two duplicate sockets will be created in the established-connections hashtable.
With Spring Boot 2.5.8 and Tomcat 9.0.59 it works for me.
I also encountered this problem on Ubuntu 22.04.
Affects: v5.3.3
The symptom of this issue is similar to previous issue #23096 (response handling never completes) though the underlying cause is different.
I've reproduced this issue with a small demo Spring Boot project that uses WebFlux on Tomcat, seen here: https://github.com/danielra/webflux_race_investigation
The project can be built with `./gradlew clean build` and then run via `./start.sh`. To reproduce the problem, I've been successful using wrk (https://github.com/wg/wrk) to throw some load at the service, stop, and then throw some more, etc. For example:
Where ./myPost.lua looks like:
(I don't think this being a POST request really ended up being relevant, but I started here because I wanted the more isolated repro to be somewhat similar to the real system that was experiencing the problem.)
This will result in a lot of trace-level logging under a `logs` directory in the demo project. This can be searched through to find any occurrences like: … Then another search for the logPrefix found in the resulting log entries can be performed to gain more context on the processing for the relevant request.
This is an example set of log entries for a repro case (with irrelevant line prefixes omitted):
Here the line

```
12:31:47,781 TRACE _.s.h.s.r.AbstractListenerWriteFlushProcessor [http-nio-8080-exec-348] [33d6780] RECEIVED flushingFailed
```

comes from a modification I made locally to spring-framework to add an additional trace log line here: https://github.com/spring-projects/spring-framework/blob/v5.3.3/spring-web/src/main/java/org/springframework/http/server/reactive/AbstractListenerWriteFlushProcessor.java#L194-L195, like so:
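The exact snippet didn't survive in this extract; the following is a reconstruction of what such a modification to the existing `flushingFailed` method looks like against the 5.3.x sources (the `rsWriteFlushLogger`, `state`, and `getLogPrefix()` members exist in `AbstractListenerWriteFlushProcessor`, but treat the precise code, and the paraphrased comment, as an approximation rather than the original diff):

```java
/**
 * Invoked when an error happens while flushing. (Paraphrasing the existing comment:
 * subclasses may ignore this if the underlying API will deliver an error
 * notification in a container thread; the default implementation is a no-op.)
 */
protected void flushingFailed(Throwable t) {
    // Added trace line, producing output such as: [33d6780] RECEIVED flushingFailed
    if (rsWriteFlushLogger.isTraceEnabled()) {
        rsWriteFlushLogger.trace(getLogPrefix() + this.state.get() + " flushingFailed");
    }
}
```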
For reference, the relevant Throwable logged at these lines in these reproductions with this demo project looks like:
It appears to me to be the result of the client hanging up the connection before response handling completes on the server.
After this flushingFailed log line, we can see a relatively large (~12 second) gap before the next log line, which was triggered by the response timeout that is set up in the DemoWebFilter in the demo project here: https://github.com/danielra/webflux_race_investigation/blob/main/src/main/java/com/example/demo/filter/DemoWebFilter.java#L38. Only after this timeout does `ServletHttpHandlerAdapter` log a line indicating that handling is complete.

I note that the `flushingFailed` method where I added the extra log line above has a comment which reads: …

Based on this description and the observed behavior, it seems to me that perhaps a notification of the error was expected elsewhere, but it was never actually received, so handling of the response is never completed (without a timeout present). I don't know enough yet to say whether that was due to a bug in the code that should have emitted the error notification or on the listening side - or whether there should actually be a real, non-no-op implementation of flushingFailed in the relevant subclass in the Tomcat case. However, my expectation is that regardless of when a client hangs up the connection, the server should proceed with completing response handling.
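For reference, the kind of response-timeout filter mentioned above can be set up roughly as follows. The class name, logging, and 12-second value here are assumptions for illustration; the DemoWebFilter source linked earlier is authoritative:

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

@Component
public class ResponseTimeoutWebFilter implements WebFilter {

    private static final Logger logger = LoggerFactory.getLogger(ResponseTimeoutWebFilter.class);

    private static final Duration TIMEOUT = Duration.ofSeconds(12);

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        // Error the exchange if downstream handling (including response writing)
        // does not complete within the timeout, so stuck responses show up in the logs.
        return chain.filter(exchange)
                .timeout(TIMEOUT)
                .doOnError(TimeoutException.class, ex ->
                        logger.warn("Response handling timed out: {}", exchange.getRequest().getPath()));
    }
}
```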
Please let me know if any additional information would be helpful. Thank you for your time!