gracefulClose stops servers due to a lot of TCP states #438
Please let me know what additional data would be helpful to diagnose the issue. I can provide larger TCP dumps and/or detailed netstat info if needed. I can also provide direct ssh access to affected machines, if that's helpful for understanding or solving the issue.
Sorry for the delay. This is probably due to … Even if the problem continues, please check whether HTTP/2 connections are used. If so, please use …
We are definitely using the newest …
This article may help your understanding: https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939
@larskuhtz Would you close this issue if it is already resolved?
I had the same experience. I will try to fix it.
Thank you! Should the …
Yes. I will do so.
@snoyberg I'm CC'ing you here since the current approach of … TCP and server: Warp: when TCP FIN is received, … https://github.com/haskell/network/blob/master/Network/Socket/Shutdown.hs#L61 I suspect two things: …
But I don't have any clues so far. Could you suggest anything? Note that …
It might be wise if we call …
I think it was actually @nh2 who proposed the implementation of …
Approach 4 in this article (https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939) was proposed by you. :-)
Fair enough :) What I mean is that I don't really know the details of the underlying TCP states, and don't have much insight into that side of the equation.
If I can't find a fix for Approach 4 and Approach 3 solves this issue, I will drop Approach 4 and use Approach 3.
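For readers following along, Approach 3 from the article above is, roughly, "send our FIN, then drain the peer in a forked thread with a timeout before closing". A minimal sketch, assuming the network and bytestring packages; the function and parameter names are mine, not the actual gracefulClose implementation:

```haskell
import Control.Concurrent (forkIO)
import Control.Monad (unless, void)
import System.Timeout (timeout)
import Network.Socket (Socket, ShutdownCmd (ShutdownSend), close, shutdown)
import Network.Socket.ByteString (recv)
import qualified Data.ByteString as BS

-- Approach 3 (sketch): send our FIN, then spend a dedicated thread waiting
-- (bounded by a timeout) for the peer's FIN before releasing the fd.
gracefulCloseSketch :: Socket -> Int -> IO ()
gracefulCloseSketch sock micros = void . forkIO $ do
    shutdown sock ShutdownSend     -- half-close: we send FIN, peer may still send
    _ <- timeout micros drainLoop  -- wait for the peer's FIN, but not forever
    close sock                     -- always close, releasing the fd
  where
    drainLoop = do
        bs <- recv sock 4096       -- recv returns "" once the peer's FIN arrives
        unless (BS.null bs) drainLoop
```

The cost of this approach is one forked thread per closing connection, which is exactly the overhead the callback-based Approach 4 tries to avoid.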
@larskuhtz might also have some thoughts tomorrow.
I haven't gotten around to taking a closer look at this for a while. I remember looking at the graceful close code a while ago and didn't spot anything suspicious. But I am not an expert. I am happy to test changes. It usually takes a few days to reproduce the issue.
Looking at the code again, I wonder whether the calls … Although, I would expect that the callbacks would get canceled, since …
@larskuhtz I'm going to give up on the race approach and would like to switch to … Would you test https://github.com/kazu-yamamoto/network/tree/next-version?
The related PR in which … Original issue: yesodweb/wai#673
I think that's the key question we need to answer, probably before changing anything.
That's a good point, but it seems to me it cannot affect whether … (network/Network/Socket/Shutdown.hs, line 61 at d565d9e)
@larskuhtz What was the answer to that? Can you confirm which protocol is used?
If you send a lot of requests (e.g. with a script), does it reproduce faster?
@kazu-yamamoto Another potentially related question: For non-threaded, … But for threaded, … I suspect it is a bug, because it means that after …
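For context on the callback mechanism under discussion: GHC's system timer manager (GHC.Event in base, available with the threaded RTS on Unix) lets you register a plain IO () callback instead of forking a thread. A minimal sketch of the mechanism, not Warp's or network's actual code:

```haskell
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import GHC.Event (getSystemTimerManager, registerTimeout)

main :: IO ()
main = do
    done <- newEmptyMVar
    mgr  <- getSystemTimerManager
    -- The callback runs on the timer manager's own loop: if it blocks,
    -- every other pending timeout is delayed (the hazard discussed above).
    _key <- registerTimeout mgr 100000 (putMVar done ())  -- delay in microseconds
    takeMVar done                                         -- wait for the callback
```

registerTimeout also returns a TimeoutKey that can be passed to unregisterTimeout to cancel the callback, which is what the cancellation questions in this thread hinge on.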
Yes.
Possibly. But this bug is not related to this issue.
@nh2 If we keep the race approach, we should do: …
@larskuhtz I'm also testing the no-race version on https://mew.org/ where I saw the same issue.
It's been some time since we last ran binaries with the affected version of the network package. We don't keep telemetry data long enough to answer that question now. Most connections are made via clients using … It is a P2P network and our software acts both as a server (using warp-tls) and a client (using http-client-tls) at the same time. So the issue may also involve client-side behavior.
Not sure. Generally, our network already has a relatively high baseline load. I'll try to find some time tomorrow to build and deploy a node with a recent network version and try it out.
Could the following observation be a clue:
Not sure if that means the sockets aren't associated with the application any more. I wasn't aware that sockets in …
Yes. This issue can be reproduced on mew.org.
@snoyberg and I looked at this in detail and we think that this argument is right. If … We are not quite sure if the … But none of this could affect whether …
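On the cancellation question: Control.Concurrent.Async's race is documented to cancel (throw an async exception to) whichever action loses, so a drain raced against a delay should not leave a live thread behind. A hedged sketch of the race-style close being discussed, assuming the async, network, and bytestring packages; all names are mine:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race_)
import Control.Exception (finally)
import Control.Monad (unless)
import Network.Socket (Socket, ShutdownCmd (ShutdownSend), close, shutdown)
import Network.Socket.ByteString (recv)
import qualified Data.ByteString as BS

-- Race-style graceful close (sketch): whichever of the delay and the drain
-- finishes first wins; race_ cancels the loser, and the socket is closed
-- either way thanks to 'finally'.
raceCloseSketch :: Socket -> Int -> IO ()
raceCloseSketch sock micros = do
    shutdown sock ShutdownSend
    race_ (threadDelay micros) drainLoop `finally` close sock
  where
    drainLoop = do
        bs <- recv sock 4096
        unless (BS.null bs) drainLoop
```

Note this still forks (race_ forks both actions internally), so it does not by itself deliver the thread-free property the callback approach was aiming for.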
@nh2 @larskuhtz Please take a look at f3da242. I will also test this in the real world.
It makes sense to me that the GHC runtime becomes inactive when a lot of callbacks remain. Many sockets are probably already closed. Some other sockets remain in …
In my field test, the modified approach 4 creates never-ending …
The timer (0.00/0/0) never increases. I will probably give up on approach 4.
Now I have many never-ending …
I could confirm such behaviour in our case with …
I have captured a small subset of TCP packets. I think I should capture for a bit longer and provide another example with TCP sessions for …
For the (…
@swamp-agr Do you use HTTP/1.1 on Linux? If you specify 0 to …
Hi @kazu-yamamoto,
If you ask, I will move this case to a separate issue.
@swamp-agr If you use Linux, please check …
Note that …
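The exact parameters meant above are elided in this thread, but the Linux tunables usually inspected for stuck FIN_WAIT_2 / orphaned sockets are, for example, the following (the values shown are common defaults, not recommendations):

```
# Linux TCP tunables often checked for FIN_WAIT_2 / orphan problems,
# e.g. in /etc/sysctl.d/ or via `sysctl -w`
net.ipv4.tcp_fin_timeout = 60     # seconds an orphaned socket stays in FIN_WAIT_2
net.ipv4.tcp_max_orphans = 65536  # cap on sockets not attached to any process
```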
Issue reproduced even with these values: …
@swamp-agr If you believe this is a bug in Warp, please file this issue against Warp.
With a constant RPS, the server accumulates stalled sockets. Expected result: the application handler should finish its job, and Warp should respond with … You might close the issue.
@swamp-agr I am closing this issue. Please bring it to Warp.
I'm now hitting exactly that problem. I asked a question, and provided an answer, on how and why these process-less, FD-less … sockets exist.
The answer is: …
I haven't yet figured out why my warp application stops …
@nh2 Thank you for bringing this answer! And now I think I can answer your question. Originally, we tried to use the callback approach to avoid forking a new thread.
@kazu-yamamoto Just for me to get back into context: where are those callbacks? Is this something that was recently changed, or that you plan to change (e.g. an open or already-closed issue or PR)? Because I'm still currently investigating what to do about those blocked accepts. Your explanation seems to fit my symptoms ("the entire loops of the IO and Timer manager are blocked"), because the process really seems to stop doing almost everything for a while, which is not very good for my web server when it happens :D
I guess that you are talking about the graceful close. See approach 3 in https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939 |
@kazu-yamamoto Because my server is suffering from 3000 …
@nh2 Understood. |
An update on this: my application was calling … This of course caused my process to stop calling any function, including … You can read more about it here: … It was difficult to figure out because … Also scrutinise any libraries for … This does not imply that there are no further bugs in …
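The specific call is elided above, so the following is a hedged illustration only: one classic way a single call stops a Haskell process from calling any function, including accept, is a blocking foreign import marked unsafe, which pins the capability (and, on the non-threaded RTS, the whole program) until it returns:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CUInt (..))

-- 'unsafe' foreign calls cannot be interrupted and do not release the
-- capability: while c_sleep_unsafe runs, no other Haskell thread on that
-- capability (including IO-manager work) makes progress.
foreign import ccall unsafe "unistd.h sleep" c_sleep_unsafe :: CUInt -> IO CUInt

-- 'safe' calls release the capability for their duration, so the rest of
-- the runtime (accept loops, timer manager) keeps running.
foreign import ccall safe "unistd.h sleep" c_sleep_safe :: CUInt -> IO CUInt
```

Long-running safe calls are generally fine; long-running unsafe calls (or tight non-allocating loops) are the kind of thing worth scrutinising in dependencies when a process stops servicing its sockets.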
We run a p2p network with Haskell nodes using network + tls + warp for the server and network + tls + http-client for the client components.

We observed that nodes using network version < 3.1.1.0 have been running without issues for weeks, while nodes using network >= 3.1.1.0 stop making and serving requests after running for a few days. Bad nodes don't accept any incoming connections and fail to establish outgoing connections.

On the bad nodes there is no increase in memory consumption and CPU usage is low, since they are not doing anything useful without being able to make network connections. The number of open file descriptors is moderate, but many of the TCP sockets are in a CLOSE_WAIT state. Most of those sockets are not listed by lsof, but are only shown by netstat without an associated process.

The following are two typical TCP sessions:
HTTP TCP sessions from other processes seem fine.