gracefulClose stops servers due to a lot of TCP states #438

Closed
larskuhtz opened this issue Feb 13, 2020 · 53 comments
@larskuhtz

We run a p2p network with Haskell nodes using network + tls + warp for the server and network + tls + http-client for the client components.

We observed that nodes using network < 3.1.1.0 have been running without issues for weeks, while nodes using network >= 3.1.1.0 stop making and serving requests after running for a few days.

Bad nodes don't accept any incoming connections and fail to establish outgoing connections.

On the bad nodes there is no increase in memory consumption and CPU usage is low, since they are not doing anything useful once they can no longer make network connections. The number of open file descriptors is moderate, but many of the TCP sockets are in a CLOSE_WAIT state. Most of those sockets are not listed by lsof, but are only shown by netstat without an associated process.

The following are two typical TCP sessions:

No.	Time	Source	Destination	Protocol	Length	Info
28	2.100345	172.31.20.24	47.108.144.139	TCP	66	41372 → 9443 [FIN, ACK] Seq=1 Ack=1 Win=211 Len=0 TSval=2546593558 TSecr=1766578430
30	2.373855	47.108.144.139	172.31.20.24	TCP	198	9443 → 41372 [PSH, ACK] Seq=1 Ack=2 Win=227 Len=132 TSval=1766583433 TSecr=2546593558 [TCP segment of a reassembled PDU]
31	2.373887	172.31.20.24	47.108.144.139	TCP	54	41372 → 9443 [RST] Seq=2 Win=0 Len=0
32	2.373894	47.108.144.139	172.31.20.24	TCP	98	9443 → 41372 [PSH, ACK] Seq=133 Ack=2 Win=227 Len=32 TSval=1766583433 TSecr=2546593558 [TCP segment of a reassembled PDU]
33	2.373897	172.31.20.24	47.108.144.139	TCP	54	41372 → 9443 [RST] Seq=2 Win=0 Len=0
34	2.373899	47.108.144.139	172.31.20.24	TCP	84	9443 → 41372 [PSH, ACK] Seq=165 Ack=2 Win=227 Len=18 TSval=1766583433 TSecr=2546593558 [TCP segment of a reassembled PDU]
35	2.373901	172.31.20.24	47.108.144.139	TCP	54	41372 → 9443 [RST] Seq=2 Win=0 Len=0
36	2.373902	47.108.144.139	172.31.20.24	HTTP	66	HTTP/1.1 426 Upgrade Required          (text/plain)
37	2.373904	172.31.20.24	47.108.144.139	TCP	54	41372 → 9443 [RST] Seq=2 Win=0 Len=0
No.	Time	Source	Destination	Protocol	Length	Info
161	7.803316	47.245.52.190	172.31.20.24	TCP	74	43686 → 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2824405820 TSecr=0 WS=128
220	8.818459	47.245.52.190	172.31.20.24	TCP	74	[TCP Retransmission] 43686 → 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2824406835 TSecr=0 WS=128
244	10.834435	47.245.52.190	172.31.20.24	TCP	74	[TCP Retransmission] 43686 → 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2824408851 TSecr=0 WS=128

HTTP TCP sessions from other processes seem fine.

@larskuhtz
Author

Please let me know what additional data would help diagnose the issue. I can provide larger TCP dumps and/or detailed netstat output if needed. I can also provide direct ssh access to affected machines if that helps with understanding or solving the issue.

@kazu-yamamoto
Collaborator

Sorry for the delay. This is probably due to gracefulClose. First of all, if you are using warp < 3.3.4, please upgrade to warp >= 3.3.5. Since 3.3.5, warp uses close() for HTTP/1.1 by default.

If the problem continues even then, please check whether HTTP/2 connections are being used. If so, please use setGracefulCloseTimeout2 0 to disable gracefulClose for HTTP/2. If this fixes the problem, the source of the bug is definitely gracefulClose: the timeout for closing sockets is not being handled correctly.
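
For concreteness, a minimal settings sketch of the mitigation described above, assuming warp >= 3.3.5; the port and the trivial handler are placeholders, not taken from this thread:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp

main :: IO ()
main = runSettings settings app
  where
    settings =
        setGracefulCloseTimeout2 0   -- HTTP/2: use plain close(2), no gracefulClose
      . setGracefulCloseTimeout1 0   -- HTTP/1.1: already the default since warp 3.3.5
      . setPort 8080
      $ defaultSettings

app :: Application
app _req respond = respond (responseLBS status200 [] "OK")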

@fosskers

We are definitely using the newest warp, so we'll try it with setGracefulCloseTimeout.

@kazu-yamamoto
Collaborator

This article would help your understanding: https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939

@kazu-yamamoto
Collaborator

@larskuhtz Would you close this issue if it is already resolved?

@kazu-yamamoto kazu-yamamoto self-assigned this May 19, 2020
@kazu-yamamoto kazu-yamamoto changed the title Regression with version >= 3.1.1.0 gracefullclose stops servers due to a lot of TCP states May 19, 2020
@kazu-yamamoto
Collaborator

I had the same experience. I will try to fix it.

@fosskers

fosskers commented May 19, 2020

Thank you! Should the 3.1.1.x series be marked as deprecated on Hackage, once 3.1.2.0 is released?

@kazu-yamamoto
Collaborator

Yes. I will do so.

@kazu-yamamoto
Collaborator

kazu-yamamoto commented May 19, 2020

@snoyberg I'm CC'ing you here since the current approach of gracefulClose was suggested by you and this relates to Warp.

TCP and server: CLOSE_WAIT is the state in which the server's TCP stack has received a TCP FIN but has not yet sent its own FIN. This means the server has not yet called close(2) on the socket. The TCP stack waits forever in this situation.

Warp: When a TCP FIN is received, connRecv returns "". serve in fork returns and then connClose is called. So far, so good. But gracefulClose cannot call close. Why?

https://github.com/haskell/network/blob/master/Network/Socket/Shutdown.hs#L61

I suspect two things:

  1. An asynchronous exception from Warp's time manager
  2. The callback is not fired by GHC's TimerManager

But I don't have any clues so far. Could you suggest anything?

Note that shutdown can throw an exception.

@kazu-yamamoto
Collaborator

It might be wise to call recvBufNoWait after shutdown, for the case where a FIN has already been received (like C's do { } while () loop).

@kazu-yamamoto kazu-yamamoto changed the title gracefullclose stops servers due to a lot of TCP states gracefulClose stops servers due to a lot of TCP states May 19, 2020
@snoyberg

I think it was actually @nh2 who proposed the implementation of gracefulClose. Maybe he has some thoughts, I unfortunately don't.

@kazu-yamamoto
Collaborator

Approach 4 in this article (https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939) was proposed by you. :-)

@snoyberg

Fair enough :) What I mean is that I don't really know the details of the underlying TCP states, and don't have much insight into that side of the equation.

@kazu-yamamoto
Collaborator

If I don't find any solutions for Approach 4 and Approach 3 solves this issue, I will remove Approach 4 and use Approach 3.

@fosskers

@larskuhtz might also have some thoughts tomorrow.

@larskuhtz
Author

I haven't gotten around to taking a closer look at this for a while. I remember looking at the code for graceful close a while ago and didn't spot anything suspicious, but I am not an expert. I am happy to test changes. It usually takes a few days to reproduce the issue.

@larskuhtz
Author

larskuhtz commented May 20, 2020

Looking at the code again, I wonder whether the calls putMVar mvar TimeoutTripped and putMVar mvar MoreData should use tryPutMVar instead of putMVar so that the loser of the race doesn't get stuck.

Although I would expect the callbacks to get canceled, since putMVar is interruptible.

@kazu-yamamoto
Collaborator

@larskuhtz I'm going to give up on the race approach and would like to switch to threadDelay.

Would you test https://github.com/kazu-yamamoto/network/tree/next-version?
b1c8b6e is the hack.

@nh2
Member

nh2 commented May 20, 2020

The related PR in which gracefulClose was introduced: yesodweb/wai#759

Original issue: yesodweb/wai#673

But gracefulClose cannot call close. Why?

I think that's the key question we need to answer, probably before changing anything.

I wonder whether the calls putMVar mvar TimeoutTripped and putMVar mvar MoreData should use tryPutMVar instead of putMVar so that the loser of the race doesn't get stuck.

That's a good point, but it seems to me it cannot affect whether close is called. wait <- takeMVar mvar should terminate as long as anybody calls putMVar, and then recvEOFev should terminate and thus E.finally trigger close s in

gracefulClose s tmout = sendRecvFIN `E.finally` close s

please check if HTTP/2 connections are used

@larskuhtz What was the answer to that, can you confirm which protocol is used?

It would usually take a few days to reproduce the issue.

If you send a lot of requests (e.g. with a script), does it reproduce faster?

@nh2
Member

nh2 commented May 20, 2020

@kazu-yamamoto Another potentially related question:

For non-threaded, recvEOFloop loops and calls recvBufNoWait s buf multiple times.

But for threaded, recvEOFev calls it only once and does not loop. Is that intended?

I suspect it is a bug, because it means that after shutdown(), warp only receives 1024 more bytes and then immediately calls close(), thus not waiting until the specified grace timeout of e.g. 2 seconds, and killing the connection like in the original issue. This is likely to happen immediately, because the other side has likely sent more than 1024 bytes in flight, so the recvfrom() will instantly return the 1024 bytes.

@kazu-yamamoto
Collaborator

But for threaded, recvEOFev calls it only once and does not loop. Is that intended?

Yes.

I suspect it is a bug, because it means that after shutdown(), warp only receives 1024 more bytes and then immediately calls close(), thus not waiting up until the specified grace timeout of e.g. 2 seconds, and killing the connection like in the original issue. This is likely to happen immediately, because the other side likely sent more than 1024 bytes in flight, so the recvfrom() will instantly return the 1024 bytes.

Possibly. But this bug is not related to this issue.

@kazu-yamamoto
Collaborator

@nh2 If we keep the race approach, we should:

  • Call recvBufNoWait first (and then call recvEOFev) for the case where a FIN has already been received
  • Change putMVar to tryPutMVar, as suggested by @larskuhtz

@kazu-yamamoto
Collaborator

@larskuhtz I'm also testing the no-race version on https://mew.org/ where I saw the same issue.

@larskuhtz
Author

@nh2

please check if HTTP/2 connections are used

@larskuhtz What was the answer to that, can you confirm which protocol is used?

It's been some time since we last ran binaries with the affected version of the network package. We don't keep telemetry data long enough to answer that question now. Most connections are made via clients using http-client-tls. If I remember correctly, http-client only supports HTTP/1.1, but I am not 100% sure. Since these are public servers on a public P2P network, I can't rule out that some connections also come from other clients, possibly using HTTP/2. But those would most likely be only a small fraction.

It is a P2P network and our software is acting both as server (using warp-tls) and client (using http-client-tls) at the same time. So, the issue may also involve client side behavior.

It would usually take a few days to reproduce the issue.

If you send a lot of requests (e.g. with a script), does it reproduce faster?

Not sure. Our network generally already has a relatively high baseline load. I'll try to find some time tomorrow to build and deploy a node with a recent network version and try it out.

@larskuhtz
Author

Could the following observation be a clue:

... many of the TCP sockets are in a CLOSE_WAIT state. Most of those sockets are not listed by lsof, but are only shown by netstat without an associated process.

Not sure whether that means the sockets aren't associated with the application any more. I wasn't aware that sockets in the CLOSE_WAIT state could be disassociated from the application that opened them.

@kazu-yamamoto
Collaborator

kazu-yamamoto commented May 25, 2020

Yes. This issue can be reproduced on mew.org.

@nh2
Member

nh2 commented May 26, 2020

Looking at the code again, I wonder whether the calls putMVar mvar TimeoutTripped and putMVar mvar MoreData should use tryPutMVar instead of putMVar so that the loser of the race doesn't get stuck.

@snoyberg and I looked at this in detail and we think that this argument is right. If putMVar mvar MoreData happens first, but an async exception arrives before/during wait <- takeMVar mvar, then putMVar mvar TimeoutTripped could block forever.

We are not quite sure if the Ev.unregisterTimeout will correctly cancel a stuck putMVar mvar (in which case it's fine that putMVar is used), but by using tryPutMVar we would not even have to think about it.

But none of this could affect whether close is called on the socket.
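
For illustration, a minimal sketch of the race pattern under discussion with the tryPutMVar change applied. It uses plain forkIO/threadDelay instead of the GHC event/timer-manager callbacks, and receiveSomething is a placeholder, so this is not the Network.Socket.Shutdown code:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)

data Wait = TimeoutTripped | MoreData deriving (Show)

raceSketch :: Int -> IO () -> IO Wait
raceSketch timeoutMicros receiveSomething = do
  mvar <- newEmptyMVar
  void . forkIO $ do
    threadDelay timeoutMicros
    void (tryPutMVar mvar TimeoutTripped)   -- the loser just gets False back
  void . forkIO $ do
    receiveSomething
    void (tryPutMVar mvar MoreData)         -- likewise, never blocks
  takeMVar mvar                             -- whichever writer won the race

With putMVar instead of tryPutMVar, the losing writer would block on the already-full MVar, which is exactly the concern raised above.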

kazu-yamamoto added a commit to kazu-yamamoto/network that referenced this issue May 26, 2020
@kazu-yamamoto
Collaborator

@nh2 @larskuhtz Please take a look at f3da242. I will also test this in the real world.

@kazu-yamamoto
Collaborator

It makes sense to me that the GHC runtime becomes inactive when a lot of callbacks remain. Many sockets are probably already closed; some other sockets remain in CLOSE_WAIT.

@kazu-yamamoto
Collaborator

In my field test, the modified approach 4 creates never-ending CLOSE_WAIT sockets:

% sudo netstat -anpto | grep -i wait
tcp 1 0 XX.XX.XX.XX:80 YY.YY.YY.YY:29824 CLOSE_WAIT 1464/./mighty-20200 off (0.00/0/0)

The timer (0.00/0/0) never increases.

I will probably give up on approach 4.

@kazu-yamamoto
Collaborator

Now I have many never-ending CLOSE_WAIT sockets. I am giving up on approach 4 and will merge b1c8b6e to master.

@swamp-agr

swamp-agr commented Feb 11, 2021

I can confirm this behaviour in our case with network-3.1.2.1 and warp-3.3.14, with both zero and non-zero setGracefulCloseTimeout1 settings. I will try to provide more details (e.g. tcpdump) for stalled connections in the CLOSE_WAIT state.

@swamp-agr

I have captured a small subset of TCP packets. I think I should capture a bit longer and provide another example with TCP sessions for the CLOSE_WAIT cases.
In the meantime:

17:50:11.885845 IP localhost.61105 > localhost.8080: Flags [F.], seq 3522902604, ack 1555808692, win 43, options [nop,nop,TS val 1221225641 ecr 1221223145], length 0
17:50:11.922909 IP localhost.8080 > localhost.61105: Flags [.], ack 1, win 171, options [nop,nop,TS val 1221225651 ecr 1221225641], length 0
17:50:17.227761 IP localhost.8080 > localhost.59533: Flags [P.], seq 3877136620:3877140818, ack 3153943839, win 45, options [nop,nop,TS val 1221226977 ecr 1221201838], length 4198
17:50:17.227800 IP localhost.59533 > localhost.8080: Flags [R], seq 3153943839, win 0, length 0
17:50:19.461116 IP localhost.61396 > localhost.8080: Flags [S], seq 750070939, win 43690, options [mss 65495,sackOK,TS val 1221227535 ecr 0,nop,wscale 10], length 0
17:50:19.461127 IP localhost.8080 > localhost.61396: Flags [S.], seq 2450982214, ack 750070940, win 43690, options [mss 65495,sackOK,TS val 1221227535 ecr 1221227535,nop,wscale 10], length 0
17:50:19.461137 IP localhost.61396 > localhost.8080: Flags [.], ack 1, win 43, options [nop,nop,TS val 1221227535 ecr 1221227535], length 0
17:50:19.461161 IP localhost.61396 > localhost.8080: Flags [P.], seq 1:922, ack 1, win 43, options [nop,nop,TS val 1221227535 ecr 1221227535], length 921
17:50:19.461177 IP localhost.8080 > localhost.61396: Flags [.], ack 922, win 45, options [nop,nop,TS val 1221227535 ecr 1221227535], length 0

For the (.61105) case I expected a subsequent FIN from Warp (.8080).
The other two cases seem to be incomplete.

@kazu-yamamoto
Collaborator

@swamp-agr Do you use HTTP/1.1 on Linux?

If you specify 0 for setGracefulCloseTimeout1, close(2) is used, so this is not related to graceful close. Actually, I have never seen CLOSE_WAIT on Linux with network 3.1.2.1 so far.

@swamp-agr

swamp-agr commented Feb 12, 2021

Hi @kazu-yamamoto,

  • Our case reproduces under a constant 400+ rps load after more than 7 hours of server uptime.
  • We use nginx-1.11.2 as a proxy with keepalive for the upstream Warp. (In this case connections in the TIME_WAIT state should be reused.)
  • The client usually drops the connection after 3-5 s of waiting.
  • AFAIK we use HTTP/1.1. To be sure, I will explicitly set the proxy HTTP/1.1 header.

If you prefer, I will move this case to a separate issue.

@kazu-yamamoto
Collaborator

@swamp-agr If you use Linux, please check net.ipv4.tcp_fin_timeout. The default value, 60 (seconds), might be too long for your use case. Also, please check net.ipv4.tcp_tw_reuse. This should be 1 for your case. The following settings in /etc/sysctl.conf might help:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

@kazu-yamamoto
Collaborator

Note that tcp_tw_recycle could also be related. But it is a bit dangerous and was removed in Linux 4.1.2.
Anyway, please look into these parameters.

@swamp-agr

Issue reproduced even with these values:

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15

tcp_tw_recycle is indeed dangerous: enabling it makes the server suffer and drop connections. I think it is incompatible with the NGINX keepalive setting for the upstream server.

@swamp-agr

I think the problem is not in close but somewhere else; I am not sure where.
Please let me know if I should open a new issue or move this to warp.

In the meantime I gathered two pictures (server up and running, and sockets starting to leak).

  • Server is up and running:
    normal_sockets

  • Leaking sockets picture:
    leaking_sockets


@kazu-yamamoto
Collaborator

@swamp-agr If you believe this is a bug in Warp, please send this issue to the Warp repository.

@swamp-agr

swamp-agr commented Feb 20, 2021

  • Root cause found inside the application code (as usual).
  • Handlers were running indefinitely, which caused Warp to wait endlessly.
  • The client dropped the connection.
  • NGINX closed the socket from its side.
  • Warp still waited for the application handler.

With a constant RPS, the server accumulates stalled sockets.

Expected result: the application handler should finish its job, and Warp should respond and close the connection.
Actual result: the application handler never stopped; Warp kept waiting.

You might close the issue.
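
For readers hitting the same failure mode, a hedged sketch of one way to bound handler execution on the application side. handlerTimeout, the 10-second limit in the usage note, and the 503 fallback are illustrative assumptions, not part of Warp or of any fix discussed here; also note that if the handler has already started streaming a response, the fallback respond may itself fail:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types (status503)
import Network.Wai (Middleware, responseLBS)
import System.Timeout (timeout)

handlerTimeout :: Int -> Middleware
handlerTimeout seconds app req respond = do
  result <- timeout (seconds * 1000000) (app req respond)
  case result of
    Just received -> pure received          -- handler finished in time
    Nothing       -> respond (responseLBS status503 [] "handler timed out")

-- Usage sketch: runSettings settings (handlerTimeout 10 app)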

@kazu-yamamoto
Collaborator

@swamp-agr I am closing this issue. Please bring it to Warp.

@nh2
Member

nh2 commented Nov 4, 2023

@larskuhtz @swamp-agr

The number of open file descriptors is moderate, but many of the TCP sockets are in a CLOSE_WAIT state. Most of those sockets are not listed by lsof, but are only shown by netstat without an associated process.

I'm now hitting exactly that problem.

I asked a question, and provided an answer, on how and why these process-less, FD-less CLOSE_WAITs exist:

The answer is:

A CLOSE_WAIT state without associated process occurs when a client waiting in the Linux kernel's listen() backlog queue disconnects before the user-space application accept()s it.

I haven't figured out yet why my warp application stops accept()ing for multiple minutes, creating the CLOSE_WAITs.
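
As a hypothetical minimal reproducer of such process-less CLOSE_WAIT sockets (port 8080 and the use of the network package are my assumptions): listen but never accept, then let a client connect and give up. The handshaken connection stays in the kernel's backlog, and after the client's FIN it shows up in netstat as CLOSE_WAIT with no owning PID, because no file descriptor was ever handed to user space:

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Network.Socket

main :: IO ()
main = do
  sock <- socket AF_INET Stream defaultProtocol
  setSocketOption sock ReuseAddr 1
  bind sock (SockAddrInet 8080 0)   -- 0 = INADDR_ANY
  listen sock 128                   -- completed handshakes queue up here
  -- Deliberately never call accept. Connect with e.g.
  --   curl --max-time 2 http://127.0.0.1:8080/
  -- then check:  netstat -ant | grep CLOSE_WAIT
  forever (threadDelay 1000000)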

@kazu-yamamoto
Collaborator

@nh2 Thank you for bringing this answer!

And now I think I can answer your question.
Callbacks for the IO and Timer managers MUST NOT block.
If they block, the entire loops of the IO and Timer managers are blocked.
In this situation, all kinds of bad things can happen (including the failure to accept).
If the callbacks can block, we MUST use forkIO.
That's why timeout calls forkIO if necessary.

Originally, we tried to use the callback approach to avoid forking a new thread.
But calling forkIO is unavoidable in that approach.
Thus, the timeout approach is better, because forkIO is called only when it is necessary.

@nh2
Member

nh2 commented Nov 9, 2023

Thus, the timeout approach is better because forkIO is called only when it is necessary.

@kazu-yamamoto Just for me to get back into context:

Where are those callbacks / is this something that was recently changed, or that you plan to change (e.g. open or already-closed issue or PR)?

Because I'm still currently investigating what to do about those blocked accepts.

Your explanation seems to fit my symptoms ("the entire loops of the IO and Timer managers are blocked"), because the process really seems to stop doing almost everything for a while -- not very good for my web server when it happens :D

@kazu-yamamoto
Collaborator

Where are those callbacks / is this something that was recently changed, or that you plan to change (e.g. open or already-closed issue or PR)?

I guess that you are talking about the graceful close.
Our final decision was to adopt the threadDelay approach.
So, our graceful close is NOT suffering from CLOSE_WAIT.

See approach 3 in https://kazu-yamamoto.hatenablog.jp/entry/2019/09/20/165939
I should probably update this article.
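
For context, a simplified sketch of the threadDelay/timeout idea: send our FIN with shutdown, wait a bounded time for the peer's FIN (recv returning ""), then always close. This is an illustration under those assumptions, not the actual gracefulClose implementation; note that System.Timeout.timeout forks a thread only when needed, which matches the forkIO remark above:

import qualified Control.Exception as E
import qualified Data.ByteString as BS
import Control.Monad (void)
import Network.Socket (ShutdownCmd (ShutdownSend), Socket, close, shutdown)
import Network.Socket.ByteString (recv)
import System.Timeout (timeout)

gracefulCloseSketch :: Socket -> Int -> IO ()
gracefulCloseSketch sock timeoutMs =
    sendRecvFIN `E.finally` close sock        -- close(2) is always reached
  where
    sendRecvFIN = do
      shutdown sock ShutdownSend              -- may throw if the peer already reset
      void (timeout (timeoutMs * 1000) drain)
    drain = do
      bs <- recv sock 1024
      if BS.null bs then pure () else drain   -- "" means the peer's FIN arrived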

@nh2
Member

nh2 commented Nov 9, 2023

So, our graceful close is NOT suffering from CLOSE_WAIT.

@kazu-yamamoto Because my server is suffering from 3000 CLOSE_WAITs a couple of times per week; I'm on:

network-3.1.2.9
wai-3.2.3
warp-3.3.23

@kazu-yamamoto
Collaborator

@nh2 Understood.
If you stop using gracefulClose, does CLOSE_WAIT disappear?

@nh2
Member

nh2 commented Oct 21, 2024

I haven't figured out yet why my warp application stops accept()ing for multiple minutes, creating the CLOSE_WAITs.

An update on this:

My application was calling unsafe FFI to process data coming from an mmap (to hash a 100 GB file). That is illegal, because unsafe foreign calls must only be used on extremely short-running functions. Otherwise the entire Haskell process blocks during GC. This happened to me, so my entire process blocked for 20 minutes (until the hashing of 100 GB completed).

This of course caused my process to stop calling any function, including accept(), and thus the CLOSE_WAITs accumulated (I have automated HTTP monitoring that queries my server every second to see if it's still up; if it doesn't reply within a 2 second timeout, my monitoring client disconnects, and so its un-accept()ed connection sits in the kernel's listen queue as CLOSE_WAIT, and every second a new one was added).

You can read more about it here:

It was difficult to figure out because mmap access does not show up in strace. Avoid mmap when you can; it's invisible and thus hard to debug!

Also scrutinise any libraries for unsafe calls, even if they only do memcpy; if they accept an arbitrary-sized ByteString, they will block your entire process until they are done.
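
To illustrate the distinction, a hedged sketch with hypothetical bindings (hash_buffer and c_hashBuffer are made up for this example; strlen is real):

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CChar, CSize, CULong)
import Foreign.Ptr (Ptr)

-- Fine as 'unsafe': returns almost immediately.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: Ptr CChar -> IO CSize

-- A hash over a huge mmap'd buffer can take minutes; declaring it 'unsafe'
-- would stall the whole runtime (GC, the accept loop, everything). A 'safe'
-- call lets the RTS keep running other Haskell threads meanwhile.
foreign import ccall safe "hash_buffer"
  c_hashBuffer :: Ptr CChar -> CSize -> IO CULong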


This does not imply that there are no further bugs in network or warp, just that I found one definite cause of the Haskell process freezing that was a problem in my application.
