TCP Connection reset after a single lost keepalive packet #6250
Comments
Our keepalive implementation is based on proposals A8 and A9. Any changes here would need to be made across gRPC implementations. If you want to propose any changes to this spec, you can do so on the grpc-proposal repository. Lost TCP keepalive packets might not be re-transmitted. But gRPC keepalives use an HTTP/2 PING frame. Why are these getting lost and not retransmitted? Another thing to note is that gRPC sets the TCP keepalive timeout only when gRPC keepalive Can you share your gRPC keepalive configuration on the client and server side? Thanks. |
Other gRPC implementations are most likely not affected by the same issue. The Design Background of A8 states that TCP keepalives are With this high interval, TCP keepalives are likely never sent out because other keepalive mechanisms in gRPC/HTTP/2 send data more regularly. These keepalives are normal TCP packets and are re-transmitted when a single packet drops. But TCP connections in Go default to a very short keepalive interval of just 15 seconds. Therefore this issue is only present in the Go gRPC implementation. I am using the default keepalive settings on both the server and client side. That's why I think that this issue should at least be documented, or better, "fixed" by setting more forgiving default settings. |
Thanks for the explanation @Lucaber Do you see the problem even in the case where you set both And since you are seeing the TCP connection getting closed after a single lost keepalive frame, are you saying that |
Correct, this depends on the values configured. The client and server configuration can mostly be viewed independently, as the same issue happens in both directions. Setting the Setting the
Another way to fix this problem is disabling the TCP keepalive interval (or setting it to 2 hours). This can only be done by the user because grpc-go does not initialize the TCP socket:
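A minimal sketch of such a user-supplied dialer, using only the standard library. The function names are hypothetical; the only assumption about grpc-go is that `grpc.WithContextDialer` accepts a function with this signature.

```go
package main

import (
	"context"
	"net"
)

// newDialerWithoutTCPKeepalive returns a dialer that does not enable
// TCP keepalive probes. A negative KeepAlive value tells Go to skip its
// own keepalive configuration (15s by default).
func newDialerWithoutTCPKeepalive() *net.Dialer {
	return &net.Dialer{KeepAlive: -1}
}

// dialWithoutTCPKeepalive has the shape
// func(context.Context, string) (net.Conn, error),
// which is what grpc.WithContextDialer takes.
func dialWithoutTCPKeepalive(ctx context.Context, addr string) (net.Conn, error) {
	return newDialerWithoutTCPKeepalive().DialContext(ctx, "tcp", addr)
}
```

A client would then pass `grpc.WithContextDialer(dialWithoutTCPKeepalive)` as a dial option, leaving liveness detection to the gRPC/HTTP/2 keepalive mechanism.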
|
This can be controlled by setting the If the client sets Apart from documenting this, do you have any other suggestions for handling this issue? Thanks. |
When setting TCP_USER_TIMEOUT to the default value of 20 seconds, the tcp connection resets after only a single lost keepalive packet. This is the result of golang tcp sockets defaulting to a tcp keepalive interval of just 15 seconds.
Sorry, I just noticed that the default Otherwise correct, setting the I'm not sure how to handle this problem correctly. The best way to solve this issue according to the spec is setting the correct TCP keepalive interval. But this can only be set by the user explicitly and is not controlled by the library. Another way would be to increase the After investigating the issue a bit more, I found out that it is actually possible to change the TCP keepalive settings after the connection is already established. This means we are able to disable TCP keepalives when gRPC/HTTP/2 keepalives are enabled. |
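As a rough illustration of changing the setting on an established connection, the sketch below uses the standard library's `(*net.TCPConn).SetKeepAlive`. The helper names are hypothetical and this is not necessarily how the mentioned commit does it.

```go
package main

import "net"

// disableTCPKeepalive turns off TCP keepalive probes on an already
// established connection. Non-TCP connections are left untouched.
func disableTCPKeepalive(conn net.Conn) error {
	tc, ok := conn.(*net.TCPConn)
	if !ok {
		return nil // nothing to do for non-TCP transports
	}
	return tc.SetKeepAlive(false)
}

// demoDisableOnLoopback dials a throwaway loopback listener and disables
// keepalive on the resulting connection.
func demoDisableOnLoopback() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer conn.Close()

	return disableTCPKeepalive(conn)
}
```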
Any updates on this issue? The mentioned commit seems to be the best option to handle this problem without requiring an action by the user. |
We have several people out of the office for an extended amount of time, so this is unlikely to get any attention for a few weeks, sorry. |
Sorry, haven't been able to get back to this. Will definitely pick this up this week. |
Go TCP support does the following by default when creating a TCP connection:
Now, if gRPC keepalives are configured, the
IIUC, if the
And because gRPC has set @ejona86 : Do you have any thoughts on how to handle this? |
TCP keepalive? 15s?! That's really aggressive. That's quite different than I remember when I looked at Go before. I think that is likely a Go regression. Seems it was caused by golang/go@fbf763f ? Go's configuration for keepalive is partly counter-productive[1] and we were purposefully not changing it. We should have just been setting "use TCP keepalive" but using the OS defaults. If we can't use OS defaults any more, then Go has broken us.
|
Looks like a yes: golang/go#48622 |
Are you saying we should have just used TCP keepalives instead of implementing gRPC level keepalives? Or are you saying we should have set TCP keepalive options instead of setting Also, what are your thoughts about disabling TCP keepalives when gRPC keepalives are enabled? This is what I'm talking about. |
I'm saying:
(2) and (3) aren't documented in any gRFC, I believe. (2) was done as part of the ALTS work, and it does yield better behavior for all folks. (3) is something that Java's been doing forever, and still does independent of (1). Even with (1), (3) is still useful as some people configure their OS to use more aggressive settings for their specific environment (e.g., AWS). For those environments (3) behaves well without any extra work from users. And if you aren't in such an environment, then (3) is very low or zero cost. |
(1) We currently do implement HTTP/2 level keepalives as described in A8 and A9. (2) We currently also set (3) The TCP implementation in Go currently enables TCP keepalives by default unless explicitly specified by the user to disable it (either via a Dialer config or via a Listener config). The slight nuance here is that on the server side, we end up setting
So, even after (1) (2) and (3), if a user enables gRPC keepalives on the client and sets
Would it make sense to at least set the TCP keepalive time to whatever is configured using |
Looks like in Java we only set TCP_USER_TIMEOUT on client-side. I make no argument whether that is a good idea or not.
No. This is the broken Go behavior. Bad things will happen. We don't want that. We file a bug and get them to fix it. And then we can consider workarounds for the current version once they respond. |
Proposal golang/go#62254 has been created to enable setting of TCP keepalive time and interval separately. Doesn't affect us directly since we were not setting TCP keepalive socket options explicitly. We do have the option of disabling the setting of these TCP keepalive values by Go to the default of
|
It would be nice if @Lucaber or @JaydenTeoh are able to reproduce the original issue and then confirm that the fix in #6672 truly fixes the issue. Maybe @Lucaber can confirm? |
Also it appears the fix is incomplete. We need to make sure we're setting the keepalive socket option as appropriate (see #6672 (comment)). |
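One way to keep TCP keepalive enabled while leaving time and interval at the OS defaults is sketched below. It assumes a Unix-like platform (the `syscall` calls take an `int` file descriptor) and uses hypothetical helper names; it is an illustration of the idea, not the code from #6672.

```go
package main

import (
	"net"
	"syscall"
)

// enableSOKeepalive sets SO_KEEPALIVE on the raw socket without touching
// the keepalive time or interval, so the OS defaults apply.
func enableSOKeepalive(_, _ string, c syscall.RawConn) error {
	var sockErr error
	if err := c.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_KEEPALIVE, 1)
	}); err != nil {
		return err
	}
	return sockErr
}

// newOSDefaultKeepaliveDialer stops Go from applying its own 15s settings
// (negative KeepAlive) and enables the socket option in Control instead.
func newOSDefaultKeepaliveDialer() *net.Dialer {
	return &net.Dialer{KeepAlive: -1, Control: enableSOKeepalive}
}

// soKeepaliveEnabled reads SO_KEEPALIVE back from a connection.
func soKeepaliveEnabled(tc *net.TCPConn) (bool, error) {
	raw, err := tc.SyscallConn()
	if err != nil {
		return false, err
	}
	var val int
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		val, sockErr = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_KEEPALIVE)
	}); err != nil {
		return false, err
	}
	return val != 0, sockErr
}

// demoOSDefaultKeepalive dials a loopback listener with the dialer above
// and reports whether SO_KEEPALIVE ended up enabled.
func demoOSDefaultKeepalive() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()
	conn, err := newOSDefaultKeepaliveDialer().Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer conn.Close()
	return soKeepaliveEnabled(conn.(*net.TCPConn))
}
```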
@JaydenTeoh -- Did you want to patch the fix for #6672 (comment)? |
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
@arvindbr8 Sorry for the delay. I'll be happy to work on the fix! Please give me slightly more than a week, if that is okay, because I am currently caught up in finals. |
It seems this behavior change may cause a regression for some use cases, so I think we'll need to prioritize this higher and get it done and backported to the release branch before the release. |
@dfawley : Based on our offline discussion, I sent out a PR to fix things on the client-side. For the server side, the current recommended approach of asking the user to supply a There is a Instead my suggestion would be as follows:
|
I think we should mention this to the user to do it in their listener (via a wrapper) if they care, and not do it ourselves. Or we should have a ServerOption that causes us to do this. But the default behavior is probably fine for most users and not worth trying to override. We could end up overriding their other settings by doing this. |
Makes sense (for the user to do it themselves in their Listener). Will actually add a commit to the existing PR which simply updates the comment for the server side. Thanks. |
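A listener wrapper of the kind mentioned above might look like the following sketch. The type name, the field name, and the choice to adjust the keepalive period are assumptions for illustration; a user could just as well call `SetKeepAlive(false)` in `Accept`.

```go
package main

import (
	"net"
	"time"
)

// keepaliveListener wraps a net.Listener and overrides the TCP keepalive
// period that Go applies to accepted connections.
type keepaliveListener struct {
	net.Listener
	period time.Duration
}

// Accept adjusts the keepalive period on TCP connections and passes any
// other connection type through unchanged.
func (l keepaliveListener) Accept() (net.Conn, error) {
	conn, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	if tc, ok := conn.(*net.TCPConn); ok {
		if err := tc.SetKeepAlivePeriod(l.period); err != nil {
			conn.Close()
			return nil, err
		}
	}
	return conn, nil
}

// demoWrappedAccept accepts one loopback connection through the wrapper.
func demoWrappedAccept() error {
	base, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ln := keepaliveListener{Listener: base, period: 2 * time.Hour}
	defer ln.Close()

	// The kernel completes the handshake via the listen backlog, so the
	// dial can finish before Accept is called.
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return err
	}
	return server.Close()
}
```

The wrapped listener can be handed to the gRPC server's `Serve` method in place of the plain one.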
Fixed by #6834. |
What version of gRPC are you using?
1.54.0
What version of Go are you using (go version)?
go1.20.2 linux/amd64
What operating system (Linux, Windows, …) and version?
Linux 6.2
Bug Report
The configuration of TCP_USER_TIMEOUT to 20 seconds in #2307 and #5219, together with the Go default TCP keepalive interval of just 15 seconds, results in a TCP connection being reset after a single lost keepalive packet.
Lost packets of a TCP connection are normally re-transmitted after a short amount of time, well within the 20 second timeout. But TCP keepalive packets are not re-transmitted (ACK segments that contain no data are not reliably transmitted by TCP). Therefore the timeout is reached after just a single lost packet.
Normally, not re-transmitting TCP keepalive packets is fine, as the connection is only reset after TCP_KEEPCNT (default=9) lost keepalive packets.
Test
TCPDump of a test grpc connection (disabled keepalives on the client to reduce packet count, the same issue can be reproduced with default keepalives on both the server and client):
Increasing the TCP_USER_TIMEOUT to 50 seconds results in the connection only being reset after 3 lost keepalive packets.
Note: Here "keepalive" refers to the gRPC/HTTP/2 keepalive mechanism and timeout configuration, not the TCP keepalives.
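The numbers reported above (1 lost probe at 20 seconds, 3 at 50 seconds) can be reproduced with a simplified model. It assumes that the idle time before the first probe and the probe interval are both 15 seconds, and that the connection is reset on the first keepalive timer tick at which it has been idle for at least TCP_USER_TIMEOUT with a probe outstanding. The function name is hypothetical and the model ignores real kernel details.

```go
package main

// lostProbesBeforeReset estimates how many consecutive keepalive probes
// must be lost before the connection is reset. All durations are in
// seconds. A userTimeoutSec of 0 means TCP_USER_TIMEOUT is unset, in
// which case only keepCnt (TCP_KEEPCNT) limits the number of probes.
func lostProbesBeforeReset(intervalSec, userTimeoutSec, keepCnt int) int {
	if userTimeoutSec <= 0 {
		return keepCnt
	}
	idle := intervalSec // the first probe goes out after one idle interval
	for lost := 1; lost <= keepCnt; lost++ {
		idle += intervalSec // the timer fires again one interval later
		if idle >= userTimeoutSec {
			return lost // user timeout exceeded with a probe outstanding
		}
	}
	return keepCnt
}
```

Under these assumptions, `lostProbesBeforeReset(15, 20, 9)` gives 1 and `lostProbesBeforeReset(15, 50, 9)` gives 3, matching the observations in this report.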