Add KeepAlive support #1648
I'm currently trying to push forward a cross-language grpc spec for HTTP/2 PING-based keepalives and have already been investigating this pretty deeply. I'm hoping to get something public "soon." The design leaves open the possibility of supporting TCP keepalives in addition, but that wasn't going to be an initial focus. Go's TCP keepalive leaves something to be desired because it does not expose enough knobs to solve all that we'd want keepalive to solve, namely, detecting broken connections in a timely fashion. Java's TCP keepalive is much weaker in that you can only turn it on; you'd need to use sysctl or similar to change the OS's default settings (as actually documented for the GCE LB). |
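To make the Java limitation concrete, here is a minimal sketch using the standard `java.net.Socket` API: it only exposes an on/off switch for TCP keepalive, while the probe schedule is left to OS-level settings such as sysctl (host and port are placeholders).

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpKeepaliveDemo {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress("example.com", 80));
        // Java can only enable TCP keepalive. The probe timing
        // (e.g. net.ipv4.tcp_keepalive_time, 7200s by default on Linux)
        // must be tuned at the OS level, not through this API.
        socket.setKeepAlive(true);
        System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        socket.close();
    }
}
```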
Please do. HTTP/2 PING as a healthcheck signal would be invaluable when implementing something like a DNS-SRV based client-side load balancer. Any ETA on these shenanigans? :> |
@mwitkow FWIW, HTTP/2 PINGs are already implemented in grpc-java. We're just missing cross-language consensus on what exactly we want to do and how. |
Here is an excerpt from the document I'm trying to get agreement on:

TCP keepalive is hard to configure in Java and Go. Enabling it is easy, but one hour is far too infrequent to be useful; an application-level keepalive seems beneficial for configuration. TCP keepalive is active even if there are no open streams. This wastes a substantial amount of battery on mobile; an application-level keepalive seems beneficial for optimization. Application-level keepalive implies HTTP/2 PING.

If we take a page from TCP keepalive's book there are three parameters to tune: time (time since last receipt before sending a keepalive), interval (interval between keepalives when not receiving a reply), and retry (number of times to retry sending keepalives). Interval and retry don't quite apply to PING because the transport is reliable, so they will be replaced with timeout (equivalent to interval * retry), the time between sending a PING and not receiving any bytes before declaring the connection dead.

Doing some form of keepalive is relatively straightforward. But avoiding DDoS is not as easy. Thus, avoiding DDoS is the most important part of the design. To mitigate DDoS the design:

Most RPCs are unary with quick replies, so keepalive is less likely to be triggered. It would primarily be triggered when there is a long-lived RPC. Since keepalive does not occur on HTTP/2 connections without any streams, there will be a higher chance of failure for new RPCs following a long period of inactivity. To reduce the tail latency for these RPCs, it is important not to reset the 'keepalive time' when a connection becomes active; if a new stream is created and it has been longer than 'keepalive time' since the last read byte, then a keepalive PING should be sent (ideally before the HEADERS frame). Doing so detects the broken connection with a latency of 'keepalive timeout' instead of 'keepalive time + timeout'.

'keepalive time' is ideally measured from the time of the last byte read. However, simplistic implementations may choose to measure from the time of the last keepalive PING (i.e., polling). Such implementations should take extra precautions to avoid issues due to latency added by outbound buffers, such as limiting the outbound buffer size and using a larger 'keepalive timeout'.

As an optional optimization, when 'keepalive timeout' is exceeded, don't kill the connection. Instead, start a new connection. If the new connection becomes ready and the old connection still hasn't received any bytes, then kill the old connection. If the old connection wins the race, then kill the new connection mid-startup.

The 'keepalive time' is expected to be an application-configurable option, with at least second precision. It is unspecified whether 'keepalive timeout' is application-configurable, but it should be at least several times the round-trip time to allow for lost packets and TCP retransmits. It may also need to be higher to account for long garbage collector pauses. |
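For reference, the 'keepalive time' and 'keepalive timeout' knobs described in this excerpt map onto options that grpc-java later exposed on its channel builder. A minimal sketch, assuming a release that includes these methods (availability is version-dependent):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class KeepaliveClient {
    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 50051)
                // 'keepalive time': how long to wait after the last read
                // before sending a PING while streams are active.
                .keepAliveTime(5, TimeUnit.MINUTES)
                // 'keepalive timeout' (roughly interval * retry from TCP
                // keepalive): how long to wait for any bytes after the PING
                // before declaring the connection dead. Should be several
                // round-trip times to tolerate loss and retransmits.
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                .usePlaintext()
                .build();
        // ... issue RPCs on the channel ...
        channel.shutdownNow();
    }
}
```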
We're running into a similar problem with the docker-swarm load balancer closing the connection after 10 minutes of inactivity, causing subsequent RPCs to hang for long periods of time (on the order of minutes) before failing with NO_ROUTE_TO_HOST and similar errors. It sounds like the proposed fix won't help with this problem, because there will be no keepalives while the connection is idle. Is that correct? Our current workaround is to write our own channel extending the NettyChannel that sends ping messages every X seconds. Is there an alternative way to deal with this issue in Java that would work out of the box? |
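For anyone finding this thread later: grpc-java eventually exposed an option to keep pinging even when the connection has no active RPCs, which covers exactly this idle-load-balancer case without a hand-rolled ping channel. A sketch, assuming a release that includes keepAliveWithoutCalls (note the server must be configured to permit pings without calls, or it may close the connection):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class IdleSafeChannel {
    public static void main(String[] args) {
        ManagedChannel channel = NettyChannelBuilder
                .forAddress("localhost", 50051)
                .keepAliveTime(5, TimeUnit.MINUTES)
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                // Keep pinging even with no open streams, so an idle
                // connection is not silently dropped by a load balancer.
                .keepAliveWithoutCalls(true)
                .usePlaintext()
                .build();
        // ... use the channel ...
        channel.shutdownNow();
    }
}
```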
The issue #2726 is probably related to this new feature. |
@ejona86 Would you please share the design doc with us? Also, I am wondering whether the client or the server should initiate the keepalive; which side is preferable? |
@smartwjw, see grpc/proposal#22 and grpc/proposal#23. Possibly both. If you need it to detect connection breakages, then it does need to be on both sides. It's fine to have both do keepalive; the keepalive from one will tend to count toward the keepalive for the other (so it doesn't really add overhead other than timer scheduling). Preventative keepalives tend to be best on the client, since different clients may need different settings. Detecting breakages is a bit different on each side: on the client it is to notice that the RPC isn't going to complete; on the server it is to clean up garbage. |
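To make the "both sides" point concrete, here is a sketch of server-initiated keepalive in grpc-java's Netty transport, assuming a release where NettyServerBuilder exposes these options (names and availability are version-dependent):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class KeepaliveServer {
    public static void main(String[] args) throws Exception {
        Server server = NettyServerBuilder.forPort(50051)
                // Server-initiated PINGs: detect unreachable ("zombie")
                // clients and release the resources they hold.
                .keepAliveTime(5, TimeUnit.MINUTES)
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                // Policy knobs: refuse clients that ping more aggressively
                // than this, addressing the DDoS concern discussed above.
                .permitKeepAliveTime(1, TimeUnit.MINUTES)
                .permitKeepAliveWithoutCalls(false)
                .build()
                .start();
        server.awaitTermination();
    }
}
```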
@ejona86 Thanks for your answer. I set the keepalive time to 10 seconds on both sides for testing, and used Wireshark to capture the traffic. |
I do not get why the server side should ping the client as well. From the client side, breakage can be detected by timed-out requests. |
@ZedYu, the server would ping the client so it can identify when the client has become unreachable and release resources associated with such "zombie" connections. |
Thanks @jhump for the answer. But an inactive client connection can be detected just by timing the intervals between the client's pings: if the server does not receive a ping within some time, the connection is broken. I really do not see an obvious advantage of bidirectional pings over client-only pings. |
@ZedYu, true, but not all clients will necessarily ping. That strategy would work only in a perfectly homogeneous environment where the server knew a priori how each client's ping interval was configured. |
Checking in here: is the keepalive proposal grpc/proposal#22 supported in C-based libraries yet? Running into connection resets in the Ruby client, and I believe it's caused by idle connections and load balancer disconnects. The keepalive proposal seems to be what I'm looking for. Just trying to understand the status, since this particular ticket is closed. |
@mikestanley, yes, there is some keepalive support in C-based libraries. That wouldn't be tracked in this grpc/grpc-java repository though. |
Thanks Eric. Any pointers to the config variables for keep alive in the C-based library? I know it's now off topic for this thread but I haven't had much luck finding much else on the subject. |
@mikestanley, I suggest asking on the [email protected] mailing list or creating an issue at the grpc/grpc repository. |
With our use of gRPC Java across Google Compute Engine (GCE) L3 Load Balancers (Network Load Balancers), we seem to be hitting issues similar to those we had with gRPC in Go: grpc/grpc-go#536

Basically, Google L3 load balancers silently drop long-lasting TCP connections after 600 seconds. While we were able to work around the issue in Go by specifying a custom Dialer, there seems to be no way of overriding the keepalive periods for NettyClientTransport. We know it's possible to set the keepalive period in the kernel of the machines, but it's a bit of a stretch to expect user-code programmers to know about that. Can we either:
cc @ejona86 since he seems to have had opinions about it in #737
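For the record, this ask was later addressed by exposing keepalive options on the channel builder (which configures the underlying NettyClientTransport). A sketch, assuming a grpc-java release with these options; the endpoint is a placeholder, and the keepalive time is chosen to stay well inside the load balancer's 600-second idle window:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class GceLbFriendlyChannel {
    public static void main(String[] args) {
        ManagedChannel channel = NettyChannelBuilder
                .forAddress("my-service.example.com", 443) // placeholder
                // Ping well inside the LB's 600s idle timeout so the TCP
                // connection is never silently dropped.
                .keepAliveTime(4, TimeUnit.MINUTES)
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                // Also ping while idle; otherwise an idle connection can
                // still be dropped between RPCs.
                .keepAliveWithoutCalls(true)
                .build();
        // ... use the channel ...
        channel.shutdownNow();
    }
}
```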