
Add KeepAlive support #1648

Closed
mwitkow opened this issue Apr 8, 2016 · 21 comments

mwitkow commented Apr 8, 2016

With our use of gRPC Java behind Google Compute Engine (GCE) L3 Load Balancers (Network Load Balancers), we seem to be hitting issues similar to the ones we had with gRPC in Go:
grpc/grpc-go#536

Basically, Google's L3 load balancers silently drop long-lived TCP connections after 600 seconds.

We were able to work around the issue in Go by specifying a custom Dialer:

// Assumes: import ("net"; "time"; "google.golang.org/grpc")
func WithKeepAliveDialer() grpc.DialOption {
    return grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
        // flagGrpcClientKeepAliveDuration is a *time.Duration flag defined elsewhere.
        d := net.Dialer{Timeout: timeout, KeepAlive: *flagGrpcClientKeepAliveDuration}
        return d.Dial("tcp", addr)
    })
}

There seems to be no way of overriding the KeepAlive periods for NettyClientTransport. We know it's possible to set the keep-alive period in the kernel of the machines, but it's a stretch to expect application programmers to know about that.

Can we either:

  • have the ability to specify the TCP keep-alive period when creating a channel, or
  • add documentation around it, especially about how it can cause hard-to-debug problems on GCE?
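For illustration, if grpc-java forwards a socket option to Netty's client bootstrap, enabling TCP keepalive at channel creation might look like the following. This is a sketch, not current API guidance: it assumes NettyChannelBuilder's withOption passes through to the underlying Netty bootstrap, and the probe timing would still come from kernel settings.

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

// Sketch: turn on SO_KEEPALIVE for the channel's TCP connections.
// The probe interval is still governed by the OS (e.g. Linux's
// net.ipv4.tcp_keepalive_time), which defaults to two hours.
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
    .withOption(ChannelOption.SO_KEEPALIVE, true)
    .build();
```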

cc @ejona86 since he seems to have had opinions about it in #737

ejona86 (Member) commented Apr 8, 2016

I'm currently trying to push forward a cross-language grpc spec for HTTP/2 PING-based keepalives and have already been investigating this pretty deeply. I'm hoping to get something public "soon." The design leaves open the possibility of supporting TCP keepalives in addition, but it wasn't going to be an initial focus.

Go's TCP Keepalive leaves something to be desired because it does not expose enough knobs to solve all that we'd want keepalive to solve, namely, detecting broken connections in a timely fashion. Java's TCP keepalive is much weaker in that you can only turn it on; you'd need to use sysctl or similar to change the OS's default settings (as actually documented for the GCE LB).
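For context, the JDK's entire TCP keepalive surface is a single boolean; everything else lives in OS settings:

```java
import java.net.Socket;
import java.net.SocketException;

public class JavaKeepAliveKnob {
    public static void main(String[] args) throws SocketException {
        Socket socket = new Socket(); // unconnected; options can still be set
        socket.setKeepAlive(true);    // the only keepalive knob Java exposes
        System.out.println(socket.getKeepAlive());
        // Probe timing (the two-hour default, probe interval, retry count)
        // can only be changed OS-wide, e.g. via sysctl on Linux.
    }
}
```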

mwitkow (Author) commented Apr 8, 2016

Please do. HTTP/2 PING as a health-check signal would be invaluable when implementing something like a DNS-SRV-based client-side load balancer.

Any ETA on these shenanigans? :>

lukaszx0 (Collaborator) commented Apr 8, 2016

@mwitkow FWIW, http/2 pings are already implemented in grpc-java.

We're just missing cross-language consensus on what exactly we want to do and how we want to do it.

ejona86 (Member) commented Apr 8, 2016

Here is an excerpt from the document I'm trying to get agreement on:

TCP keepalive is hard to configure in Java and Go. Enabling is easy, but one hour is far too infrequent to be useful; an application-level keepalive seems beneficial for configuration.

TCP keepalive is active even if there are no open streams. This wastes a substantial amount of battery on mobile; an application-level keepalive seems beneficial for optimization.

Application-level keepalive implies HTTP/2 PING. If we take a page from TCP keepalive’s book there are three parameters to tune: time (time since last receipt before sending a keepalive), interval (interval between keepalives when not receiving reply), and retry (number of times to retry sending keepalives). Interval and retry don’t quite apply to PING because the transport is reliable, so they will be replaced with timeout (equivalent to interval * retry), the time between sending a PING and not receiving any bytes to declare the connection dead.
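The parameter collapse described above (timeout replacing interval × retry) can be illustrated with a hypothetical sketch; the names here are invented for illustration, not grpc-java API:

```java
import java.time.Duration;

// Hypothetical mapping from TCP-style keepalive knobs (time, interval,
// retry) to the two PING-based parameters (time, timeout).
final class KeepAliveParams {
    final Duration time;    // idle time before a keepalive PING is sent
    final Duration timeout; // time after a PING with no bytes read before
                            // the connection is declared dead

    KeepAliveParams(Duration time, Duration timeout) {
        this.time = time;
        this.timeout = timeout;
    }

    // Because HTTP/2 runs on a reliable transport, (interval, retry)
    // collapses into a single timeout = interval * retry.
    static KeepAliveParams fromTcpStyle(Duration time, Duration interval, int retry) {
        return new KeepAliveParams(time, interval.multipliedBy(retry));
    }
}
```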

Doing some form of keepalive is relatively straightforward. But avoiding DDoS is not as easy. Thus, avoiding DDoS is the most important part of the design. To mitigate DDoS the design:

  • Disables keepalive for HTTP/2 connections with no outstanding streams, and
  • Enforces a lower limit to the keepalive delay, namely no less than one minute

Most RPCs are unary with quick replies, so keepalive is less likely to be triggered. It would primarily be triggered when there is a long-lived RPC.

Since keepalive is not occurring on HTTP/2 connections without any streams, there will be a higher chance of failure for new RPCs following a long period of inactivity. To reduce the tail latency for these RPCs, it is important to not reset the 'keepalive time' when a connection becomes active; if a new stream is created and there has been greater than 'keepalive time' since the last read byte, then a keepalive PING should be sent (ideally before the HEADERS frame). Doing so detects the broken connection with a latency of 'keepalive timeout' instead of 'keepalive time + timeout'.
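The "probe before HEADERS" rule reduces to one comparison at stream creation; a minimal sketch with invented names:

```java
// Sketch of the check described above: when a new stream is created on a
// connection that has been quiet for longer than 'keepalive time', probe
// first so a dead connection fails within 'keepalive timeout' rather than
// 'keepalive time + timeout'. Timestamps are monotonic nanoseconds.
final class KeepAlivePolicy {
    static boolean shouldPingBeforeNewStream(long nowNanos, long lastReadNanos,
                                             long keepAliveTimeNanos) {
        return nowNanos - lastReadNanos > keepAliveTimeNanos;
    }
}
```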

'keepalive time' is ideally measured from the time of the last byte read. However, simplistic implementations may choose to measure from the time of the last keepalive PING (aka, polling). Such implementations should take extra precautions to avoid issues due to latency added by outbound buffers, such as limiting the outbound buffer size and using a larger 'keepalive timeout'.

As an optional optimization, when 'keepalive timeout' is exceeded, don't kill the connection. Instead, start a new connection. If the new connection becomes ready and the old connection still hasn't received any bytes, then kill the old connection. If the old connection wins the race, then kill the new connection mid-startup.

The 'keepalive time' is expected to be an application-configurable option, with at least second precision. It is unspecified whether 'keepalive timeout' is application-configurable, but it should be at least multiple times the round-trip time to allow for lost packets and TCP retransmits. It may also need to be higher to account for long garbage collector pauses.

lukaszx0 (Collaborator) commented Apr 9, 2016

cc @jhump @ericzundel

@ejona86 ejona86 changed the title Add the ability to specify TCP KeepAlive periods to ClientTransport Add KeepAlive support Apr 19, 2016
@ejona86 ejona86 added this to the 1.0 milestone Apr 19, 2016
@ejona86 ejona86 self-assigned this Apr 19, 2016
@hsaliak hsaliak modified the milestones: 1.1, 1.0 Apr 26, 2016
@hsaliak hsaliak added the P1 label Apr 26, 2016
zsurocking (Contributor) commented

@makdharma

@carl-mastrangelo carl-mastrangelo modified the milestones: 1.2, 1.1 Jan 13, 2017

ziminer commented Feb 2, 2017

We're running into a similar problem with the docker-swarm load balancer closing the connection after 10 minutes of inactivity, causing subsequent RPCs to hang for long periods (on the order of minutes) before failing with NO_ROUTE_TO_HOST and similar errors. It sounds like the proposed fix won't help with this problem, because there will be no keepalives while the connection is idle. Is that correct?

Our current workaround for this is to write our own channel extending the NettyChannel that sends ping messages every X seconds. Is there an alternative way to deal with this issue in Java that would work out-of-the-box?
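For later readers: after #2366 landed, grpc-java's built-in keepalive can replace such a custom channel. A sketch follows; the values are illustrative, and the server must permit idle pings for this to work:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

// Built-in HTTP/2 PING keepalive. keepAliveWithoutCalls(true) keeps
// pinging even while the connection is idle, which is what defeats
// idle-timeout load balancers such as the one described above.
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
    .keepAliveTime(5, TimeUnit.MINUTES)      // idle time before a PING
    .keepAliveTimeout(20, TimeUnit.SECONDS)  // how long to wait for the ack
    .keepAliveWithoutCalls(true)             // also ping with no active RPCs
    .build();
```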

lukaszx0 (Collaborator) commented

@ejona86 I think we can close it? (#2366)


bobwenx commented Feb 15, 2017

Issue #2726 is probably related to this new feature.


m11y commented Oct 23, 2017

@ejona86 Would you please share the design doc with us? Also, I am wondering whether the client or the server should initiate the keepalive; which side is preferable?

ejona86 (Member) commented Oct 23, 2017

@smartwjw, see grpc/proposal#22 and grpc/proposal#23

Possibly both. If you need it to detect connection breakages, then it does need to be on both sides. It's fine to have both do keepalive; the keepalive from one will tend to count toward the keepalive for the other (so it doesn't really add overhead beyond timer scheduling). Preventative keepalives tend to be best on the client, since different clients may need different settings. Detecting breakages is a bit different on the two sides: on the client it is to notice the RPC isn't going to complete; on the server it is to clean up garbage.
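The server side of that split is configured separately; a sketch using NettyServerBuilder, with illustrative values:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Server-initiated keepalive notices vanished clients so their
// resources can be cleaned up; the permit* settings bound how
// aggressively clients themselves are allowed to ping.
Server server = NettyServerBuilder.forPort(8443)
    .keepAliveTime(30, TimeUnit.SECONDS)       // server pings quiet clients
    .keepAliveTimeout(10, TimeUnit.SECONDS)
    .permitKeepAliveTime(1, TimeUnit.MINUTES)  // reject faster client pings
    .permitKeepAliveWithoutCalls(false)        // require an active RPC
    .build();
```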


m11y commented Oct 24, 2017

@ejona86 Thanks for your answer.

I set the keepalive time to 10 seconds on both sides for testing and used Wireshark to capture PING activity between client and server. It shows that both the client and the server send a PING every 10s and don't delay their PING after receiving one from the other side.
[Wireshark capture screenshot]


m11y commented Oct 24, 2017

I changed the client's keepalive time to 15s, and now the PINGs always seem to be sent by the server. I think the behavior above was a concurrency/timing issue.

[Wireshark capture screenshot]


ZedYu commented Jan 30, 2018

I do not get why the server side should ping the client as well. From the client side, breakage can be detected by timed-out requests.

jhump (Member) commented Jan 30, 2018

@ZedYu, the server would ping the client so it can identify when the client has become unreachable and release resources associated with such "zombie" connections.


ZedYu commented Jan 30, 2018

Thanks @jhump for the answer. But an inactive client connection can be detected just by timing the intervals between client pings: if the server does not receive a ping within some time, the connection is broken. I really do not see an obvious advantage of bidirectional pings over client-only pings.

jhump (Member) commented Jan 30, 2018

@ZedYu, true, but not all clients will necessarily ping. That strategy would work only in a perfectly homogeneous environment where the server knew a priori how each client's ping interval is configured.

mikestanley commented

Checking in here: is the keepalive proposal grpc/proposal#22 supported in the C-based libraries yet? I'm running into connection resets in the Ruby client, and I believe they're caused by idle connections and load-balancer disconnects. The keepalive proposal seems to be what I'm looking for. Just trying to understand the status, since this particular ticket is closed.

ejona86 (Member) commented Apr 26, 2018

@mikestanley, yes, there is some keepalive support in C-based libraries. That wouldn't be tracked in this grpc/grpc-java repository though.


mikestanley commented Apr 27, 2018 via email

ejona86 (Member) commented Apr 27, 2018

@mikestanley, I suggest asking on the [email protected] mailing list or creating an issue at the grpc/grpc repository.

lock bot locked as resolved and limited conversation to collaborators Sep 28, 2018