
Add KeepAlive support #1648

Closed
mwitkow opened this issue Apr 8, 2016 · 21 comments

mwitkow commented Apr 8, 2016

With our use of gRPC Java behind Google Compute Engine (GCE) L3 Load Balancers (Network Load Balancers), we seem to be hitting issues similar to the ones we had with gRPC in Go:
grpc/grpc-go#536

Basically, Google's L3 load balancers silently drop long-lived TCP connections after 600 seconds.

We were able to work around the issue in Go by specifying a custom Dialer:

// Assumes: import ("net"; "time"; "google.golang.org/grpc")
func WithKeepAliveDialer() grpc.DialOption {
    return grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
        // flagGrpcClientKeepAliveDuration is a *time.Duration flag defined elsewhere.
        d := net.Dialer{Timeout: timeout, KeepAlive: *flagGrpcClientKeepAliveDuration}
        return d.Dial("tcp", addr)
    })
}

There seems to be no way of overriding the KeepAlive periods for NettyClientTransport. We know it's possible to set the keep-alive period in the kernel of the machines, but it's a stretch to expect application programmers to know about that.

Can we either:

  • have the ability to specify the TCP keep-alive period when creating a channel, or
  • add documentation around it, especially about how it can cause hard-to-debug problems on GCE?
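For illustration, if grpc-java forwards a socket option to Netty's client bootstrap, enabling TCP keepalive at channel creation might look like the following. This is a sketch, not current API guidance: it assumes NettyChannelBuilder's withOption passes through to the underlying Netty bootstrap, and the probe timing would still come from kernel settings.

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

// Sketch: turn on SO_KEEPALIVE for the channel's TCP connections.
// The probe interval is still governed by the OS (e.g. Linux's
// net.ipv4.tcp_keepalive_time), which defaults to two hours.
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
    .withOption(ChannelOption.SO_KEEPALIVE, true)
    .build();
```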

cc @ejona86 since he seems to have had opinions about it in #737

ejona86 (Member) commented Apr 8, 2016

I'm currently trying to push forward a cross-language grpc spec for HTTP/2 PING-based keepalives and have already been investigating this pretty deeply. I'm hoping to get something public "soon." The design leaves open the possibility of supporting TCP keepalives in addition, but it wasn't going to be an initial focus.

Go's TCP Keepalive leaves something to be desired because it does not expose enough knobs to solve all that we'd want keepalive to solve, namely, detecting broken connections in a timely fashion. Java's TCP keepalive is much weaker in that you can only turn it on; you'd need to use sysctl or similar to change the OS's default settings (as actually documented for the GCE LB).
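For context, the JDK's entire TCP keepalive surface is a single boolean; everything else lives in OS settings:

```java
import java.net.Socket;
import java.net.SocketException;

public class JavaKeepAliveKnob {
    public static void main(String[] args) throws SocketException {
        Socket socket = new Socket(); // unconnected; options can still be set
        socket.setKeepAlive(true);    // the only keepalive knob Java exposes
        System.out.println(socket.getKeepAlive());
        // Probe timing (the two-hour default, probe interval, retry count)
        // can only be changed OS-wide, e.g. via sysctl on Linux.
    }
}
```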

mwitkow (Author) commented Apr 8, 2016

Please do. HTTP/2 PING as a health-check signal would be invaluable when implementing something like a DNS-SRV-based client-side load balancer.

Any ETA on these shenanigans? :>

lukaszx0 (Collaborator) commented Apr 8, 2016

@mwitkow FWIW, http/2 pings are already implemented in grpc-java.

We're just missing cross-language consensus on what exactly we want to do and how we want to do it.

ejona86 (Member) commented Apr 8, 2016

Here is an excerpt from the document I'm trying to get agreement on:

TCP keepalive is hard to configure in Java and Go. Enabling is easy, but one hour is far too infrequent to be useful; an application-level keepalive seems beneficial for configuration.

TCP keepalive is active even if there are no open streams. This wastes a substantial amount of battery on mobile; an application-level keepalive seems beneficial for optimization.

Application-level keepalive implies HTTP/2 PING. If we take a page from TCP keepalive’s book there are three parameters to tune: time (time since last receipt before sending a keepalive), interval (interval between keepalives when not receiving reply), and retry (number of times to retry sending keepalives). Interval and retry don’t quite apply to PING because the transport is reliable, so they will be replaced with timeout (equivalent to interval * retry), the time between sending a PING and not receiving any bytes to declare the connection dead.
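The parameter collapse described above (timeout replacing interval × retry) can be illustrated with a hypothetical sketch; the names here are invented for illustration, not grpc-java API:

```java
import java.time.Duration;

// Hypothetical mapping from TCP-style keepalive knobs (time, interval,
// retry) to the two PING-based parameters (time, timeout).
final class KeepAliveParams {
    final Duration time;    // idle time before a keepalive PING is sent
    final Duration timeout; // time after a PING with no bytes read before
                            // the connection is declared dead

    KeepAliveParams(Duration time, Duration timeout) {
        this.time = time;
        this.timeout = timeout;
    }

    // Because HTTP/2 runs on a reliable transport, (interval, retry)
    // collapses into a single timeout = interval * retry.
    static KeepAliveParams fromTcpStyle(Duration time, Duration interval, int retry) {
        return new KeepAliveParams(time, interval.multipliedBy(retry));
    }
}
```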

Doing some form of keepalive is relatively straightforward. But avoiding DDoS is not as easy. Thus, avoiding DDoS is the most important part of the design. To mitigate DDoS the design:

  • Disables keepalive for HTTP/2 connections with no outstanding streams, and
  • Enforces a lower limit to the keepalive delay, namely no less than one minute

Most RPCs are unary with quick replies, so keepalive is less likely to be triggered. It would primarily be triggered when there is a long-lived RPC.

Since keepalive is not occurring on HTTP/2 connections without any streams, there will be a higher chance of failure for new RPCs following a long period of inactivity. To reduce the tail latency for these RPCs, it is important to not reset the 'keepalive time' when a connection becomes active; if a new stream is created and there has been greater than 'keepalive time' since the last read byte, then a keepalive PING should be sent (ideally before the HEADERS frame). Doing so detects the broken connection with a latency of 'keepalive timeout' instead of 'keepalive time + timeout'.
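The "probe before HEADERS" rule reduces to one comparison at stream creation; a minimal sketch with invented names:

```java
// Sketch of the check described above: when a new stream is created on a
// connection that has been quiet for longer than 'keepalive time', probe
// first so a dead connection fails within 'keepalive timeout' rather than
// 'keepalive time + timeout'. Timestamps are monotonic nanoseconds.
final class KeepAlivePolicy {
    static boolean shouldPingBeforeNewStream(long nowNanos, long lastReadNanos,
                                             long keepAliveTimeNanos) {
        return nowNanos - lastReadNanos > keepAliveTimeNanos;
    }
}
```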

'keepalive time' is ideally measured from the time of the last byte read. However, simplistic implementations may choose to measure from the time of the last keepalive PING (aka, polling). Such implementations should take extra precautions to avoid issues due to latency added by outbound buffers, such as limiting the outbound buffer size and using a larger 'keepalive timeout'.

As an optional optimization, when 'keepalive timeout' is exceeded, don't kill the connection. Instead, start a new connection. If the new connection becomes ready and the old connection still hasn't received any bytes, then kill the old connection. If the old connection wins the race, then kill the new connection mid-startup.

The 'keepalive time' is expected to be an application-configurable option, with at least second precision. It is unspecified whether 'keepalive timeout' is application-configurable, but it should be at least multiple times the round-trip time to allow for lost packets and TCP retransmits. It may also need to be higher to account for long garbage collector pauses.

lukaszx0 (Collaborator) commented Apr 9, 2016

cc @jhump @ericzundel

@ejona86 ejona86 changed the title Add the ability to specify TCP KeepAlive periods to ClientTransport Add KeepAlive support Apr 19, 2016
@ejona86 ejona86 added this to the 1.0 milestone Apr 19, 2016
@ejona86 ejona86 self-assigned this Apr 19, 2016
@hsaliak hsaliak modified the milestones: 1.1, 1.0 Apr 26, 2016
@hsaliak hsaliak added the P1 label Apr 26, 2016
zsurocking (Contributor) commented

@makdharma

@carl-mastrangelo carl-mastrangelo modified the milestones: 1.2, 1.1 Jan 13, 2017

ziminer commented Feb 2, 2017

We're running into a similar problem with the docker-swarm load balancer closing the connection after 10 minutes of inactivity, causing subsequent RPCs to hang for long periods (on the order of minutes) before failing with NO_ROUTE_TO_HOST and similar errors. It sounds like the proposed fix won't help with this problem, because there will be no keepalives while the connection is idle. Is that correct?

Our current workaround for this is to write our own channel extending the NettyChannel that sends ping messages every X seconds. Is there an alternative way to deal with this issue in Java that would work out-of-the-box?
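For later readers: after #2366 landed, grpc-java's built-in keepalive can replace such a custom channel. A sketch follows; the values are illustrative, and the server must permit idle pings for this to work:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

// Built-in HTTP/2 PING keepalive. keepAliveWithoutCalls(true) keeps
// pinging even while the connection is idle, which is what defeats
// idle-timeout load balancers such as the one described above.
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
    .keepAliveTime(5, TimeUnit.MINUTES)      // idle time before a PING
    .keepAliveTimeout(20, TimeUnit.SECONDS)  // how long to wait for the ack
    .keepAliveWithoutCalls(true)             // also ping with no active RPCs
    .build();
```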

lukaszx0 (Collaborator) commented

@ejona86 I think we can close it? (#2366)


bobwenx commented Feb 15, 2017

Issue #2726 is probably related to this new feature.


m11y commented Oct 23, 2017

@ejona86 Would you please share the design doc with us? Also, I am wondering whether the client or the server should initiate the keepalive; which side is preferable?

ejona86 (Member) commented Oct 23, 2017

@smartwjw, see grpc/proposal#22 and grpc/proposal#23

Possibly both. If you need it to detect connection breakages, then it does need to be on both sides. It's fine to have both do keepalive; the keepalive from one will tend to count toward the keepalive for the other (so it doesn't really add overhead beyond timer scheduling). Preventative keepalives tend to be best on the client, since different clients may need different settings. Detecting breakages is a bit different on the two sides: on the client it is to notice the RPC isn't going to complete; on the server it is to clean up garbage.
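The server side of that split is configured separately; a sketch using NettyServerBuilder, with illustrative values:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Server-initiated keepalive notices vanished clients so their
// resources can be cleaned up; the permit* settings bound how
// aggressively clients themselves are allowed to ping.
Server server = NettyServerBuilder.forPort(8443)
    .keepAliveTime(30, TimeUnit.SECONDS)       // server pings quiet clients
    .keepAliveTimeout(10, TimeUnit.SECONDS)
    .permitKeepAliveTime(1, TimeUnit.MINUTES)  // reject faster client pings
    .permitKeepAliveWithoutCalls(false)        // require an active RPC
    .build();
```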


m11y commented Oct 24, 2017

@ejona86 Thanks for your answer.

I set the keepalive time to 10 seconds on both sides for testing and used Wireshark to capture PING activity between client and server. It shows that both the client and the server send a PING every 10s and don't delay their PING after receiving one from the other side.
[Wireshark capture screenshot]


m11y commented Oct 24, 2017

I changed the client's keepalive time to 15s, and now the PINGs always seem to be sent by the server. I think the behavior above was a concurrency/timing issue.

[Wireshark capture screenshot]


ZedYu commented Jan 30, 2018

I do not get why the server side should ping the client as well. From the client side, breakage can be detected by timed-out requests.

jhump (Member) commented Jan 30, 2018

@ZedYu, the server would ping the client so it can identify when the client has become unreachable and release resources associated with such "zombie" connections.


ZedYu commented Jan 30, 2018

Thanks @jhump for the answer. But an inactive client connection can be detected just by timing the intervals between client pings: if the server does not receive a ping within some time, the connection is broken. I really do not see an obvious advantage of bidirectional pings over client-only pings.

jhump (Member) commented Jan 30, 2018

@ZedYu, true, but not all clients will necessarily ping. That strategy would work only in a perfectly homogeneous environment where the server knew a priori how each client's ping interval is configured.

mikestanley commented

Checking in here: is the keepalive proposal grpc/proposal#22 supported in the C-based libraries yet? I'm running into connection resets in the Ruby client, and I believe they're caused by idle connections and load-balancer disconnects. The keepalive proposal seems to be what I'm looking for. Just trying to understand the status, since this particular ticket is closed.

ejona86 (Member) commented Apr 26, 2018

@mikestanley, yes, there is some keepalive support in C-based libraries. That wouldn't be tracked in this grpc/grpc-java repository though.


mikestanley commented Apr 27, 2018 via email

ejona86 (Member) commented Apr 27, 2018

@mikestanley, I suggest asking on the [email protected] mailing list or creating an issue at the grpc/grpc repository.

lock bot locked as resolved and limited conversation to collaborators Sep 28, 2018