-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: reconsider gRPC server-side keepalive timeout #109317
Comments
109186: pkg/util/log: flush buffered network sinks on panic r=knz a=abarganier Previously, our crash reporter system would flush file log sinks as part of the process to handle a panic. This was an incomplete process, since buffered network sinks were not included in part of this flush process. This means that many times, panic logs would not make it to the network target, leading to a loss in observability. This patch introduces `log.FlushAllSync()`, which flushes both file and buffered network log sinks. It then updates the crash reporter to call into this, instead of just flushing file log sinks. `FlushAllSync()` contains timeout logic to prevent the process from completing if one of the underlying child sinks that a bufferedSink wraps becomes unavailable/hangs on its `output()` call. We originally attempted to fix this in #101562, but a bug in the bufferedSink code led us to roll back those changes. The bug in the bufferedSink code has since been fixed (#108928), so we can safely introduce this logic again. Release note: none Fixes: #106345 109578: rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout r=andrewbaptist a=erikgrinaker This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc). --- This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket. While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious closed connections. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer. Touches #109317. Epic: none Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT. 109610: kv: remove assertions around non-txn'al locking reqs r=nvanbenschoten a=nvanbenschoten Closes #107860. Closes #109222. Closes #109581. Closes #109582. We might want to re-introduce these assertions in the future and reject these requests higher up the stack. For now, just remove them to deflake tests. Release note: None Co-authored-by: Alex Barganier <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
Turns out the timeout is also used for gRPC's own keepalive pings, which unlike TCP_USER_TIMEOUT is subject to scheduling latency and such. This explains why we saw this in cases such as #108157. |
We increased the server-side timeout in #109578 (2 to 4 seconds) and the keepalive ping interval in #109715 (1 to 2 seconds). However, we see a possible regression in Consider a client C sending a request to the partitioned leaseholder L. C will normally have to wait out the RPC heartbeat timeout (6 seconds) before failing and trying someone else. However, L's server-side timeout would have fired after 3 seconds (1s keepalive interval + 2s server-side timeout), closing the connection and forcing the client elsewhere. It can do this since the outbound TCP RST packets can still make it through (only inbound packets C->L are dropped). Thus, increasing the server-side timeout will increase the time until the TCP RST is sent, and the client will have to fall back to the RPC heartbeat timeout, increasing the overall recovery time. Considering the timeout is sensitive to node overload and Go runtime scheduling latencies and such, and this kind of asymmetric partition is relatively rare, this seems like a justified tradeoff. |
109715: rpc: increase gRPC server keepalive interval to 2x PingInterval r=erikgrinaker a=erikgrinaker This patch increases the gRPC server keepalive interval from 1x to 2x PingInterval. These keepalive pings are used both to keep the connection alive, and to detect and close failed connections from the server-side. Keepalive pings are only sent and checked if there is no other activity on the connection. We set this to 2x PingInterval, since we expect RPC heartbeats to be sent regularly (obviating the need for the keepalive ping), and there's no point sending keepalive pings until we've waited long enough for the RPC heartbeat to show up. An environment variable COCKROACH_RPC_SERVER_KEEPALIVE_INTERVAL is also added to tune this. Touches #109317. Epic: none Release note: None Co-authored-by: Erik Grinaker <[email protected]>
The gRPC server-side keepalive timeout defaults to 2 seconds, via
COCKROACH_NETWORK_TIMEOUT
. This setsTCP_USER_TIMEOUT
on the socket, instructing the OS to close connections if a send remains unacknowledged by the peer for 2 seconds.We would ideally want this on the gRPC client side instead, since that would more quickly detect network outages and allow the client to fail over to a different node, reducing the duration of unavailability. Unfortunately, gRPC does not allow client-side keepalive timeouts below 10 seconds. Instead, we use application-level client-side RPC heartbeats, but unlike
TCP_USER_TIMEOUT
these are subject to head-of-line blocking in the RPC stack (including effects like scheduling latency). We therefore have to use a large RPC heartbeat timeout of 6 seconds to tolerate heavy loads in high-latency clusters (see #93397).Given this, it isn't clear that an aggressive server-side gRPC timeout provides much benefit. It doesn't help in the case of a network outage -- the server is unlikely to care about the connection remaining open, and the client has to wait for its own timeout to fire. It would more quickly detect asymmetric partitions, allowing the dialback mechanism to promote it to a symmetric partition (see #94778), but this is likely rare enough that a higher timeout is ok. Under persistently high latency it might be helpful to eagerly close the connection and let the client fail over to a lower-latency node, but this assumes the lease also moves and that the client won't block and re-dial the node (mostly true in 23.2 with #99191). It can also be detrimental since it can cause additional disruption with transient latency spikes like intermittent packet loss.
We should reconsider the value for the server-side keepalive timeout, subject to availability benchmarks (especially with asymmetric partitions and latency spikes).
See also #93007.
cockroach/pkg/rpc/keepalive.go
Lines 26 to 47 in e97fe41
Jira issue: CRDB-30871
The text was updated successfully, but these errors were encountered: