base: reduce network timeouts #92542
Conversation
Force-pushed from 659cab3 to 71fd9eb.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @nvanbenschoten)
pkg/base/config.go
line 57 at r1 (raw file):
// variance), with a lower bound of 200ms, so the worst-case RTO is 600ms. 2s
// should therefore be sufficient under most normal conditions.
// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
Maybe add something similar for AWS, here's the p99 over one week map:
https://www.cloudping.co/grid/p_99/timeframe/1W
Looks like the numbers come out to around the same, between Sao Paulo and Singapore.
pkg/base/config.go
line 58 at r1 (raw file):
// should therefore be sufficient under most normal conditions.
// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
NetworkTimeout = 2 * time.Second
You will need to update this one too:
Line 1765 in 2d79db8: `MinConnectTimeout: minConnectionTimeout}))`
Otherwise, a "black hole" dial will still take 5s. Probably `NetworkTimeout` should "just" reference `rpc.MinConnectionTimeout`.
I'm not sure if the latter also covers the TLS handshake; I don't think so, but I'm not certain. Would be good to clarify which is which.
In terms of authority on TLS roundtrips, I found https://www.gnutls.org/manual/html_node/Reducing-round_002dtrips.html useful.
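For illustration, a minimal sketch of the direction suggested in this comment, with stand-in names rather than the actual CockroachDB source (`networkTimeout` below stands in for `base.NetworkTimeout`):

```go
package rpc

import "time"

// Stand-in for base.NetworkTimeout (2s in this PR).
var networkTimeout = 2 * time.Second

// minConnectionTimeout bounds how long a gRPC dial may take. Deriving
// it from the canonical network timeout keeps the two knobs from
// silently drifting apart, so a "black hole" dial fails within the
// same budget as other network operations instead of gRPC's separate
// 5s default.
var minConnectionTimeout = networkTimeout
```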
Force-pushed from 71fd9eb to e322aba.
I also added an environment variable `COCKROACH_NETWORK_TIMEOUT` to control this, even though 2s should be sufficient for most users. FWIW, 20.2 accidentally dropped the gRPC connection timeout to the default of 1s, and we only had reports from a single user that this was problematic (they had >500ms RTTs due to a network topology that routed Japan to South America via Europe).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)
pkg/base/config.go
line 57 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe add something similar for AWS, here's the p99 over one week map:
https://www.cloudping.co/grid/p_99/timeframe/1W
Looks like the numbers come out to around the same, between Sao Paulo and Singapore.
Yep, added this.
pkg/base/config.go
line 58 at r1 (raw file):
Good catch, thanks. I updated `minConnectionTimeout` to reference `NetworkTimeout`, since that's the canonical setting.
Previously, tbg (Tobias Grieger) wrote…
I'm not sure if the latter also covers the TLS handshake; I don't think so, but I'm not certain. Would be good to clarify which is which.
It possibly does. I added some analysis on this.
Force-pushed from e322aba to acb8c67.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @nvanbenschoten)
The gRPC dial timeout is also reduced to the network timeout (from 5 seconds to 2 seconds).
Does this include the TLS handshake?
2 seconds is very short, especially for geo-distributed clusters. With the SYN, SYN-ACK, ACK exchange before the TCP connection is established, plus three back-and-forths for the TLS handshake, at a 250ms roundtrip you're already at the limit.
I'd be interested in @bdarnell's opinion on this too.
I'm not sure, but I believe it might. The TCP+TLS handshake is 3 RTTs. Cloud providers have a max nominal RTT latency between regions of ~350ms, so if we assume 400ms RTT we still have time both for the TCP+TLS handshake and a single packet loss (600ms RTO): 3*400ms + 600ms = 1800ms. FWIW, we accidentally ran with a 1 second gRPC dial timeout in 20.2, 21.1, and 21.2, with only a single reported problem because of it (see #71417).
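To make that budget arithmetic concrete, here is a small self-contained calculation; the 3-RTT handshake count, 400ms RTT, and 600ms RTO are the assumptions stated in the comment above, not measured values:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	rtt := 400 * time.Millisecond // assumed worst-case inter-region RTT
	const handshakeRTTs = 3       // TCP handshake (1 RTT) + TLS handshake (2 RTTs)
	rto := 600 * time.Millisecond // worst-case retransmit timeout from the comment

	// Budget for a dial that also survives a single dropped packet.
	budget := handshakeRTTs*rtt + rto
	fmt.Println(budget) // prints 1.8s, just inside the proposed 2s NetworkTimeout
}
```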
Confirmed, the timeout includes the TLS handshake. I'll run some further tests to verify when I get a chance (including a fake high-latency network and inspecting the network traffic).
A more conservative option here might be to keep the network timeout at 3 seconds (like today), but apply it to the RPC heartbeats (from 6 to 3 seconds with 1.5 second interval) and the gRPC dial timeout (from 5 to 3 seconds). I still think 2 seconds should be fine for most scenarios though.
Any timeout that is long enough to allow for a full TCP+TLS handshake is going to be much longer than necessary for detecting a failure on an established connection. I know this is easier said than done, but the ideal answer would be to replace the undifferentiated `NetworkTimeout` with separate timeouts for connection setup and for failure detection on established connections.
In the meantime, I'm tentatively OK with moving to 2s by default, especially since we'd be introducing a new configuration knob in case it's not working.
Yeah, I've been thinking along the same lines. We could drop […]
I wouldn't even necessarily require room for a retransmit. Retransmit timeouts are pretty high by modern within-region standards (e.g. you mentioned a 200ms minimum above). It could be worthwhile to try another replica after, say, 100ms even though there's a chance a packet was dropped and hasn't been retransmitted yet. (But we shouldn't retry the same replica more than once faster than the retransmission timeout.)
Possibly, but I feel like we're already pretty aggressively lowering the timeouts here. I think I'd want to hold back a little and not go all in on this just yet, since it can be pretty fragile under latency fluctuations. We can consider tightening it further later, and even then we might want to try it out in CC for a while first. But the RTT factoring seems reasonable, I'll give that a try (i.e. set […]).
`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts. Release note: None
Force-pushed from acb8c67 to a00caa2.
Force-pushed from a00caa2 to 715ae21.
Force-pushed from 715ae21 to 5432207.
I've restructured this to use […]. As a consequence of these changes, we've reduced the dial timeout from 5 to 2 seconds, the gRPC keepalive interval/timeout from 3 to 1 second, and the RPC heartbeat interval/timeout from 3/6 seconds to 1 second. This is rather tight, so I'm open to bumping […].
Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s. Release note: None
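A minimal sketch of what the unified values could look like; the constant names are illustrative, and only the 3s interval/timeout equality comes from the commit message above:

```go
package rpc

import "time"

// Sketch only: illustrative names, not the actual CockroachDB source.
const (
	// The heartbeat interval and timeout are unified at 3s; previously
	// the timeout (6s) was twice the interval (3s).
	defaultRPCHeartbeatInterval = 3 * time.Second
	defaultRPCHeartbeatTimeout  = defaultRPCHeartbeatInterval
)
```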
This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout` to account for the additional roundtrips in TCP + TLS handshakes. Release note: None
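As a sketch of the constant this commit describes (the stand-in `NetworkTimeout` initializer and the doc comment are paraphrases of the commit message, not the actual source):

```go
package base

import "time"

// Stand-in for the package's canonical network roundtrip timeout.
var NetworkTimeout = 2 * time.Second

// DialTimeout is the timeout for dialing a peer. Establishing a
// connection costs extra roundtrips (TCP handshake plus TLS handshake)
// on top of the single roundtrip an established connection needs, so
// it is budgeted at twice NetworkTimeout.
var DialTimeout = 2 * NetworkTimeout
```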
Force-pushed from 5432207 to 0f9d344.
Had a quick look at our CC clusters in CentMon, and the worst p99 RPC heartbeat latency in any cluster over the past 90 days was 557 ms. This was on a single-region cluster in the US, and the high latency appeared to be due to CPU overload/throttling: 5 nodes with 2 vCPUs where several nodes were running close to 100% CPU, beyond the Kubernetes resource requests. So I don't think we ever want to drop this below 1 second, to make sure we're tolerant of node overload as well. Updated the […].
Force-pushed from 0f9d344 to 6286ad0.
I wrote up a roachtest to measure the pMax latency during leaseholder loss with a network blackhole as well (#92991), and the results are promising. These sampled 9 outages each, and show a fair bit of variance, but it's a clear improvement nonetheless.
Notably, the marginal gains of a 1 second vs. 2 second network timeout are fairly low, and may not be worth the instability risk, at least not for this pass.
Force-pushed from 6286ad0 to e5e5913.
This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.
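A plausible wiring for the new environment variable, assuming CockroachDB's existing `envutil.EnvOrDefaultDuration` helper; whether the PR hooks it up exactly this way is an assumption:

```go
package base

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/envutil"
)

// NetworkTimeout is the timeout for network roundtrips between nodes,
// overridable via COCKROACH_NETWORK_TIMEOUT (default 2s).
var NetworkTimeout = envutil.EnvOrDefaultDuration("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)
```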
Force-pushed from e5e5913 to 65a2bc3.
Build succeeded.
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough. However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494. Touches #92542. Touches #93397.

Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <[email protected]>
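A sketch of the split that follow-up describes, again assuming the `envutil` helper; the name `PingInterval` and the 1s/6s values are from the PR text, the rest is illustrative:

```go
package base

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/envutil"
)

// PingInterval is how often RPC heartbeats and gRPC keepalive pings
// are sent (default 1s, overridable via COCKROACH_PING_INTERVAL).
// Pinging frequently while keeping the long 6s heartbeat timeout
// detects failures faster without tripping over head-of-line blocking
// under heavy RPC traffic.
var PingInterval = envutil.EnvOrDefaultDuration("COCKROACH_PING_INTERVAL", time.Second)

// The heartbeat timeout itself stays at the previous 6s value.
const defaultRPCHeartbeatTimeout = 6 * time.Second
```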
*: don't use `NetworkTimeout` where inappropriate

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts.

Release note: None
rpc: unify heartbeat interval and timeout
Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s.
Release note: None
base: add `DialTimeout`

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout`, to account for the additional roundtrips in TCP + TLS handshakes.

base: reduce network timeouts
This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).
When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.
An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.
Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.