base: reduce network timeouts
This patch reduces the network timeout from 3 seconds to 2 seconds. This
change also affects gRPC keepalive intervals/timeouts.

Furthermore, the RPC heartbeat interval is now reduced to half of the
network timeout (from 3 seconds to 1 second), with a timeout equal to
the network timeout (from 6 seconds to 2 seconds). The gRPC dial timeout
is also reduced to the network timeout (from 5 seconds to 2 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC
calls (and thus critical operations such as lease acquisitions) will be
retried against a different node. Reducing them therefore improves
recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced
to tweak this timeout if needed.

Release note (ops change): The network timeout for RPC connections
between cluster nodes has been reduced from 3 seconds to 2 seconds, in
order to reduce unavailability and tail latencies during infrastructure
outages. This can now be changed via the environment variable
`COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.
erikgrinaker committed Nov 28, 2022
1 parent 1a6e9f8 commit acb8c67
Showing 2 changed files with 27 additions and 8 deletions.
33 changes: 26 additions & 7 deletions pkg/base/config.go
@@ -47,9 +47,6 @@ const (
defaultSQLAddr = ":" + DefaultPort
defaultHTTPAddr = ":" + DefaultHTTPPort

- // NetworkTimeout is the timeout used for network operations.
- NetworkTimeout = 3 * time.Second
-
// defaultRaftTickInterval is the default resolution of the Raft timer.
defaultRaftTickInterval = 200 * time.Millisecond

@@ -66,10 +63,6 @@ const (
// each heartbeat.
defaultRaftHeartbeatIntervalTicks = 5

- // defaultRPCHeartbeatInterval is the default value of RPCHeartbeatIntervalAndHalfTimeout
- // used by the rpc context.
- defaultRPCHeartbeatInterval = 3 * time.Second
-
// defaultRangeLeaseRenewalFraction specifies what fraction the range lease
// renewal duration should be of the range lease active time. For example,
// with a value of 0.2 and a lease duration of 10 seconds, leases would be
@@ -118,6 +111,32 @@ func DefaultHistogramWindowInterval() time.Duration {
}

var (
+ // NetworkTimeout is the timeout used for network operations.
+ //
+ // The maximum RTT between cloud regions is roughly 350ms both in GCP
+ // (asia-south2 to southamerica-west1) and AWS (af-south-1 to sa-east-1). It
+ // can occasionally be up to 500ms, but 400ms is a reasonable upper bound
+ // under nominal conditions.
+ // https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
+ // https://www.cloudping.co/grid/p_99/timeframe/1W
+ //
+ // Linux has an RTT-dependent retransmission timeout (RTO) which we can
+ // approximate as 1.5x RTT (smoothed RTT + 4x RTT variance), with a lower
+ // bound of 200ms, so the worst-case RTO is 750ms. A round trip can thus
+ // take 1.25s if a single packet is lost (750ms retransmit + 500ms RTT).
+ //
+ // Initial connection attempts can take 3 RTTs (TCP + TLS). On a high-latency
+ // link with 500ms RTT, a single lost packet will thus cause the connection to
+ // fail (1.5s + 750ms > 2s), but it will succeed under nominal high latencies
+ // of 400ms RTT (1.2s + 600ms < 2s). Failed connections will also be retried.
+ NetworkTimeout = envutil.EnvOrDefaultDuration("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)
+
+ // defaultRPCHeartbeatInterval is the default value of
+ // RPCHeartbeatIntervalAndHalfTimeout used by the RPC context. The heartbeat
+ // timeout is twice this value, and we want that to be equivalent to
+ // NetworkTimeout to quickly detect peer unavailability.
+ defaultRPCHeartbeatInterval = NetworkTimeout / 2
+
// defaultRaftElectionTimeoutTicks specifies the number of Raft Tick
// invocations that must pass between elections.
defaultRaftElectionTimeoutTicks = envutil.EnvOrDefaultInt(
2 changes: 1 addition & 1 deletion pkg/rpc/context.go
@@ -102,7 +102,7 @@ var (
)

// GRPC Dialer connection timeout.
- var minConnectionTimeout = 5 * time.Second
+ var minConnectionTimeout = base.NetworkTimeout

// errDialRejected is returned from client interceptors when the server's
// stopper is quiescing. The error is constructed to return true in