Skip to content

Commit

Permalink
rpc: tweak heartbeat intervals and timeouts
Browse files Browse the repository at this point in the history
The RPC heartbeat interval and timeout were recently reduced to 2
seconds (`base.NetworkTimeout`), with the assumption that heartbeats
require a single network roundtrip and 2 seconds would therefore be
more than enough.

However, high-latency experiments showed that clusters under TPCC import
load were very unstable even with a relatively moderate 400ms RTT,
showing frequent RPC heartbeat timeouts because RPC `Ping` requests are
head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the
previous 6 second value, which is stable under TPCC import load with
400ms RTT, but struggles under 500ms RTT (which is also the case for
22.2). However, the RPC heartbeat interval and gRPC keepalive ping
intervals have been split out to a separate setting `PingInterval`
(`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail
faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time
during network outages from 9.7 seconds to 14.0 seconds (as measured by
the `failover/non-system/blackhole` roachtest), but that's still better
than the 18.1 seconds in 22.2.

Epic: none
Release note (ops change): The RPC heartbeat and gRPC keepalive ping
intervals have been reduced to 1 second, to detect failures faster. This
is adjustable via the new `COCKROACH_PING_INTERVAL` environment
variable. The timeouts remain unchanged.
  • Loading branch information
erikgrinaker committed Dec 11, 2022
1 parent c050c9b commit e97fe41
Show file tree
Hide file tree
Showing 6 changed files with 74 additions and 45 deletions.
77 changes: 48 additions & 29 deletions pkg/base/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -113,41 +113,57 @@ func DefaultHistogramWindowInterval() time.Duration {
var (
// NetworkTimeout is the timeout used for network operations that require a
// single network round trip. It is conservatively defined as one maximum
// network round trip time (RTT) plus one TCP packet retransmit (RTO), then
// multiplied by 2 as a safety margin.
// network round trip time (RTT) plus one TCP packet retransmit (RTO) with an
// additional safety margin.
//
// The maximum RTT between cloud regions is roughly 350ms both in GCP
// (asia-south2 to southamerica-west1) and AWS (af-south-1 to sa-east-1). It
// can occasionally be up to 500ms, but 400ms is a reasonable upper bound
// under nominal conditions.
// The maximum RTT between cloud regions is roughly 400ms both in GCP
// (asia-south2 to southamerica-west1) and AWS (af-south-1 to sa-east-1), but
// p99 RTTs can occasionally approach 600ms.
// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
// https://www.cloudping.co/grid/p_99/timeframe/1W
//
// Linux has an RTT-dependant retransmission timeout (RTO) which we can
// approximate as 1.5x RTT (smoothed RTT + 4x RTT variance), with a lower
// bound of 200ms. Under nominal conditions, this is approximately 600ms.
// bound of 200ms. It can thus be up to 900ms in the worst case.
//
// The maximum p99 RPC heartbeat latency in any Cockroach Cloud cluster over a
// 90-day period was 557ms. This was a single-region US cluster, where the
// high latency appeared to be due to CPU overload or throttling: the cluster
// had 2 vCPU nodes running at 100%.
//
// The NetworkTimeout is thus set to 2 * (400ms + 600ms) = 2s.
//
// TODO(erikgrinaker): Consider reducing this to 1 second, which should be
// sufficient but may be fragile under latency fluctuations.
// The NetworkTimeout is therefore set to 2 seconds: 600ms RTT plus 900ms RTO
// plus a 500ms safety margin.
NetworkTimeout = envutil.EnvOrDefaultDuration("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)

// DialTimeout is the timeout used when dialing a node. gRPC connections take
// up to 3 roundtrips for the TCP + TLS handshakes. Because NetworkTimeout
// allows for both a network roundtrip (RTT) and a TCP retransmit (RTO), with
// the RTO being greater than the RTT, and we don't need to tolerate more than
// 1 retransmit per connection attempt, 2 * NetworkTimeout is sufficient.
// up to 3 roundtrips for the TCP + TLS handshakes. NetworkTimeout allows for
// both a network roundtrip and a TCP retransmit, but we don't need to
// tolerate more than 1 retransmit per connection attempt, so
// 2 * NetworkTimeout is sufficient.
DialTimeout = 2 * NetworkTimeout

// defaultRPCHeartbeatIntervalAndTimeout is the default value of
// RPCHeartbeatIntervalAndTimeout used by the RPC context.
defaultRPCHeartbeatIntervalAndTimeout = NetworkTimeout
// PingInterval is the interval between network heartbeat pings. It is used
// both for RPC heartbeat intervals and gRPC server keepalive pings. It is
// set to 1 second in order to fail fast, but with large default timeouts
// to tolerate high-latency multiregion clusters.
PingInterval = envutil.EnvOrDefaultDuration("COCKROACH_PING_INTERVAL", time.Second)

// defaultRPCHeartbeatTimeout is the default RPC heartbeat timeout. It is set
// very high at 3 * NetworkTimeout for several reasons: the gRPC transport may
// need to complete a dial/handshake before sending the heartbeat, the
// heartbeat has occasionally been seen to require 3 RTTs even post-dial (for
// unknown reasons), and under load the heartbeat may be head-of-line blocked
// by other RPC traffic.
//
// High-latency experiments with 6s RPC heartbeat timeouts showed that
// clusters running TPCC imports were stable at 400ms RTT, but started seeing
// RPC heartbeat failures at 500ms RTT. With light load (e.g. rate-limited
// kv50), clusters were stable at 1000ms RTT.
//
// The maximum p99 RPC heartbeat latency in any Cockroach Cloud cluster over a
// 90-day period was found to be 557ms. This was a single-region US cluster
// where the latency was caused by CPU overload.
//
// TODO(erikgrinaker): We should avoid head-of-line blocking for RPC
// heartbeats and reduce this to NetworkTimeout (plus DialTimeout for the
// initial heartbeat), see:
// https://github.com/cockroachdb/cockroach/issues/93397.
defaultRPCHeartbeatTimeout = 3 * NetworkTimeout

// defaultRaftElectionTimeoutTicks specifies the number of Raft Tick
// invocations that must pass between elections.
Expand Down Expand Up @@ -252,11 +268,13 @@ type Config struct {
// This is computed from HTTPAddr if specified otherwise Addr.
HTTPAdvertiseAddr string

// RPCHeartbeatIntervalAndTimeout controls how often a Ping request is sent on
// peer connections to determine connection health and update the local view
// of remote clocks. This is also used as a timeout for heartbeats, so don't
// set this too low.
RPCHeartbeatIntervalAndTimeout time.Duration
// RPCHeartbeatInterval controls how often Ping requests are sent on peer
// connections to determine connection health and update the local view of
// remote clocks.
RPCHeartbeatInterval time.Duration

// RPCHearbeatTimeout is the timeout for Ping requests.
RPCHeartbeatTimeout time.Duration

// SecondaryTenantPortOffset is the increment to add to the various
// addresses to generate the network configuration for the in-memory
Expand Down Expand Up @@ -314,7 +332,8 @@ func (cfg *Config) InitDefaults() {
cfg.SQLAdvertiseAddr = cfg.SQLAddr
cfg.SocketFile = ""
cfg.SSLCertsDir = DefaultCertsDirectory
cfg.RPCHeartbeatIntervalAndTimeout = defaultRPCHeartbeatIntervalAndTimeout
cfg.RPCHeartbeatInterval = PingInterval
cfg.RPCHeartbeatTimeout = defaultRPCHeartbeatTimeout
cfg.ClusterName = ""
cfg.DisableClusterNameVerification = false
cfg.ClockDevicePath = ""
Expand Down
12 changes: 7 additions & 5 deletions pkg/rpc/context.go
Original file line number Diff line number Diff line change
Expand Up @@ -365,8 +365,9 @@ type Context struct {
RemoteClocks *RemoteClockMonitor
MasterCtx context.Context

heartbeatTimeout time.Duration
HeartbeatCB func()
heartbeatInterval time.Duration
heartbeatTimeout time.Duration
HeartbeatCB func()

rpcCompression bool

Expand Down Expand Up @@ -571,15 +572,16 @@ func NewContext(ctx context.Context, opts ContextOptions) *Context {
rpcCompression: enableRPCCompression,
MasterCtx: masterCtx,
metrics: makeMetrics(),
heartbeatTimeout: opts.Config.RPCHeartbeatIntervalAndTimeout,
heartbeatInterval: opts.Config.RPCHeartbeatInterval,
heartbeatTimeout: opts.Config.RPCHeartbeatTimeout,
logClosingConnEvery: log.Every(time.Second),
}

// We only monitor remote clocks in server-to-server connections.
// CLI commands are exempted.
if !opts.ClientOnly {
rpcCtx.RemoteClocks = newRemoteClockMonitor(
opts.Clock, opts.MaxOffset, 10*opts.Config.RPCHeartbeatIntervalAndTimeout, opts.Config.HistogramWindowInterval())
opts.Clock, opts.MaxOffset, 10*opts.Config.RPCHeartbeatTimeout, opts.Config.HistogramWindowInterval())
}

if id := opts.Knobs.StorageClusterID; id != nil {
Expand Down Expand Up @@ -2193,7 +2195,7 @@ func (rpcCtx *Context) runHeartbeat(
})
}

heartbeatTimer.Reset(rpcCtx.Config.RPCHeartbeatIntervalAndTimeout)
heartbeatTimer.Reset(rpcCtx.heartbeatInterval)
}
}

Expand Down
20 changes: 13 additions & 7 deletions pkg/rpc/context_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -265,7 +265,8 @@ func testClockOffsetInPingRequestInternal(t *testing.T, clientOnly bool) {
clientOpts := opts
clientOpts.Config = testutils.NewNodeTestBaseContext()
// Experimentally, values below 50ms seem to incur flakiness.
clientOpts.Config.RPCHeartbeatIntervalAndTimeout = 100 * time.Millisecond
clientOpts.Config.RPCHeartbeatInterval = 100 * time.Millisecond
clientOpts.Config.RPCHeartbeatTimeout = 100 * time.Millisecond
clientOpts.ClientOnly = clientOnly
clientOpts.OnOutgoingPing = func(ctx context.Context, req *PingRequest) error {
select {
Expand Down Expand Up @@ -903,7 +904,8 @@ func TestHeartbeatHealth(t *testing.T) {
clientCtx.Config.AdvertiseAddr = lisLocalServer.Addr().String()

// Make the interval shorter to speed up the test.
clientCtx.Config.RPCHeartbeatIntervalAndTimeout = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatInterval = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatTimeout = 1 * time.Millisecond

m := clientCtx.Metrics()

Expand Down Expand Up @@ -1148,7 +1150,8 @@ func TestHeartbeatHealthTransport(t *testing.T) {

clientCtx := newTestContext(clusterID, clock, maxOffset, stopper)
// Make the interval shorter to speed up the test.
clientCtx.Config.RPCHeartbeatIntervalAndTimeout = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatInterval = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatTimeout = 1 * time.Millisecond
if _, err := clientCtx.GRPCDialNode(remoteAddr, serverNodeID, DefaultClass).Connect(context.Background()); err != nil {
t.Fatal(err)
}
Expand Down Expand Up @@ -1259,7 +1262,7 @@ func TestHeartbeatHealthTransport(t *testing.T) {
})

// Should stay unhealthy despite reconnection attempts.
for then := timeutil.Now(); timeutil.Since(then) < 50*clientCtx.Config.RPCHeartbeatIntervalAndTimeout; {
for then := timeutil.Now(); timeutil.Since(then) < 50*clientCtx.Config.RPCHeartbeatTimeout; {
err := clientCtx.TestingConnHealth(remoteAddr, serverNodeID)
if !isUnhealthy(err) {
t.Fatal(err)
Expand Down Expand Up @@ -1303,7 +1306,8 @@ func TestOffsetMeasurement(t *testing.T) {
clientMaxOffset := time.Duration(0)
clientCtx := newTestContext(clusterID, clientClock, clientMaxOffset, stopper)
// Make the interval shorter to speed up the test.
clientCtx.Config.RPCHeartbeatIntervalAndTimeout = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatInterval = 1 * time.Millisecond
clientCtx.Config.RPCHeartbeatTimeout = 1 * time.Millisecond
clientCtx.RemoteClocks.offsetTTL = 5 * clientClock.getAdvancementInterval()
if _, err := clientCtx.GRPCDialNode(remoteAddr, serverNodeID, DefaultClass).Connect(ctx); err != nil {
t.Fatal(err)
Expand Down Expand Up @@ -1465,7 +1469,8 @@ func TestRemoteOffsetUnhealthy(t *testing.T) {
clock := timeutil.NewManualTime(timeutil.Unix(0, start.Add(nodeCtxs[i].offset).UnixNano()))
nodeCtxs[i].errChan = make(chan error, 1)
nodeCtxs[i].ctx = newTestContext(clusterID, clock, maxOffset, stopper)
nodeCtxs[i].ctx.Config.RPCHeartbeatIntervalAndTimeout = maxOffset
nodeCtxs[i].ctx.Config.RPCHeartbeatInterval = maxOffset
nodeCtxs[i].ctx.Config.RPCHeartbeatTimeout = maxOffset
nodeCtxs[i].ctx.NodeID.Set(context.Background(), roachpb.NodeID(i+1))

s := newTestServer(t, nodeCtxs[i].ctx)
Expand Down Expand Up @@ -1681,7 +1686,8 @@ func grpcRunKeepaliveTestCase(testCtx context.Context, c grpcKeepaliveTestCase)
log.Infof(ctx, "setting up client")
clientCtx := newTestContext(clusterID, clock, maxOffset, stopper)
// Disable automatic heartbeats. We'll send them by hand.
clientCtx.Config.RPCHeartbeatIntervalAndTimeout = math.MaxInt64
clientCtx.Config.RPCHeartbeatInterval = math.MaxInt64
clientCtx.Config.RPCHeartbeatTimeout = math.MaxInt64

var firstConn int32 = 1

Expand Down
4 changes: 2 additions & 2 deletions pkg/rpc/keepalive.go
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,8 @@ var clientKeepalive = keepalive.ClientParameters{
PermitWithoutStream: true,
}
var serverKeepalive = keepalive.ServerParameters{
// Send periodic pings on the connection.
Time: base.NetworkTimeout,
// Send periodic pings on the connection when there is no other traffic.
Time: base.PingInterval,
// If the pings don't get a response within the timeout, we might be
// experiencing a network partition. gRPC will close the transport-level
// connection and all the pending RPCs (which may not have timeouts) will
Expand Down
3 changes: 2 additions & 1 deletion pkg/rpc/nodedialer/nodedialer_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -462,7 +462,8 @@ func newTestContext(
) *rpc.Context {
cfg := testutils.NewNodeTestBaseContext()
cfg.Insecure = true
cfg.RPCHeartbeatIntervalAndTimeout = 100 * time.Millisecond
cfg.RPCHeartbeatInterval = 100 * time.Millisecond
cfg.RPCHeartbeatTimeout = 100 * time.Millisecond
ctx := context.Background()
rctx := rpc.NewContext(ctx, rpc.ContextOptions{
TenantID: roachpb.SystemTenantID,
Expand Down
3 changes: 2 additions & 1 deletion pkg/server/server_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -464,7 +464,8 @@ func makeInMemoryTenantServerConfig(
baseCfg.Config.User = kvServerCfg.Config.User
baseCfg.Config.DisableTLSForHTTP = kvServerCfg.Config.DisableTLSForHTTP
baseCfg.Config.AcceptSQLWithoutTLS = kvServerCfg.Config.AcceptSQLWithoutTLS
baseCfg.Config.RPCHeartbeatIntervalAndTimeout = kvServerCfg.Config.RPCHeartbeatIntervalAndTimeout
baseCfg.Config.RPCHeartbeatInterval = kvServerCfg.Config.RPCHeartbeatInterval
baseCfg.Config.RPCHeartbeatTimeout = kvServerCfg.Config.RPCHeartbeatTimeout
baseCfg.Config.ClockDevicePath = kvServerCfg.Config.ClockDevicePath
baseCfg.Config.ClusterName = kvServerCfg.Config.ClusterName
baseCfg.Config.DisableClusterNameVerification = kvServerCfg.Config.DisableClusterNameVerification
Expand Down

0 comments on commit e97fe41

Please sign in to comment.