Merge #133300
133300: gossip: adjust recovery timings to tolerate shorter lease expiration r=miraradeva a=nvanbenschoten

Fixes #133159.

This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or only to a partition, and thus needs a short TTL so that partitions are fixed quickly. In particular, partitions need to resolve faster than the node liveness expiration (6s) or node liveness will be adversely affected, which can trigger false positives in the `ranges.unavailable` metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, because it controls how frequently gossip connection attempts are made. The stall check itself is very cheap, so this adds no meaningful load on the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the `ranges.unavailable` metric.

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig[bot] and nvanbenschoten committed Nov 6, 2024
2 parents f5f5cbc + 8f00a5b commit fe2ac3d
Showing 3 changed files with 12 additions and 9 deletions.
17 changes: 10 additions & 7 deletions pkg/base/config.go
@@ -771,14 +771,17 @@ func (cfg RaftConfig) StoreLivenessDurations() (livenessInterval, heartbeatInter
 	return
 }
 
-// SentinelGossipTTL is time-to-live for the gossip sentinel. The sentinel
-// informs a node whether or not it's connected to the primary gossip network
-// and not just a partition. As such it must expire fairly quickly and be
-// continually re-gossiped as a connected gossip network is necessary to
-// propagate liveness. The replica which is the lease holder of the first range
-// gossips it.
+// SentinelGossipTTL is time-to-live for the gossip sentinel, which is gossiped
+// by the leaseholder of the first range. The sentinel informs a node whether or
+// not it is connected to the primary gossip network and not just a partition.
+// As such it must expire fairly quickly and be continually re-gossiped as a
+// connected gossip network is necessary to propagate liveness. Notably, it must
+// expire faster than the liveness records carried by the gossip network so that
+// a gossip partition is detected and healed before that liveness information
+// expires. Failure to do so can result in false positive dead node detection,
+// which can show up as false positive range unavailability in metrics.
 func (cfg RaftConfig) SentinelGossipTTL() time.Duration {
-	return cfg.RangeLeaseDuration
+	return cfg.RangeLeaseDuration / 2
 }
 
 // DefaultRetryOptions should be used for retrying most
2 changes: 1 addition & 1 deletion pkg/base/testdata/raft_config
@@ -24,4 +24,4 @@ RangeLeaseDurations: active=6s renewal=3s
 RangeLeaseAcquireTimeout: 4s
 NodeLivenessDurations: active=6s renewal=3s
 StoreLivenessDurations: active=6s renewal=3s
-SentinelGossipTTL: 6s
+SentinelGossipTTL: 3s
2 changes: 1 addition & 1 deletion pkg/gossip/gossip.go
@@ -85,7 +85,7 @@ const (
 	// defaultStallInterval is the default interval for checking whether
 	// the incoming and outgoing connections to the gossip network are
 	// insufficient to keep the network connected.
-	defaultStallInterval = 2 * time.Second
+	defaultStallInterval = 1 * time.Second
 
 	// defaultBootstrapInterval is the minimum time between successive
 	// bootstrapping attempts to avoid busy-looping trying to find the
