[Replicated] release-23.1: gossip: adjust recovery timings to tolerate shorter lease expiration #36
Replicated from original PR cockroachdb#134603
Original author: nvanbenschoten
Original creation date: 2024-11-07T23:36:34Z
Original reviewers: miraradeva
Original description:
Backport 1/1 commits from cockroachdb#133300.
/cc @cockroachdb/release
Fixes cockroachdb#133159.
This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or a partition and thus needs a short TTL so that partitions are fixed quickly. In particular, partitions need to resolve faster than the timeout (6s) or node liveness will be adversely affected, which can trigger false-positives in the `ranges.unavailable` metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, controlling how frequently gossip connection attempts are made. The stall check itself is very cheap, so this produces no load on the system.
Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the `ranges.unavailable` metric.

Release justification: low risk change to avoid false positive alerts.