[Replicated] release-23.1: gossip: adjust recovery timings to tolerate shorter lease expiration #36

Merged
merged 1 commit into from
Dec 19, 2024


mohini-crl

Replicated from original PR cockroachdb#134603

Original author: nvanbenschoten
Original creation date: 2024-11-07T23:36:34Z

Original reviewers: miraradeva

Original description:

Backport 1/1 commits from cockroachdb#133300.

/cc @cockroachdb/release


Fixes cockroachdb#133159.

This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key tells gossip whether it is connected to the primary gossip network or to a partition, so it needs a short TTL for partitions to be fixed quickly. In particular, partitions need to resolve faster than the node liveness expiration (6s); otherwise node liveness is adversely affected, which can trigger false positives in the `ranges.unavailable` metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval controls how frequently gossip connection attempts are made, so it also affects how quickly gossip partitions are noticed and repaired. The stall check itself is very cheap, so running it twice as often should add no meaningful load on the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the ranges.unavailable metric.


Release justification: low risk change to avoid false positive alerts.

@mohini-crl mohini-crl merged commit d88d2f9 into master Dec 19, 2024
1 of 15 checks passed
Development

Successfully merging this pull request may close these issues.

gossip: partition recovery is slow enough to trigger false-positives in ranges.unavailable