
gossip: adjust recovery timings to tolerate shorter lease expiration #133300

Conversation

nvanbenschoten
Member

Fixes #133159.

This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or stuck in a partition, and thus needs a short TTL so that partitions are healed quickly. In particular, partitions need to resolve faster than the liveness expiration (6s), or node liveness will be adversely affected, which can trigger false positives in the ranges.unavailable metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, because it controls how frequently gossip connection attempts are made. The stall check itself is very cheap, so checking more often adds negligible load to the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the ranges.unavailable metric.
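
For orientation, here is a minimal sketch of the timing relationship this change targets. The constant names and values below are illustrative stand-ins (only SentinelGossipTTL is referred to by name later in this PR), not the actual identifiers in the gossip or liveness packages:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; the real settings live in CockroachDB's gossip
// and liveness code.
const (
	nodeLivenessExpiration = 6 * time.Second // how long a liveness record stays valid
	sentinelGossipTTL      = 3 * time.Second // reduced from 6s by this change
	gossipStallInterval    = 1 * time.Second // reduced from 2s by this change
)

func main() {
	// A gossip partition must be detected (sentinel expires) and repaired
	// (a stall check fires and a reconnect succeeds) before node liveness
	// lapses; otherwise ranges.unavailable can report false positives.
	slack := nodeLivenessExpiration - sentinelGossipTTL - gossipStallInterval
	fmt.Printf("slack before liveness expires: %s\n", slack) // 2s with the new values
}
```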

@nvanbenschoten nvanbenschoten requested review from a team as code owners October 23, 2024 20:39
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@miraradeva miraradeva left a comment


:lgtm:

Not sure if it's feasible at all, but some sort of test would be nice to ensure that we don't hit the same problem again if these settings are changed: something that reproduces the issue in a simple way. Only if you think it's easy to do; I haven't spent much time thinking about whether it would require manipulating the gossip network in unnatural ways.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/sentinelGossipTTLTime branch from 11a23bb to 8f00a5b on November 5, 2024 23:24
@nvanbenschoten nvanbenschoten added the backport-23.1.x, backport-23.2.x, backport-24.1.x, backport-24.2.x, and backport-24.3.x labels on Nov 5, 2024
@nvanbenschoten
Member Author

Not sure if it's feasible at all, but some sort of test would be nice to ensure that we don't hit the same problem again if these settings are changed: something that reproduces the issue in a simple way. Only if you think it's easy to do; I haven't spent much time thinking about whether it would require manipulating the gossip network in unnatural ways.

I spent some time working on a test for this. #134477 demonstrates what that might look like. It periodically drains and restarts nodes in a TestCluster and watches to see whether any node reports a non-zero ranges.unavailable value.

Unfortunately, the test is timing-based (gossip needs to recover in time), so it's going to be very difficult to stabilize and eliminate all sources of flakiness. It also takes minutes to run, so it's not a great candidate for a unit test. That said, I did observe that the failure rate dropped from 47% to 3% with the change in this PR, which is a good indication that this change is behaving as expected. There's also an explanation for the 47% failure rate before: with the SentinelGossipTTL and the lease duration both set to 6s and both being extended at random points, there should be about a 50% chance (roughly by symmetry between the two expirations) of the liveness records expiring before the sentinel info.

Since landing a stable test for the property that a rolling restart never causes ranges.unavailable to be non-zero would be a large amount of work (in part because it's not a guarantee and relies on timing), I'll go ahead and merge this PR as is.
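
For illustration, the shape of such a test might look roughly like the sketch below. This is not the code in #134477; the `cluster` interface and its methods are placeholders standing in for whatever the real TestCluster harness provides:

```go
package gossiprestart

import (
	"testing"
	"time"
)

// cluster abstracts the test harness; these methods are placeholders for
// illustration, not real TestCluster APIs.
type cluster interface {
	NumNodes() int
	DrainAndRestartNode(t *testing.T, node int)
	UnavailableRangeCount(t *testing.T) int
	Stop()
}

// rollingRestartNoUnavailableRanges repeatedly drains and restarts nodes and
// asserts that no node ever reports a non-zero ranges.unavailable value while
// gossip recovers.
func rollingRestartNoUnavailableRanges(t *testing.T, c cluster, iterations int) {
	defer c.Stop()
	for i := 0; i < iterations; i++ {
		node := i % c.NumNodes()
		c.DrainAndRestartNode(t, node)

		// Give gossip a moment to notice and repair the partition, then check
		// the metric. The check is inherently timing-based, which is the
		// source of flakiness discussed above.
		time.Sleep(time.Second)
		if n := c.UnavailableRangeCount(t); n > 0 {
			t.Fatalf("restart %d (node %d): observed %d unavailable ranges", i, node, n)
		}
	}
}
```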

bors r=miraradeva

@craig craig bot merged commit fe2ac3d into cockroachdb:master Nov 6, 2024
23 checks passed

blathers-crl bot commented Nov 6, 2024

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #133159: branch-release-23.2, branch-release-24.1, branch-release-24.2, branch-release-24.3.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


blathers-crl bot commented Nov 6, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8f00a5b to blathers/backport-release-23.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-23.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-24.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-24.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.2.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-24.3-133300 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134480/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Contributor

@miraradeva miraradeva left a comment


There's also an explanation for the 47% failure rate before ...

This makes a lot of sense. I wonder whether the remaining 3% is due to timing-related test issues or to cases where it takes more than 2 attempts to reconnect properly (e.g. because of refused incoming connections or culling). I'd be curious to see whether the 3% goes down if the gossip sentinel TTL is set to 2s, so a node has 3 attempts to reconnect.
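
As a rough sanity check on that attempts math, here is a tiny illustrative calculation. It assumes a simplified model of one reconnect opportunity per sentinel expiration within the 6s liveness window, not exact gossip behavior:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const livenessExpiration = 6 * time.Second
	// Approximate reconnect opportunities before liveness lapses, assuming
	// one attempt each time the sentinel expires.
	for _, ttl := range []time.Duration{6 * time.Second, 3 * time.Second, 2 * time.Second} {
		fmt.Printf("sentinel TTL %s -> ~%d reconnect attempt(s)\n", ttl, int(livenessExpiration/ttl))
	}
}
```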

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Labels
backport-23.1.x, backport-23.2.x, backport-24.1.x, backport-24.2.x, backport-24.3.x
Development

Successfully merging this pull request may close these issues.

gossip: partition recovery is slow enough to trigger false-positives in ranges.unavailable