
gossip: adjust recovery timings to tolerate shorter lease expiration #133300

Conversation

nvanbenschoten
Member

Fixes #133159.

This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or stuck in a partition, and thus needs a short TTL so that partitions are healed quickly. In particular, partitions need to resolve faster than the liveness expiration (6s), or node liveness will be adversely affected, which can trigger false positives in the ranges.unavailable metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, because it controls how frequently gossip connection attempts are made. The stall check itself is very cheap, so checking more often adds negligible load to the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the ranges.unavailable metric.
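
For orientation, here is a minimal sketch of the timing relationship this change targets. The constant names and values below are illustrative stand-ins (only SentinelGossipTTL is referred to by name later in this PR), not the actual identifiers in the gossip or liveness packages:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; the real settings live in CockroachDB's gossip
// and liveness code.
const (
	nodeLivenessExpiration = 6 * time.Second // how long a liveness record stays valid
	sentinelGossipTTL      = 3 * time.Second // reduced from 6s by this change
	gossipStallInterval    = 1 * time.Second // reduced from 2s by this change
)

func main() {
	// A gossip partition must be detected (sentinel expires) and repaired
	// (a stall check fires and a reconnect succeeds) before node liveness
	// lapses; otherwise ranges.unavailable can report false positives.
	slack := nodeLivenessExpiration - sentinelGossipTTL - gossipStallInterval
	fmt.Printf("slack before liveness expires: %s\n", slack) // 2s with the new values
}
```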

@nvanbenschoten nvanbenschoten requested review from a team as code owners October 23, 2024 20:39
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@miraradeva miraradeva left a comment


:lgtm:

Not sure if it's feasible at all, but some sort of test would be nice to ensure that we don't hit the same problem again if these settings are changed: something that reproduces the issue in a simple way. Only if you think it's easy to do; I haven't spent much time thinking about whether it would require manipulating the gossip network in unnatural ways.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/sentinelGossipTTLTime branch from 11a23bb to 8f00a5b on November 5, 2024 23:24
@nvanbenschoten nvanbenschoten added the backport-23.1.x, backport-23.2.x, backport-24.1.x, backport-24.2.x, and backport-24.3.x labels on Nov 5, 2024
@nvanbenschoten
Member Author

Not sure if it's feasible at all, but some sort of test would be nice to ensure that we don't hit the same problem again if these settings are changed: something that reproduces the issue in a simple way. Only if you think it's easy to do; I haven't spent much time thinking about whether it would require manipulating the gossip network in unnatural ways.

I spent some time working on a test for this. #134477 demonstrates what that might look like. It periodically drains and restarts nodes in a TestCluster and watches to see whether any node reports a non-zero ranges.unavailable value.

Unfortunately, the test is timing-based (gossip needs to recover in time), so it's going to be very difficult to stabilize and eliminate all sources of flakiness. It also takes minutes to run, so it's not a great candidate for a unit test. That said, I did observe that the failure rate dropped from 47% to 3% with the change in this PR, which is a good indication that this change is behaving as expected. There's also an explanation for the 47% failure rate before: with the SentinelGossipTTL and the lease duration both set to 6s and both being extended at random points, there should be about a 50% chance (roughly by symmetry between the two expirations) of the liveness records expiring before the sentinel info.

Since landing a stable test for the property that a rolling restart never causes ranges.unavailable to be non-zero would be a large amount of work (in part because it's not a guarantee and relies on timing), I'll go ahead and merge this PR as is.
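
For illustration, the shape of such a test might look roughly like the sketch below. This is not the code in #134477; the `cluster` interface and its methods are placeholders standing in for whatever the real TestCluster harness provides:

```go
package gossiprestart

import (
	"testing"
	"time"
)

// cluster abstracts the test harness; these methods are placeholders for
// illustration, not real TestCluster APIs.
type cluster interface {
	NumNodes() int
	DrainAndRestartNode(t *testing.T, node int)
	UnavailableRangeCount(t *testing.T) int
	Stop()
}

// rollingRestartNoUnavailableRanges repeatedly drains and restarts nodes and
// asserts that no node ever reports a non-zero ranges.unavailable value while
// gossip recovers.
func rollingRestartNoUnavailableRanges(t *testing.T, c cluster, iterations int) {
	defer c.Stop()
	for i := 0; i < iterations; i++ {
		node := i % c.NumNodes()
		c.DrainAndRestartNode(t, node)

		// Give gossip a moment to notice and repair the partition, then check
		// the metric. The check is inherently timing-based, which is the
		// source of flakiness discussed above.
		time.Sleep(time.Second)
		if n := c.UnavailableRangeCount(t); n > 0 {
			t.Fatalf("restart %d (node %d): observed %d unavailable ranges", i, node, n)
		}
	}
}
```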

bors r=miraradeva

@craig craig bot merged commit fe2ac3d into cockroachdb:master Nov 6, 2024
23 checks passed

blathers-crl bot commented Nov 6, 2024

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #133159: branch-release-23.2, branch-release-24.1, branch-release-24.2, branch-release-24.3.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


blathers-crl bot commented Nov 6, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8f00a5b to blathers/backport-release-23.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-23.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-24.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


error creating merge commit from 8f00a5b to blathers/backport-release-24.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.2.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-24.3-133300 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134480/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Contributor

@miraradeva miraradeva left a comment


There's also an explanation for the 47% failure rate before ...

This makes a lot of sense. I wonder whether the remaining 3% is due to timing-related test issues or to cases where it takes more than 2 attempts to reconnect properly (e.g. because of refused incoming connections or culling). I'd be curious to see whether the 3% goes down if the gossip sentinel TTL is set to 2s, so a node has 3 attempts to reconnect.
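
As a rough sanity check on that attempts math, here is a tiny illustrative calculation. It assumes a simplified model of one reconnect opportunity per sentinel expiration within the 6s liveness window, not exact gossip behavior:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const livenessExpiration = 6 * time.Second
	// Approximate reconnect opportunities before liveness lapses, assuming
	// one attempt each time the sentinel expires.
	for _, ttl := range []time.Duration{6 * time.Second, 3 * time.Second, 2 * time.Second} {
		fmt.Printf("sentinel TTL %s -> ~%d reconnect attempt(s)\n", ttl, int(livenessExpiration/ttl))
	}
}
```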

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Labels
backport-23.1.x, backport-23.2.x, backport-24.1.x, backport-24.2.x, backport-24.3.x
Development

Successfully merging this pull request may close these issues.

gossip: partition recovery is slow enough to trigger false-positives in ranges.unavailable