gossip: adjust recovery timings to tolerate shorter lease expiration #133300
Not sure if it's feasible at all but some sort of test would be nice to ensure that we don't hit the same problem again if these settings are changed. Something that reproduces the issue in a simple way. Only if you think it's easy to do; I haven't spent much time thinking if it will require manipulating the gossip network in unnatural ways.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Fixes cockroachdb#133159.

This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or a partition and thus needs a short TTL so that partitions are fixed quickly. In particular, partitions need to resolve faster than the timeout (6s) or node liveness will be adversely affected, which can trigger false-positives in the `ranges.unavailable` metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, controlling how frequently gossip connection attempts are made. The stall check itself is very cheap, so this produces no load on the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the `ranges.unavailable` metric.
11a23bb to 8f00a5b
I spent some time working on a test for this. #134477 demonstrates what that might look like. It periodically drains and restarts nodes in a [...]

Unfortunately, the test is timing-based (gossip needs to recover in time), so it's going to be very difficult to stabilize and eliminate all sources of flakiness. It also takes minutes to run, so it's not a great candidate for a unit test.

That said, I did observe that the failure rate dropped from 47% to 3% with the change in this PR, which is a good indication that this change is behaving as expected. There's also an explanation for the 47% failure rate before, which is that with the `SentinelGossipTTL` and the lease duration both set to 6s and both being extended at random points, there should be about a 50% chance of the liveness records expiring before the sentinel info.

Since landing a stable test for the property that a rolling restart never causes [...]

bors r=miraradeva
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.

Issue #133159: branch-release-23.2, branch-release-24.1, branch-release-24.2, branch-release-24.3.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 8f00a5b to blathers/backport-release-23.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1.x failed. See errors above.

error creating merge commit from 8f00a5b to blathers/backport-release-23.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.2.x failed. See errors above.

error creating merge commit from 8f00a5b to blathers/backport-release-24.1-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 24.1.x failed. See errors above.

error creating merge commit from 8f00a5b to blathers/backport-release-24.2-133300: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 24.2.x failed. See errors above.

error setting reviewers, but backport branch blathers/backport-release-24.3-133300 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134480/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []
Backport to branch 24.3.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
There's also an explanation for the 47% failure rate before ...
This makes a lot of sense. I wonder if the remaining 3% are timing-related test issues or the probability that it takes more than 2 attempts to reconnect properly (e.g. because of refused incoming connections or culling). I'd be curious to see if the 3% goes down if the gossip sentinel TTL is at 2s, so a node has 3 attempts to reconnect.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Fixes #133159.
This commit reduces the gossip sentinel TTL from 6s to 3s, so that it is no longer aligned with the node liveness expiration of 6s. The sentinel key informs gossip whether it is connected to the primary gossip network or a partition and thus needs a short TTL so that partitions are fixed quickly. In particular, partitions need to resolve faster than the timeout (6s) or node liveness will be adversely affected, which can trigger false-positives in the `ranges.unavailable` metric.

This commit also reduces the gossip stall check interval from 2s to 1s. The stall check interval also affects how quickly gossip partitions are noticed and repaired, controlling how frequently gossip connection attempts are made. The stall check itself is very cheap, so this produces no load on the system.

Release note (bug fix): Reduce the duration of partitions in the gossip network when a node crashes in order to eliminate false positives in the `ranges.unavailable` metric.