-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backport-19.2: terminate nodes in minority on consistency failure #42149
Conversation
Previously, a consistency failure was always framed as a deviation from the leaseholder. However, in practice a mismatch is always symmetric: the leaseholder could be right and the follower wrong; the follower might be right; and in fact the (usually 2+) followers may agree that the leaseholder is wrong. We've historically found that the majority was usually right, that is, with very few exceptions (I remember a single one), if there is one outlier replica, that replica is usually the problematic one, or more generally, the SHA with the lowest multiplicity is usually the one worth suspecting (we call this a "minority" in this context). The consistency checker currently fatals the leaseholder indiscriminately. This is an especially poor choice that was made out of convenience. Killing the leaseholder, which may very well have the correct data, forces the lease to move to a follower, often resulting in the corrupt node taking over the lease and spreading the corruption further. In a follow-up change, we'll want to kill the replicas representing the SHA with the lowest multiplicity instead of defaulting to the leaseholder. Release note (general change): Improve the consistency checker's log output.
Following up on cockroachdb#41893, the consistency checker now instructs the replicas corresponding to the SHA with the lowest multiplicity to terminate. Note that this change is a candidate for backporting into 19.2. An "old" leaseholder will never populate the newly introduced field, and will always terminate itself. A new leaseholder will populate the field, but only new nodes will interpret it. In the worst case, the result will be a consistency failure that does not lead to a fatal error (but collects RocksDB checkpoints just the same). Since no significant time is usually spent in a mixed-patch-releases, this is not a concern that outweighs the benefit of terminating (heuristically) the "right" nodes. Release note (general improvement): when the replicas within a range have found to been corrupted, the outliers will be terminated. Previously, the leaseholder replica would terminate, regardless of which replicas disagreed with each other. This is expected to curb the spreading of corrupted data better than the previous approach.
If stats mismatches aren't severe (which has been our stance so far), I don't think we want to backport changes to crash on them. We can backport the fix for the known issue #42148 (although it'll go into 19.2.1 now), but I'd let stats mismatches continue without aborting in case we've missed some causes. |
@bdarnell the PR is mislabeled. You need to |
OK. If it's for data mismatches instead of stats mismatches then I don't object as much, but this is still a significant change to a little-exercised code path and I don't think it's a great candidate for backporting. I'd personally leave this for 20.1 |
Yeah, this isn't a slam dunk for backporting. There is a certain risk in this backport but also the current code tends to spread inconsistencies around the cluster, something that really bit us in the past. I'll bring this up in the weekly tomorrow. |
@bdarnell when are you OK with merging this backport? Note that I'm planning to merge #42264 only a) after the 19.2 release or b) after this backport has landed, whichever happens earlier. This is because otherwise we'll get spurious test failures because we can't skip the roachtest on only one of the branches right now. |
Let's wait a few days after 19.2 is out to see if there is going to be a quick 19.2.1. |
Many deployments auto-restart crashed nodes unconditionally. As a result, we are often pulled into inconsistency failures relatively late, which results in a loss of information (in particular if the source of the problem turns out to be one of the nodes that did not actually crash); if the node that crashes *is* the one with the problem, restarting it lets it serve data again (until the next time the consistency checker nukes it), which is bad as well. This commit introduces a "death rattle" - if a node is told to terminate as the result of a consistency check, it will write a marker file that prevents subsequent restarts of the node with an informative message alerting the operator that there is a serious problem that needs to be addressed. We use the same strategy when a replica finds that its internal invariants are violated. Fixes cockroachdb#40442. Release note (general change): nodes that have been terminated as the result of a failed consistency check now refuse to restart, making it more likely that the operator notices that there is a persistent issue in a timely manner.
Cherry-picked #42401 into this branch. Looks like we're holding off on the merge for a bit longer, too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can go ahead and merge now that the commit has been picked for 19.2.1.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release