-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: consistency check failed on cyan #29252
Comments
|
so it looks like this time one of the followers has a pending merge transaction, whereas the other follower and the leader don't have it. |
wait a minute, it's the other way around. Presumably leader and one follower have this pending txn, but not another follower? |
There's a second, identical, consistency failure in the logs later (which makes sense - once inconsistent, twice inconsistent). The transaction timestamp boils down to 180827 at 11:30:15, at which point we find this (note that the start key of r5911 is the anchor of the inconsistent txn record):
and that seems to go through, for we see all replicas disappear:
shortly thereafter we see r5911 itself get eaten:
then r982 gets eaten (note r7900s extended bounds in the last line)...
we can chase this down all the way, but I think it's fair to say that the damage has already been done at this point (in fact, the damage likely gets done pretty much at 180827 11:30:16). |
With some newly developed options to
I verified that node i has nodeID i, so our mystery transaction appears to around on nodeIDs 1-3. So my confusion above is confirmed: r32 is on 1,3,5 at the time of the crash, and it seems that 1 and 3 actually have this txn, it's just that it's "missing" on 5. Obviously it's weird and a bit concerning to have that transaction open in the first place, after all, it's been more than a day and the corresponding merge doesn't seem to be active any more? Time to look for a merge that would've been anchored on a replica on nodes 1-3. This is confusing, because as far as I can tell none of the merges I looked at so far live on these nodes. Here are a few that do, but none of them really match up with the timestamp (but there could've been retries; they're not logged). I don't really think that there's a big point in looking into these transactions. The merge above has the matching start key and time, so it's more than likely that it wrote the txn record, but not on n1-3.
|
I'm seeing this
shortly before 11:30:16 which tells me that mostly likely during the problematic merge transaction at 11:30:14 r5911 had descriptor But isn't something backwards here:
Check out the end keys, the expected one is larger than the actual one, but the expected one has a smaller generation than the actual one:
I'm not finding anything about any splits in these logs. |
Some of my analysis above might be muddled. I'm just realizing that on 180827, we didn't have #29079 yet (the SHA was e7d570e), though we did have #29067. Either way, the problem here can't be pinned on the consistency checker. We have these keys in three places, but these three places aren't the three places where it should (or should not) be. |
@benesch I've got to run out for a while (and I've also stared at this for too long). Time for you to hopefully figure it out :-) |
This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None
29328: cli: add hex option to debug keys r=a-robinson a=tschottdorf This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None
29390: backport-2.1: storage, roachtest, cli: assorted backports r=petermattis a=tschottdorf storage: remove test-only data race The test was stopping the engine before the stopper, so the compactor was sometimes able to use the engine while it was being closed. Fixes #29302. roachtest: improve TestMonitor Add two more test cases about the exit status of the `monitor` invocation itself. roachtest: fix flake in TestMonitor Accidentally changed this in #29178. cli: add hex option to debug keys This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. storage: skip TestClosedTimestampCanServe It won't be the first one I'm looking into. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Hmm, I assume this is some bug in our mostly untested debug tools. Looking. |
Oh, great. Something is broken unmarshaling keys of the following form. I doubt those are actual range descriptors, they probably just look enough like them.
|
Working around this error, we can see the history of range descriptors on r32 on node 1's dump. First of all, here's the offending txn again:
The following range descriptors are newer than that transaction:
The above basically says that r32 moved from Now the interesting range descriptor which must have been written by the offending transactions (since it's a merge and the timestamps match):
So while located on The following descriptors are older. Looking at them in chronological order (i.e. reverse from what's here) the range first lived on
|
I bet these are just the range descriptors for ranges that have been merged away. |
I don't see where the operation would wait for the RHS to have caught up. If the n9 replica of r10033 hasn't caught up to the merge yet, what in |
The SubsumeRequest doesn't wait; that's done later in the merge transaction: cockroach/pkg/storage/replica_command.go Lines 528 to 531 in acc8f83
|
Good point. I looked at the server-side impl of that request and one thing that stood out is that it's not synchronizing with the raft mu (just the replica mutex). Looking at how this is updated, that shouldn't make a difference, though - we write the data first, and then bump the lease applied index (both on regular application and snapshots). I also looked into interleavings in which somehow the second merge could use a stale descriptor that simply wouldn't be checking n9, but found no way that this could happen. The second merge places intents on both range descriptors before doing anything critical, as it should. I still think this is somehow where the bug is, though perhaps I'm wrong. |
Extracted and interleaved logs from all nodes: https://gist.github.com/benesch/bbdb986899ca5e972ce5576149b88fbe |
The timing of these snapshots is instructive.
We see a first preemptive snapshot for r10033 right when r10033 is merging its neighbor, which is probably a first attempt at |
Well well well. n9 is getting a preemptive snapshot for the right start key... but it applies the snapshot with a totally wrong start key:
|
Specifically it is somehow using its old end key (from the first snapshot) as the start key. o_O |
So somehow r10033 transformed itself into the right neighbor it subsumed via a snapshot. |
I'm honestly a little surprised that a consistency check is where this fails. Shouldn't some assertion fire? |
|
|
So is it likely that this snapshot
which seems to reflect the correct post-first-merge r10033, somehow addressed only the subsumed portion? Or is the error on the receiving end? |
Noticed during cockroachdb#29252. Release note: None
Ok, my mind is blown:
The outgoing snapshot looks correct and yet the incoming snapshot has the busted range descriptor! o_O |
Are you trusting this code? In particular, there's |
HA! @nvanbenschoten managed to spot the bug in like 30s of looking at this code. It's here: cockroach/pkg/storage/store_snapshot.go Line 412 in 6ff2995
We're just mutating the header range descriptor directly instead of making a copy of it. 🤦♂️ |
Hmm, |
❤️ ❤️ ❤️ Glad this got done so quickly. |
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
29646: roachtest: better merge testing in clearrange r=benesch a=tschottdorf Unfortunately, the method to determine the range count is quite slow since crdb_internal.ranges internally sends an RPC for each range to determine the leaseholder. Anecdotally, I've seen ~25% of the merges completed after less than 15 minutes. I know that it's slowing down over time, but @benesch will fix that. Also throws in aggressive consistency checks so that when something goes out of sync, we find out right there. Release note: None 29677: storage: preserve consistency when applying widening preemptive snapshots r=benesch a=benesch Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes #29252. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Nikhil Benesch <[email protected]>
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None
https://sentry.io/cockroach-labs/cockroachdb/issues/666877375/
The text was updated successfully, but these errors were encountered: