Fatal consistency check failed with 1 inconsistent replicas #35424
Link to debug.zip: https://drive.google.com/open?id=1qkn-XLw4Kno1ZnbHCFvXQL_DNCABME6y

cluster: andy-72

decoded kv pairs:

Repro steps:

Resulting in:
Here's what we know:
A couple of interesting things:
We have the Raft logs for the range from some index onwards; we have the second split record but not the first. So what now? Unclear.
These are long shots, but what is there to do here but go really guerrilla? @ajkr, would you kindly assist me in squeezing RocksDB? Also, I've run

@tbg, what detection do we have for bad stats? Does any queue recompute them? Perhaps we should introduce something that crashes.

Attaching the range report:

Let's make it a party @bdarnell @petermattis @nvanbenschoten @ajwerner
What's the check-store output on n3? Specifically, is it OK with the stats for r336?

The stats, when "consistently inconsistent" (i.e. agreeing across the replicas, but not agreeing with a recomputation from scratch), are actually fixed by the consistency checker. Along with the hash, it also computes the stats, and when there's no inconsistency but there is a stats delta, it triggers a recomputation. In that sense, incorrect stats are not by themselves a huge problem, but the code we wrote should keep them consistent, though it's been known (#28946) that we don't always manage to. Note that ranges that hold timeseries data always have incorrect stats, but that's ok because they have

What's in the Raft logs? I think that the debug command needs to be brushed up a bit to be useful.

cockroach/pkg/cli/debug/print.go Line 94 in d3aae3e
This code should also pretty-print the
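To make the mechanism described above concrete, here is a minimal Go sketch (not the actual consistency-checker code; `MVCCStats` is trimmed down and `checkConsistency` is a hypothetical helper): a hash mismatch between replicas is the fatal case, while matching hashes combined with a persisted-vs-recomputed stats delta only trigger a stats recomputation.

```go
package main

import "fmt"

// MVCCStats is a trimmed-down stand-in for the real enginepb.MVCCStats.
type MVCCStats struct {
	LiveBytes, KeyBytes, ValBytes, SysCount int64
}

// Sub returns the component-wise difference s - o.
func (s MVCCStats) Sub(o MVCCStats) MVCCStats {
	return MVCCStats{
		LiveBytes: s.LiveBytes - o.LiveBytes,
		KeyBytes:  s.KeyBytes - o.KeyBytes,
		ValBytes:  s.ValBytes - o.ValBytes,
		SysCount:  s.SysCount - o.SysCount,
	}
}

// checkConsistency mimics the flow described above: if the replica hashes
// diverge, that's a fatal inconsistency; if the hashes agree but the persisted
// stats differ from the stats recomputed from the data, report a delta so a
// recomputation can be queued.
func checkConsistency(hashes [][]byte, persisted, recomputed MVCCStats) (fatal bool, delta MVCCStats) {
	for _, h := range hashes[1:] {
		if string(h) != string(hashes[0]) {
			return true, MVCCStats{} // replicas disagree on the data itself
		}
	}
	// "Consistently inconsistent" stats: all replicas agree with each other
	// but not with a recomputation from scratch; fixable via a delta.
	return false, recomputed.Sub(persisted)
}

func main() {
	persisted := MVCCStats{LiveBytes: 0, SysCount: 4}
	recomputed := MVCCStats{LiveBytes: 68814, SysCount: 5}
	fatal, delta := checkConsistency([][]byte{{0x1}, {0x1}, {0x1}}, persisted, recomputed)
	fmt.Println(fatal, delta) // false {68814 0 0 1}
}
```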
Nevermind, n4 is likely consistent with n3, and is ok with r336.
In the n2 dump, check r3289 -- it's adjacent to r336, isn't it? And its stats are exactly the negative of those of the corrupt r336 -- that can't be a coincidence:
r3289 was created through the second split
Now I would feel a lot better if someone told me that the stray txn record also belonged to the second split, because then I could stop worrying about the first split.

Anyway, these stats above tell a story: after the second split, r336 claims to have exactly the negative of its actual stats. This would happen if r336 at the time of the split trigger had completely zero stats. (I have to double check that in the code, but intuitively it makes sense that if the LHS before the split thinks that it has 0, and it gives away 68814, then afterwards it should think it has -68814.) Of course we know that it didn't have zero stats.
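A minimal Go sketch of that intuition (simplified stand-in types, not the real split-trigger code; the 68814 figure is taken from the comment above): the LHS's post-split stats are just its believed pre-split stats minus whatever it hands to the RHS, so a zeroed-out LHS ends up with exactly the negative of the RHS stats.

```go
package main

import "fmt"

// stats is a simplified stand-in for enginepb.MVCCStats.
type stats struct{ TotalBytes int64 }

// splitLHS illustrates the split-trigger bookkeeping: the LHS's new stats are
// its (believed) pre-split stats minus whatever is handed to the RHS.
func splitLHS(lhsBefore, rhs stats) stats {
	return stats{TotalBytes: lhsBefore.TotalBytes - rhs.TotalBytes}
}

func main() {
	// If the LHS correctly knows it holds 100000 bytes and gives 68814 to the
	// RHS, it ends up with 31186 -- fine.
	fmt.Println(splitLHS(stats{100000}, stats{68814})) // {31186}

	// But if the LHS's persisted stats are (wrongly) zero, the same split
	// leaves it at exactly -68814: the negative of the RHS stats, which is
	// the signature observed on r336.
	fmt.Println(splitLHS(stats{0}, stats{68814})) // {-68814}
}
```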
The Raft log entry for the second split has these stats (in the split trigger):
I think I tried to look up the range status for r3289, but... it says the range was not found! Nothing was in the logs attached here, so I pulled the new logs and found this:
(this is the stats recomputation mechanism I described above; in particular, the consistency checker found no divergence, though this is after a bunch of rebalancing that may well have masked any problem). Somewhat later, the range gets merged away, which explains why I wasn't seeing it.
Notice something? r336 merges that range back in. This wasn't reflected in the info posted above so far (because it hadn't happened yet), but look what it did to the range status at r336:
It seems to belong to the first split, though. The txn record is:
That timestamp is 13:21:01 GMT.
But when "inconsistently inconsistent", they stay that way, because the consistency checker works by computing a delta and applying it equally to all replicas. The only time we write absolute values to the stats is when creating new stats records for the RHS of a split. At all other times we adjust them with deltas, so if the first split led to a discrepancy, it would persist across future splits and the actions of the consistency queue.
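A tiny Go sketch of why that is (illustrative only; the 68814 reuses the figure quoted above, the deltas are invented): once an absolute starting value is wrong, every later delta-based adjustment moves the true and persisted values in lockstep, so the original error never closes.

```go
package main

import "fmt"

func main() {
	const trueInitial = 68814 // what the range actually holds after the split
	const wrongInitial = 0    // what (hypothetically) got persisted at the split

	truth, persisted := int64(trueInitial), int64(wrongInitial)

	// All later operations (writes, GC, the consistency queue's stats fix-up)
	// apply the same delta to every replica's persisted stats. Correct deltas
	// therefore never repair an incorrect starting point.
	for _, delta := range []int64{+1200, -300, +55} {
		truth += delta
		persisted += delta
	}
	fmt.Println(truth-persisted == trueInitial-wrongInitial) // true: the error persists
}
```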
A second node died. I extended the cluster so it's still available.
It's possible that #34989 is the same issue.
Looks like the same inconsistency. This time, the leaseholder had the incorrect stats and compared itself with n4 (the one with the correct stats), so it killed itself. The new leaseholder is also incorrect and so the newly added replica is also incorrect. Effectively the same situation as before.
That's a shame; I wasn't able to find the logs for it (I was hoping it was a roachtest, but apparently not).
I'm poking at this with

So far I'm able to read the stats key, and am waiting for the command that dumps all the historical versions to return (I didn't figure out how to do that for a single key; it seems to dump the whole db).
The RocksDB sleuthing worked out, though I can't say I got very good information out of it (next time this happens, we need to immediately shut down the nodes that are inconsistent, to preserve more history at the RocksDB level). Below is the earliest applied state from the incorrect node (n2), at lease applied index 18. The Raft command before that one was the second split (which doesn't matter, just pointing it out). Note the sys count of four:
Now here's :3 (the consistent node that crashed) at the same lease applied index (not the same Raft log position, but any command that can change sys_count, I think, must assign a lease index)
At the same time :3 has one transaction record extra (the one for the first split), which is a sys_count delta of one. I verified the range-descriptors on :3 and :6 and they agree (up to the point at which :3 crashed). So both replicas applied all of the splits similarly.
This doesn't seem to add up.

Let's assume the first split entirely messed things up and left n2 (and the other replica) with mostly zero stats, but somehow initialized the RHS successfully so that it never shows up here (and I just manually ran a consistency check, which didn't kill any additional nodes). n4 presumably has both sides of the split 100% correct at this point, so the split trigger computed by n2 must've reflected the reality on the ground stats-wise. Now how would the second split mess up the range stats on the RHS of the second split (r3289)? The consistent node (crashed) shows that as the only range for which the persisted stats don't match, but it itself evaluated that split.

On the other hand, if the first split was fine, and the second somehow did damage (which is more plausible looking at the stats), then why is there a transaction record on the consistent replica and not the others? The txn record is supposed to be there, and it also wouldn't be deleted synchronously with the split (due to the external intents). So what may have deleted that transaction proto? Committing the range descriptor would take place in the same WriteBatch as committing the txn proto, and the range descriptor seems to have committed throughout. The stats on the consistent node suggest that the txn record ought to be there. Then how did it disappear elsewhere?
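For readers not steeped in the storage layer, here is a toy Go sketch of the atomicity argument above (a stand-in batch type, not the real RocksDB/engine API): the descriptor update and the committed txn record go into one write batch, so a replica should end up with both effects or neither.

```go
package main

import "fmt"

// batch is a toy stand-in for a storage-engine write batch: all of its
// mutations become visible atomically, or none do.
type batch struct{ writes map[string]string }

func (b *batch) Put(k, v string) {
	if b.writes == nil {
		b.writes = map[string]string{}
	}
	b.writes[k] = v
}

// commit applies every buffered write to the store in one step.
func commit(store map[string]string, b *batch) {
	for k, v := range b.writes {
		store[k] = v
	}
}

func main() {
	store := map[string]string{}

	// The split commit writes the updated range descriptor and the committed
	// txn record in the same batch, so a store should either have both
	// effects or neither -- which is why seeing the descriptor persisted but
	// the txn record missing is so surprising.
	var b batch
	b.Put("range-descriptor/r336", "post-split bounds")
	b.Put("txn-record/split-txn", "COMMITTED")
	commit(store, &b)

	fmt.Println(len(store)) // 2: both keys present, or (without the commit) zero
}
```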
@awoods187 I was trying to spin up five of these clusters but ran into the resource limit for all of them. Do you know how to bump the limit?
(I'll try with some less beefy machines for now) |
It's easy enough to bump the resource limit. You have to go to the support center in the AWS console and file a ticket. It has a form for the region, type, and quota size. For smaller bumps it should be automated. For larger ones they usually get to it within a couple of hours.
Let's recap. We seem to have three unexplained things happening here:
What's the common thread between these facts? Beats me.
Unfortunately I was not able to get any more Raft logs or info. The Raft log key range for r336 is
And so I've tried stuff like:
And I got nothing before log index 34, which we already had. I've also taught

I've also been trying to reproduce by running the
#35861 should help tickle these things as they happen. |
Describe the problem
Hit a fatal error that killed a node while importing TPC-C 10k on a 7-node cluster of 72-vCPU machines.
To Reproduce
roachprod create $CLUSTER -n 8 --clouds=aws --aws-machine-type-ssd=c5d.18xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-7 cockroach
roachprod stage $CLUSTER:8 workload
roachprod start $CLUSTER:1-7 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
Environment:
v19.1.0-beta.20190304-135-gd8f7e85
cockroach.log