stability: replicaDescCache: clobbering {2 2 5} with {6 6 5} #7224
Backing up data/logs on nodes 4 and 5. It'll take a little while; the data dir is ~80GiB.
There's another instance of this in #7184.
Thanks @mberhault. I'll look into the data files and try to find out which Range is the offending one, and what replicas and versions of meta entries we have. We shouldn't be seeing two different replicas with the same ReplicaID, naturally.
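For context, here is a minimal Go sketch of the invariant behind the panic in the issue title; the types and the cache layout are simplified stand-ins for illustration, not the actual CockroachDB code:

```go
package main

import "fmt"

// Simplified stand-ins for the real descriptor types; names and fields
// here are assumptions for illustration only.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type cacheKey struct {
	RangeID   int
	ReplicaID int
}

// replicaDescCache remembers which (node, store) we believe owns a given
// ReplicaID of a range. A ReplicaID is assigned at most once per range,
// so a cached entry must never change to a different replica.
type replicaDescCache struct {
	m map[cacheKey]ReplicaDescriptor
}

func (c *replicaDescCache) insert(rangeID int, desc ReplicaDescriptor) {
	k := cacheKey{RangeID: rangeID, ReplicaID: desc.ReplicaID}
	if old, ok := c.m[k]; ok && old != desc {
		// Two distinct replicas claiming the same ReplicaID breaks the
		// invariant; this is the shape of the panic in the issue title.
		panic(fmt.Sprintf("replicaDescCache: clobbering %v with %v", old, desc))
	}
	c.m[k] = desc
}

func main() {
	c := &replicaDescCache{m: map[cacheKey]ReplicaDescriptor{}}
	c.insert(3667, ReplicaDescriptor{NodeID: 2, StoreID: 2, ReplicaID: 5})
	// A message from a different store claiming the same ReplicaID panics,
	// e.g. "clobbering {2 2 5} with {6 6 5}".
	c.insert(3667, ReplicaDescriptor{NodeID: 6, StoreID: 6, ReplicaID: 5})
}
```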
@bdarnell: sorry, that's where my update went. I apparently pasted the "gamma rollback -> clobber!" problem into the wrong bug; it was supposed to be part of this one. I'll shuffle things around a bit.
We should also back up nodes 1, 2, 3 (god knows where the meta ranges are, or who sends the Raft message that kills things).
The stacktrace is:
I updated the original issue message with more details, including the message on each node:
Node 5 is crashing with:
If I were to take a guess, I would guess that the message comes in both orders. Can you verify that for Node5? (i.e. try a bunch of times; you should see it randomly alternate between the two messages.)
Right you are. No need to run it again, just grep through the logs:
Messages with #7227 patched in: Node 4:
Node 5:
Thanks, looks like I need to update the PR (store-id switched).
range 3667 is on nodes 2, 5, and 4. Here's the history for node4:
I'm going to run the same for node5 but am anticipating that we will have to look at Node 2 as well (which is running at the moment).
Now wait for it, node5:
This means that Node5 and Node4 have conflicting range descriptors - one thinks that 6 has been added at replicaID 5, the other one thinks it's 2.
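For illustration, a small Go sketch of the comparison being described: given two copies of the same range's descriptor, find the ReplicaIDs they assign to different replicas. The types, the helper `conflictingReplicaIDs`, and the example replica sets are hypothetical, not the actual r3667 data:

```go
package main

import "fmt"

// Simplified stand-ins; field names are assumptions, not the real protos.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// conflictingReplicaIDs reports every ReplicaID that two copies of the same
// range descriptor assign to different (node, store) pairs.
func conflictingReplicaIDs(a, b RangeDescriptor) []string {
	byID := func(d RangeDescriptor) map[int]ReplicaDescriptor {
		m := make(map[int]ReplicaDescriptor)
		for _, r := range d.Replicas {
			m[r.ReplicaID] = r
		}
		return m
	}
	bm := byID(b)
	var out []string
	for id, ra := range byID(a) {
		if rb, ok := bm[id]; ok && ra != rb {
			out = append(out, fmt.Sprintf("replicaID %d: %v vs %v", id, ra, rb))
		}
	}
	return out
}

func main() {
	// Hypothetical replica sets, loosely modeled on the disagreement above.
	node4 := RangeDescriptor{RangeID: 3667, Replicas: []ReplicaDescriptor{
		{NodeID: 2, StoreID: 2, ReplicaID: 1},
		{NodeID: 5, StoreID: 5, ReplicaID: 3},
		{NodeID: 4, StoreID: 4, ReplicaID: 4},
		{NodeID: 6, StoreID: 6, ReplicaID: 5},
	}}
	node5 := RangeDescriptor{RangeID: 3667, Replicas: []ReplicaDescriptor{
		{NodeID: 5, StoreID: 5, ReplicaID: 3},
		{NodeID: 4, StoreID: 4, ReplicaID: 4},
		{NodeID: 2, StoreID: 2, ReplicaID: 5},
	}}
	for _, c := range conflictingReplicaIDs(node4, node5) {
		fmt.Println(c) // replicaID 5: {6 6 5} vs {2 2 5}
	}
}
```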
@mberhault I think this is sufficiently broken to bring down the rest of the nodes and pull backups from the ones that haven't been backed up yet.
The timestamps in the diff above are:
We can see that Node4 itself proposed the replica change which led to its local range descriptor being written at that time.
Shot in the dark: if, by some mechanism, we would sometimes delete intents that should commit, we'd probably see behaviour like this - the Raft group updated, but the on-disk state (at least partially) lost, similar to #5291.
Node2 is also boring; it only receives a snapshot at a time between the first and second conflicting ChangeReplicas (so it must have been caused by the first, though it's awfully late for that).
Here's a breakdown of all proposals from the logs.
Scratch my post three comments above; I can't use grep. Node5 really is in on the action:
It grabs the lease from node4, proposes its conf change, sends a few snapshots, and gives the lease back to 4 (which then does the conflicting replica change).
Gist of all sent or received snapshots: https://gist.github.com/tschottdorf/c9630dfce6a66c49a8e5c655053b0e2d
tl;dr (not in temporal order):
To comment on my "shot in the dark" earlier: nodes 4 and 5 have different versions of the same key, which shouldn't be the result of an operation doing the wrong thing at the replicated level (both replicas apply the same commands). We should look at the Raft logs too to see what index they have applied.
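A tiny sketch of that check, with hypothetical applied-index values (not CockroachDB API, just the reasoning encoded as code):

```go
package main

import "fmt"

// diagnose encodes the reasoning above: replicas that have applied the same
// Raft log prefix must have identical on-disk state, so comparing applied
// indexes tells us on which side of Raft the divergence happened. The inputs
// would come from each node's Raft state (e.g. a debug dump).
func diagnose(appliedA, appliedB uint64, sameValue bool) string {
	switch {
	case appliedA != appliedB:
		return "different applied indexes: one replica may simply be lagging"
	case sameValue:
		return "same applied index and same value: no divergence for this key"
	default:
		return "same applied index but different values: state diverged below Raft (e.g. a lost write)"
	}
}

func main() {
	// Hypothetical indexes for node4 and node5.
	fmt.Println(diagnose(1042, 1042, false))
}
```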
This is happening pretty regularly. Relevant excerpt from the acceptance log:
Cannot hurt to have this for cockroachdb#7224.
This repros quickly with
as pointed out by @cuongdo and @BramGruneir. I'm investigating. It looks like the first node proposes the two conflicting upreplications that soon thereafter lead to the clobbering. I hope it's the same issue as here, because if so, this will hopefully be well understood soon.
Looks at least similar:
i.e. node3 and node2 both at some point messaged us and claimed to have replicaID=2 for range 1. Here are the descriptors for range 1:
That seems clean. Reading bottom to top, the range first splits, then adds store2 at replicaid=2, then adds store3 at replicaid=3. Then, finally, another split. But the panic message indicates something different, namely that both store3 and store2 talked to us claiming they had replica=2. And in fact, going through the info messages produced by the replica change, we see
and the second one
This means that store3 actually has replicaID=2 and store2 should have replicaID=3. I'm going to look at their data files next.
Nodes 2 and 3 agree as far as the range descriptors and meta entries are concerned (the latter are read with
The plot is thickening. It looks like the descriptors we write are somehow messed up: when store 3 gets replica_id 2, we write a desc with replica_id=2. Meaning that we probably mix up replica_id and store_id somewhere. Which means that when store 3 gets its replica first, and then store 2, they will be mixed up - it should be (3,2) (2,3) but you get (3,3) (2,2). However, the in-memory state doesn't have that issue. Still speculation, but I have a repro with binaries that I can put assertions into and a <10min repro cycle, so I hope to nail this one soon.
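To make the suspected mix-up concrete, here is a toy Go sketch; the descriptor type and helper names are invented for illustration (the real update happens when ChangeReplicas writes the new descriptor), and, as the later comments show, this particular theory turned out not to explain the test repro:

```go
package main

import "fmt"

// Simplified stand-ins; names and fields are assumptions for illustration.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type RangeDescriptor struct {
	NextReplicaID int
	Replicas      []ReplicaDescriptor
}

// addReplicaCorrect gives the new store the next free ReplicaID.
func addReplicaCorrect(d *RangeDescriptor, nodeID, storeID int) {
	d.Replicas = append(d.Replicas,
		ReplicaDescriptor{NodeID: nodeID, StoreID: storeID, ReplicaID: d.NextReplicaID})
	d.NextReplicaID++
}

// addReplicaBuggy models the suspected mix-up: the StoreID lands in the
// ReplicaID field. If store 3 is added before store 2, the (store, replica)
// pairs come out as (3,3) and (2,2) instead of (3,2) and (2,3).
func addReplicaBuggy(d *RangeDescriptor, nodeID, storeID int) {
	d.Replicas = append(d.Replicas,
		ReplicaDescriptor{NodeID: nodeID, StoreID: storeID, ReplicaID: storeID}) // bug: should be d.NextReplicaID
	d.NextReplicaID++
}

func main() {
	good := &RangeDescriptor{NextReplicaID: 2}
	bad := &RangeDescriptor{NextReplicaID: 2}
	for _, store := range []int{3, 2} { // store 3 gets its replica before store 2
		addReplicaCorrect(good, store, store)
		addReplicaBuggy(bad, store, store)
	}
	fmt.Println("expected:", good.Replicas) // [{3 3 2} {2 2 3}]
	fmt.Println("observed:", bad.Replicas)  // [{3 3 3} {2 2 2}]
}
```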
The tests were sloppy in setting up the initial cluster with the downloaded data directories. There was a window of time during which the pre-restore cluster and the post-restore cluster had nodes running simultaneously. See cockroachdb#7224.
The wrong plot thickened: the allocator tests were mixing two separate clusters and the panic was an artifact of RangeDescriptors in disagreement. #7625
Just a heads up that we lost the logs for the gamma backup.
I believe this issue is no longer relevant since 2f1f092 went in. @tschottdorf?
The underlying bug may still be relevant since it has never been resolved, but the mechanism by which it would fail would have changed completely. I'm OK with closing this issue since a lot of water has gone under the bridge and there's nothing actionable (and we'll remember this issue should this happen again), but others may disagree.
I'd vote to keep it open since this thread may still be useful in identifying the underlying bug, but I don't feel too strongly either way.
Could you give this issue a new title?
I'd rather not; "clobbering" is how I remember/search for this issue. Better to close it than retitle it at this point, I think.
Closing for now.
Revisiting this in light of #9265: I don't think it's related. ChangeReplicas does have a bug that would allow
summary:
gamma cluster upgraded from 093e1e3 to 0b950ce crashed multiple nodes with "stability: panic on basictracer.errAssertionFailed" #7185.
Nodes 4 (104.196.98.24) and 5 (104.196.39.184) crashed with:
Attempted upgrade to 0b950ce, still crashing with same trace.
Node 4 is crashing with:
Node 5 is crashing with: