stability: raft panic after process death #9037
Could be related to #9029.
This reverts commit a5ce7dd. Moving to 4.9 introduced some bugs: cockroachdb#9029 and possibly cockroachdb#9037. Reverting pending investigation.
Assigning to @bdarnell as I believe he's looking into this at the moment.
Still poking around? There are races and panics I'd like to investigate, but I don't want to disrupt things.
Discussed offline: we should retain a copy of the failing node's data, but there seems to be no good reason not to wipe it afterwards.
This panic means that the leader believes this node has log entry 20055 (i.e. it has sent a MsgAppResp for this entry), but the node only has up to 20053 (and the data on disk appears to be consistent with this). It doesn't look like this node was recently added to the range, so snapshots shouldn't be involved.
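To make the failure mode concrete, here is a minimal sketch of the invariant behind the panic, with simplified names rather than the actual etcd/raft code: the leader has seen a MsgAppResp for index 20055, so it tells the follower to commit up to 20055, but the follower's durable log ends at 20053, and raft panics rather than commit entries it does not have.

```go
// Illustrative sketch only; the field names, the commitTo helper, and the
// panic message are assumptions, not the real etcd/raft implementation.
package main

import "fmt"

type raftLog struct {
	lastIndex uint64 // highest entry actually present on disk (20053 in this issue)
	committed uint64
}

// commitTo mirrors the invariant that fired here: a node must never be asked
// to commit past the last entry it has durably written.
func (l *raftLog) commitTo(tocommit uint64) {
	if tocommit > l.committed {
		if l.lastIndex < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]", tocommit, l.lastIndex))
		}
		l.committed = tocommit
	}
}

func main() {
	defer func() {
		// Recover only to print the message; in a real node this panic is fatal.
		fmt.Println("recovered:", recover())
	}()
	l := &raftLog{lastIndex: 20053, committed: 20053}
	l.commitTo(20055) // the leader's view (20055) is ahead of what is on disk
}
```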
I don't think there's more we could get out of the data directory. The one thing that would be interesting is to look at the rocksdb WAL to see if the missing log entries are there, but that log has already been compacted away by the restart attempts and/or my use of debugging tools.
Looking at the logs between rocksdb 4.8 and 4.9: facebook/rocksdb@0460e9dcc is immediately suspicious, since it changes the WriteBatch format, although it looks like it doesn't affect batches that don't use the new two-phase commit features. (note that there is another version of this commit in the history which was more intrusive, but it was subsequently reverted). And we have a test (storage/engine/batch_test.go) which is supposed to detect any relevant changes to the batch encoding.
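For illustration, here is a hedged sketch of the kind of guard such a test can provide, assuming the classic WriteBatch representation (a 12-byte header of an 8-byte sequence number plus a 4-byte count, followed by tagged records): hand-roll the expected bytes and compare them with a golden value, so a silent format change in a RocksDB upgrade fails the test. The helper below is a stand-in, not the actual batch_test.go.

```go
// Sketch of a golden-bytes check for the assumed WriteBatch encoding.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const kTypeValue = 0x1 // assumed record tag for a plain Put

// encodePut hand-rolls a one-entry batch: header, tag, varint-prefixed key,
// varint-prefixed value.
func encodePut(key, value []byte) []byte {
	var buf bytes.Buffer
	var header [12]byte // sequence (8 bytes, zero here) + count (4 bytes)
	binary.LittleEndian.PutUint32(header[8:], 1)
	buf.Write(header[:])
	buf.WriteByte(kTypeValue)
	var tmp [binary.MaxVarintLen64]byte
	buf.Write(tmp[:binary.PutUvarint(tmp[:], uint64(len(key)))])
	buf.Write(key)
	buf.Write(tmp[:binary.PutUvarint(tmp[:], uint64(len(value)))])
	buf.Write(value)
	return buf.Bytes()
}

func main() {
	got := encodePut([]byte("a"), []byte("b"))
	want := []byte("\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x01\x01a\x01b")
	fmt.Println(bytes.Equal(got, want)) // true under the assumed format
}
```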
I don't think we can blame this on rocksdb 4.9 - we don't have enough history with unclean shutdowns to say that they were working fine before. We could put rocksdb in a stricter mode which would cause it to fail at startup instead of shortly after startup. The root cause is either that we're not fsyncing when we're supposed to, or that fsync isn't going all the way to the disk (whatever that means in the GCE environment. FWIW, this machine is using the "local SSD" option with an ext4 filesystem). As far as I can tell, all the appropriate durability options are set by default and rocksdb is calling fsync as expected.
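As a generic illustration of the durability contract at issue (plain Go, not CockroachDB or RocksDB code): an acknowledged write is only crash-safe once fsync has been issued and the device has honored it.

```go
// Minimal sketch: append an entry and fsync before acknowledging it. The path
// and entry contents are made up for the example.
package main

import (
	"log"
	"os"
)

func appendDurably(path string, entry []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(entry); err != nil {
		return err
	}
	// Without this Sync (fsync(2)), the entry may only live in the page cache
	// and can vanish on an unclean shutdown even though the write "succeeded",
	// which is exactly the kind of loss being ruled out above.
	return f.Sync()
}

func main() {
	if err := appendDurably("/tmp/raft-log-example", []byte("entry 20054\n")); err != nil {
		log.Fatal(err)
	}
}
```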
I am occasionally able to reproduce this on GCE after applying #10583:
Note that the "ADD" and "DEL" logs are something I added in pursuit of #9017.
That's scary. Would be good to reproduce this with the extra logging.
OK, with #10634 applied I was able to reproduce this with the extra logging: https://gist.github.com/tamird/914d125022cfa7cc40248d07c2ce4455
This reproduces in about 30 seconds on GCE with the following diff:

```diff
diff --git a/pkg/storage/client_raft_test.go b/pkg/storage/client_raft_test.go
index f1aac22..7c6be7f 100644
--- a/pkg/storage/client_raft_test.go
+++ b/pkg/storage/client_raft_test.go
@@ -1994,11 +1994,21 @@ func TestRaftAfterRemoveRange(t *testing.T) {
 // number of repetitions adds an unacceptable amount of test runtime).
 func TestRaftRemoveRace(t *testing.T) {
 	defer leaktest.AfterTest(t)()
-	mtc := startMultiTestContext(t, 3)
+
+	const numStores = 10
+	mtc := startMultiTestContext(t, numStores)
 	defer mtc.Stop()
-	rangeID := roachpb.RangeID(1)
-	mtc.replicateRange(rangeID, 1, 2)
+	const rangeID = roachpb.RangeID(1)
+
+	{
+		// Replicate the range to all stores in a single call because it's faster.
+		var storeIndexes []int
+		for i := 1; i < numStores; i++ {
+			storeIndexes = append(storeIndexes, i)
+		}
+		mtc.replicateRange(rangeID, storeIndexes...)
+	}
 	for i := 0; i < 10; i++ {
 		mtc.unreplicateRange(rangeID, 2)
```

With the invocation (on GCE):
The increased number of stores appears to increase the probability. From what I can tell, we're applying a snapshot that causes the loss of one or more raft log entries.
A reduced failure run: https://gist.githubusercontent.com/tamird/ce2397a2dccb691db0e42fe984a94475/raw/9fd8c190d8576d85ffbf5f365ab2a7f1e82f1633/test.log This run also exhibits the same duplicate raft ID.
Here's what's going on: when we apply a preemptive snapshot, we nil out the replica's raft group, to be recreated by the next raft message. Normally, that's going to be a message from the new incarnation of the range. However, in this case, we get a message for the old incarnation (which has been held up in some queue somewhere), so we recreate the raft group with its old ID. This almost works (dropping and recreating the raft group is similar to what happens when a node restarts), but the snapshot process has discarded the uncommitted tail of the log (the snapshot itself only contains committed log entries, on the assumption that uncommitted entries can be replayed from the leader, and we delete everything related to the previous incarnation of the range except its HardState). When the leader asks this replica to commit an entry (which that replica had previously acknowledged writing to its log) and it's not there, it panics.

I think there are two solutions. One option would be to preserve the uncommitted tail of the log across preemptive snapshots (except to the extent that it conflicts with newer committed entries). I think this would be a fairly simple code change, but it feels dirty to combine log entries from two different sources like this. The alternative would be to ensure that after we apply a preemptive snapshot, the replica cannot be used under its old ID any more. The simplest way to do this would be to refuse any preemptive snapshots while an older replica still exists (and queue it up for an immediate GC).
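A rough sketch of what that second option could look like, using simplified stand-in types (Store, Replica, a gcQueue channel) rather than the real cockroachdb code: refuse the preemptive snapshot while an initialized replica still exists and queue that replica for GC.

```go
// Illustrative only; the real snapshot-application path is more involved.
package sketch

import "errors"

var errExistingReplica = errors.New("cannot apply preemptive snapshot: older replica still present")

type Replica struct {
	rangeID     int64
	initialized bool
}

type Store struct {
	replicas map[int64]*Replica
	gcQueue  chan *Replica // stand-in for the replica GC queue
}

// canApplyPreemptiveSnapshot rejects the snapshot if an older, initialized
// replica of the range exists, and enqueues that replica for immediate GC so
// that a later snapshot can be accepted once the old incarnation is gone.
func (s *Store) canApplyPreemptiveSnapshot(rangeID int64) error {
	if r, ok := s.replicas[rangeID]; ok && r.initialized {
		select {
		case s.gcQueue <- r:
		default: // queue full; the replica scanner will pick it up later
		}
		return errExistingReplica
	}
	return nil
}
```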
Not sure I understand this proposal. How would you know when the older replica is definitively gone and no more messages will be received for it?
When a replica is GCed, we write a tombstone with its replica ID, and that makes us ignore any future messages for that replica ID.
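To make that concrete, a hedged sketch of the tombstone mechanism with illustrative names (not the actual cockroachdb types): replica GC persists the next allowable replica ID, and any later message addressed to a lower ID is dropped.

```go
// Illustrative only; the real tombstone is persisted to the storage engine.
package sketch

// ReplicaTombstone is written when a replica is garbage collected.
type ReplicaTombstone struct {
	NextReplicaID int // smallest replica ID that may still be created for this range
}

// tombstones stands in for the persisted per-range tombstones.
type tombstones map[int64]ReplicaTombstone

// onReplicaGC records a tombstone so that stale messages addressed to the
// removed replica (or any older incarnation) are ignored from now on.
func (ts tombstones) onReplicaGC(rangeID int64, removedReplicaID int) {
	ts[rangeID] = ReplicaTombstone{NextReplicaID: removedReplicaID + 1}
}

// shouldDropMessage reports whether an incoming raft message targets a replica
// incarnation that has already been GCed.
func (ts tombstones) shouldDropMessage(rangeID int64, toReplicaID int) bool {
	t, ok := ts[rangeID]
	return ok && toReplicaID < t.NextReplicaID
}
```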
Looks to me like the tombstone prevents the creation of a replica with an old replica ID, but it does not itself handle ignoring messages with an older replica ID on a newer replica; that is accomplished elsewhere. I think preemptive snapshots created a hole in the handling of messages from old replicas. The preemptive snapshot can create the
No, we can't just use tombstone.NextReplicaID as our replica ID - that ID has (probably) been assigned to someone else (and I don't think we even hit this code path for the case we're discussing here). We need to A) respect the tombstone earlier in tryGetOrCreateReplica when we're calling setReplicaID and B) write a tombstone when a preemptive snapshot is applied over an older replica.
We don't want to load the tombstone on every call to tryGetOrCreateReplica; perhaps a minReplicaID field on the Replica would do.
How can we be applying a preemptive snapshot over an older replica? I could have sworn we rejected preemptive snapshots if the replica already existed and was initialized, but it appears we don't.
We'd only need to load the tombstone in tryGetOrCreateReplica if we're going to call setReplicaID (that is, if the incoming message's replica ID differs from the one in our Replica object). But a minReplicaID field could work too. We used to disallow preemptive snapshots when the raft group existed, but I changed that in #8613 when we stopped ignoring errors from preemptive snapshots. This was necessary to get the tests to pass, but there are probably other solutions we could use instead.
@bdarnell Unless you've already started on this, I can take a crack at fixing it.
Go ahead; I'm working on #10602.
Reproduced on my laptop with:
Same error as @tamird was seeing:
Now to fix.
Introduce Replica.mu.minReplicaID which enforces the tombstone invariant that we don't accept messages for previous Replica incarnations. Make TestRaftRemoveRace more aggressive. Fixes cockroachdb#9037
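In the spirit of that commit message, a hedged sketch of how such a field can enforce the invariant; this is a simplification, not the actual Replica implementation.

```go
// Illustrative only; locking and persistence are omitted.
package sketch

import "fmt"

type Replica struct {
	mu struct {
		replicaID    int
		minReplicaID int // raised when a tombstone is seen or a snapshot replaces an older incarnation
	}
}

// setReplicaID enforces the tombstone invariant: a message for a replica ID
// below minReplicaID belongs to a previous incarnation and must not be allowed
// to recreate the old raft group.
func (r *Replica) setReplicaID(replicaID int) error {
	if replicaID < r.mu.minReplicaID {
		return fmt.Errorf("raft group deleted: replica ID %d is below minimum %d",
			replicaID, r.mu.minReplicaID)
	}
	r.mu.replicaID = replicaID
	r.mu.minReplicaID = replicaID
	return nil
}
```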
Fixes cockroachdb#16376. Under stress TestRaftRemoveRace failed with:

```
--- FAIL: TestRaftRemoveRace (1.45s)
	client_test.go:1031: change replicas of r1 failed: quota pool no longer in use
```

Consider the following:

- 'add replica' commands get queued up on the replicate queue
- leader replica steps down as leader due to timeouts, thus closing the quota pool
- commands get out of the queue, cannot acquire quota because the quota pool is closed, and fail with an error indicating so

TestRaftRemoveRace fails as it expects all replica additions to go through without failure. To reproduce this, the following minimal test (run under stress) is sufficient, where we lower RaftTickInterval and RaftElectionTimeoutTicks to make it more likely that leadership changes take place.

```go
func TestReplicateQueueQuota(t *testing.T) {
	defer leaktest.AfterTest(t)()
	sc := storage.TestStoreConfig(nil)
	sc.RaftElectionTimeoutTicks = 2             // Default: 15
	sc.RaftTickInterval = 10 * time.Millisecond // Default: 200ms
	mtc := &multiTestContext{storeConfig: &sc}
	defer mtc.Stop()
	mtc.Start(t, 3)

	const rangeID = roachpb.RangeID(1)
	mtc.replicateRange(rangeID, 1, 2)

	for i := 0; i < 10; i++ {
		mtc.unreplicateRange(rangeID, 2)
		mtc.replicateRange(rangeID, 2)
	}
}
```

The earlier version of TestRaftRemoveRace was written to reproduce the failure seen in cockroachdb#9037; it was agnostic to raft leadership changes.
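For context on the failure mode described there, a hedged sketch of a closeable quota pool with illustrative names (not the real implementation): once the pool is closed on a leadership change, pending and future acquisitions fail with the "quota pool no longer in use" error that the test tripped over.

```go
// Illustrative only; the real quota pool tracks byte quotas per proposal.
package sketch

import "errors"

var errQuotaPoolClosed = errors.New("quota pool no longer in use")

type quotaPool struct {
	quota  chan int64
	closed chan struct{}
}

func newQuotaPool(q int64) *quotaPool {
	qp := &quotaPool{quota: make(chan int64, 1), closed: make(chan struct{})}
	qp.quota <- q
	return qp
}

// acquire blocks until quota is available or the pool has been closed (for
// example because this replica stopped being the raft leader), in which case
// it returns errQuotaPoolClosed.
func (qp *quotaPool) acquire() (int64, error) {
	select {
	case q := <-qp.quota:
		return q, nil
	case <-qp.closed:
		return 0, errQuotaPoolClosed
	}
}

// close is called on leadership change; queued "add replica" commands that are
// still waiting for quota then fail instead of hanging.
func (qp *quotaPool) close() { close(qp.closed) }
```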
rho-1 (104.196.18.63) was killed uncleanly (I was trying to replace supervisord with a newer version and reset the machine to get around wedged systemctl). Build sha: 1a34ca3, with race-detector enabled.
After the machine came back, cockroach now dies at startup with:
Here is the log, including the last few attempts to start it and the initial death around I160901 16:46:15
cockroach.stderr.txt
The process is currently down, so feel free to poke around.