[WIP] storage: detect snapshots intended for a different replica ID #40459
Conversation
Before this change we'd apply a snapshot over an existing initialized replica regardless of the replica ID its descriptor contained for the recipient store. This is problematic if the current store was removed and re-added and in the meantime the range subsumed one or more other ranges. Specifically, maybeAcquireSnapshotMergeLock assumes that if the range descriptor grows, there will be collocated replicas to subsume. That invariant doesn't hold if the replica was removed and then added back after the range was modified.

Release note: None
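For illustration, here is a minimal, self-contained sketch of the kind of check being described: a snapshot whose descriptor addresses this store under a different replica ID than the one the local replica holds gets rejected. The type names and the checkSnapshotReplicaID helper are invented stand-ins for this sketch, not the actual storage package API.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, invented stand-ins for the descriptor types; they only loosely
// mirror the real ones.
type ReplicaID int
type StoreID int

type ReplicaDescriptor struct {
	StoreID   StoreID
	ReplicaID ReplicaID
}

type RangeDescriptor struct {
	Replicas []ReplicaDescriptor
}

// checkSnapshotReplicaID returns an error if the snapshot's range descriptor
// addresses this store under a replica ID other than the one the existing
// local replica holds. Applying such a snapshot would silently carry the
// replica across an intervening removal and re-add.
func checkSnapshotReplicaID(
	snapDesc RangeDescriptor, storeID StoreID, localReplicaID ReplicaID,
) error {
	for _, rd := range snapDesc.Replicas {
		if rd.StoreID != storeID {
			continue
		}
		if rd.ReplicaID != localReplicaID {
			return fmt.Errorf(
				"snapshot addressed to replica %d, but local replica has ID %d; "+
					"the old replica must be GC'ed first", rd.ReplicaID, localReplicaID)
		}
		return nil
	}
	return errors.New("snapshot's descriptor does not contain this store")
}

func main() {
	desc := RangeDescriptor{Replicas: []ReplicaDescriptor{{StoreID: 2, ReplicaID: 5}}}
	// The local replica still thinks it is replica 3: reject the snapshot.
	fmt.Println(checkSnapshotReplicaID(desc, 2, 3))
}
```

In the real code the rejection would presumably also have to ensure the stale replica is GC'ed before a snapshot under the new ID can be applied, which is what the discussion below turns on.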
6c5b159 to 2787eb0
This decidedly doesn't work.
The idea is roughly how I also thought we would solve this, but I don't think we can throw in a replicaGC when applying the snapshot; we need to do it when we first receive a message with a higher replicaID. Otherwise we're clobbering raft state that the new incarnation of the raft member might have made. For example, assume the replica is re-added while an election is going on. The first incoming request is an MsgVote from some candidate, under a replicaID that's newer than what the replica itself has. If we answer the vote and then later run a replicaGC, we're erasing the vote - no good. Instead we need to run the replicaGC before feeding the MsgVote into the system.

I'm curious what problems you're seeing with the current code; the above issue looked too subtle to me to be noticed in our unit tests.
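To make the ordering argument concrete, here is a toy, self-contained sketch. The Replica, RaftMessage, and wipeAndRecreate names are made up for this example and are not the real storage types; it only shows why any wipe of the old incarnation's state has to happen before the incoming message is handed to raft.

```go
package main

import "fmt"

// Invented stand-ins for this sketch only; not the real storage types.
type ReplicaID int

type RaftMessage struct {
	ToReplicaID ReplicaID
	IsVote      bool // e.g. an MsgVote from a candidate
}

type Replica struct {
	ID        ReplicaID
	hardState string // placeholder for the persisted raft HardState (term, vote, commit)
}

// handleRaftMessage sketches the ordering constraint: if the message addresses
// a replica ID newer than ours, the old incarnation's state must be wiped (or
// the message dropped and the replica queued for GC) *before* the message is
// stepped into raft. Otherwise a vote cast now could later be erased by the
// GC, violating raft's promise not to vote twice in a term.
func handleRaftMessage(r *Replica, msg RaftMessage, wipeAndRecreate func(*Replica, ReplicaID)) {
	if msg.ToReplicaID > r.ID {
		// We were removed and re-added under a new ID; clear the old state first.
		wipeAndRecreate(r, msg.ToReplicaID)
	}
	// Only now is it safe to let raft act on the message (e.g. record a vote).
	if msg.IsVote {
		r.hardState = fmt.Sprintf("voted as replica %d", r.ID)
	}
}

func main() {
	r := &Replica{ID: 3}
	wipe := func(r *Replica, newID ReplicaID) {
		r.hardState = "" // discard the old incarnation's raft state
		r.ID = newID
	}
	handleRaftMessage(r, RaftMessage{ToReplicaID: 7, IsVote: true}, wipe)
	fmt.Println(r.ID, r.hardState)
}
```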
After stressing it for longer it still hit the previous assertion with this code. I'm not clear on how this happened, but I went to bed before digging too deeply.

Makes sense. It seems like in cockroach/pkg/storage/store.go, lines 3940 to 3962 (at 58d0fc3).

Am I reading that code incorrectly? Would it be reasonable to explicitly do the remove in processRequestQueue if we observe the
I think we need to do it pretty much exactly as I said, which is that whenever you see a raft message in which the replicaID is higher than that of the receiving replica, the receiving replica wipes its state before processing that message. I think what you're suggesting is that when you receive a ReplicaTooOldError from the raft transport, you consider yourself for GC -- we already do that, but note that at this point you might've been re-added already under a new replicaID and made promises under said ID. You need to react to learning about the new ID by wiping the state right there (well, or dropping these messages and queueing for GC, hoping that the GC queue will do the right thing and realize that it can GC the replica even though it is a member, because the replicaID is lower). Maybe try this "drop and gc queue" approach first as it's less invasive. It would look something like this:

```diff
diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 418c2cc531..49df70b75b 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -3585,6 +3585,23 @@ func (s *Store) processRequestQueue(ctx context.Context, rangeID roachpb.RangeID
 	var hadError bool
 	for i, info := range infos {
 		last := i == len(infos)-1
+		{
+			hackRepl, err := s.GetReplica(info.req.RangeID)
+			if err != nil {
+				// Probably need to handle RangeNotFound by proceeding below.
+			} else {
+				// NB: need to do this whole block under a lock so the repl can't be
+				// created under a lower replicaID underneath us; hard to do correctly.
+				hackRepl.mu.Lock()
+				replID := hackRepl.mu.replicaID
+				hackRepl.mu.Unlock()
+				if replID < info.req.ToReplica.ReplicaID {
+					// Drop this message and queue the stale replica for GC.
+					s.replicaGCQueue.MaybeAddAsync(ctx, hackRepl, s.cfg.Clock.Now())
+					continue
+				}
+			}
+		}
 		pErr := s.withReplicaForRequest(ctx, info.req, func(ctx context.Context, r *Replica) *roachpb.Error {
 			// Save the last Replica we see, since we don't know in advance which
 			// requests will fail during Replica retrieval. We want this later
```
Now that I look at your post again, I think you weren't suggesting what I thought. Did you mix up the cases? What you're describing looks like the case in which a replica receives a message from a peer that is out of date (and you want to ignore that message and tell the peer it should consider a GC). In the situation we're describing, we are a peer that learns that it is out of date but does so in the same moment that it also learns that it's a peer again under a new ID.
I did indeed mix up which replica we were looking at. That seems like a reasonable spot to detect the replica ID mismatch.

Yep, that's where it would have to sort of live in the end! I just wasn't able to prototype anything into that tangle. :)
Without the preceding commit, this fatals as seen in cockroachdb#39796:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

Note that cockroachdb#40459 will break these tests since it will not allow the replicaID to change without an intervening GC cycle, but we now rely on catching up via the raft log across a removal, split, and then re-add. At that point, the only way in which a split trigger could execute while in non-member status is if a replica got removed but would still receive a later split trigger. This could happen as follows: the log is `[..., remove, split]`;

a) the follower receives `split`,
b) it learns that `split` is committed, so that
c) it will apply both `remove` and then `split`.

a) seems possible: if the removed follower is trailing a bit, the leader may append both entries at the same time before it (the leader) applies the replication change that removes the follower from the leader's config. b) seems possible like above, at least once we do async command application, if said append carries a commit index encompassing the append itself. c) will then naturally happen in today's code. There's no way we can reasonably enact this scenario in our testing at this point, though.

The suggestion for cockroachdb#40459 is to essentially revert this commit, so that we at least retain the coverage related to the RHS of this test.

Release note (bug fix): Avoid crashes with message "replica descriptor of local store not found".
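As a rough illustration of the guard the preceding commit is described as adding, here is a self-contained toy sketch. The applySplit function and the descriptor types are invented for this example and do not mirror the real split-trigger code; the point is only that a split applied on a store that is no longer in the right-hand side's descriptor is skipped instead of fataling.

```go
package main

import "fmt"

// Invented, simplified descriptor types for this sketch.
type StoreID int

type ReplicaDescriptor struct{ StoreID StoreID }

type RangeDescriptor struct{ Replicas []ReplicaDescriptor }

// applySplit sketches the guard: if the local store is absent from the
// right-hand side's descriptor (possible when this replica was removed and is
// merely catching up on the raft log), skip creating the RHS replica instead
// of fataling.
func applySplit(rhs RangeDescriptor, localStore StoreID) {
	found := false
	for _, rd := range rhs.Replicas {
		if rd.StoreID == localStore {
			found = true
			break
		}
	}
	if !found {
		// Previously this situation hit a fatal error:
		// "replica descriptor of local store not found in right hand side of split".
		fmt.Println("local store not in RHS descriptor; skipping RHS creation")
		return
	}
	fmt.Println("creating RHS replica on local store")
}

func main() {
	rhs := RangeDescriptor{Replicas: []ReplicaDescriptor{{StoreID: 1}, {StoreID: 3}}}
	applySplit(rhs, 2) // store 2 was removed before the split; don't fatal
}
```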
Superseded by #40751