[WIP] storage: detect snapshots intended for a different replica ID #40459

ajwerner · 2019-09-04T06:11:09Z

Before this change we'd apply a snapshot over an existing initialized replica
regardless of the replica ID its descriptor contained for the recipient store.
This logic is problematic if the current store was removed and re-added and in
the meantime the range subsumed one or more other ranges. This is specifically
problematic because the code assumes in maybeAcquireSnapshotMergeLock that if
the range descriptor grows then there will be collocated replicas to subsume.
This invariant doesn't hold if the replica was removed and then added back
after the range is modified.

Release note: None

cockroach-teamcity · 2019-09-04T06:11:15Z

This change is

Before this change we'd apply a snapshot over an existing initialized replica regardless of the replica ID its descriptor contained for the recipient store. This logic is problematic if the current store was removed and re-added and in the meantime the range subsumed one or more other ranges. This is specifically problematic because the code assumes in maybeAcquireSnapshotMergeLock that if the range descriptor grows then there will be collocated replicas to subsume. This invariant doesn't hold if the replica was removed and then added back after the range is modified. Release note: None

ajwerner · 2019-09-04T07:03:07Z

This decidedly doesn't work.

tbg · 2019-09-04T07:22:37Z

The idea is roughly how I also thought we would solve this, but I don't think we can throw in a replicaGC when applying the snapshot - we need to do it when we first receive a message with a higher replicaID. Otherwise, we're clobbering raft state that the new incarnation of the raft member might have made.

For example, assume the replica is re-added and there's an election going on. The first incoming request is an MsgVote from some candidate, under a replicaID that's newer than what the replica itself has. If we answer to the vote and then later run a replicaGC, we're erasing the vote - no good. Instead we need to run the replicaGC before feeding the MsgVote into the system.

I'm curious what problems you're seeing with the current code, the above issue looked too subtle to me to be noticed in our unit tests.

ajwerner · 2019-09-04T14:02:17Z

I'm curious what problems you're seeing with the current code, the above issue looked too subtle to me to be noticed in our unit tests.

After stressing it for longer it still hit the previous assertion with this code. I'm not clear on how this happened but I went to bed before digging too deeply.

out.1567577356.txt

we need to do it when we first receive a message with a higher replicaID. Otherwise, we're clobbering raft state that the new incarnation of the raft member might have made.

Makes sense. It seems like in processRequestQueue() we use withReplicaForRequest() which seems like it should return a replica too old error.

cockroach/pkg/storage/store.go

Lines 3940 to 3962 in 58d0fc3

    
           var replTooOldErr error 
        
           if creatingReplica != nil { 
        
           	// Drop messages that come from a node that we believe was once a member of 
        
           	// the group but has been removed. 
        
           	desc := repl.mu.state.Desc 
        
           	_, found := desc.GetReplicaDescriptorByID(creatingReplica.ReplicaID) 
        
           	// It's not a current member of the group. Is it from the past? 
        
           	if !found && creatingReplica.ReplicaID < desc.NextReplicaID { 
        
           		replTooOldErr = roachpb.NewReplicaTooOldError(creatingReplica.ReplicaID) 
        
           	} 
        
           } 
        
           var err error 
        
           if replTooOldErr != nil { 
        
           	err = replTooOldErr 
        
           } else if ds := repl.mu.destroyStatus; ds.reason == destroyReasonRemoved { 
        
           	err = errRetry 
        
           } else { 
        
           	err = repl.setReplicaIDRaftMuLockedMuLocked(replicaID) 
        
           } 
        
           if err != nil { 
        
           	repl.raftMu.Unlock() 
        
           	return nil, false, err

Am I reading that code incorrectly? Would it be reasonable to explicitly do the remove in processRequestQueue if we observe the ReplicaTooOldError? I suspect that would still need to be coupled with logic to detect the snapshot with a later replica id.

tbg · 2019-09-04T14:18:30Z

I think we need to do it pretty much exactly as I said, which is that whenever you see a raft message in which the replicaID is higher than that of the receiving replica, the receiving replica wipes its state before processing that message.

I think what you're suggesting is that when you receive a ReplicaToOldError from the raft transport, you consider yourself for GC -- we already do that, but note that at this point you might've been readded already under a new replicaID and made promises under said ID. You need to react to learning about the new ID by wiping the state right there (well, or dropping these messages and queueing for GC, hoping that the GC queue will do the right thing and realize that it can GC the replica even though it is a member, because the replicaID is lower).

Maybe try this "drop and gc queue" approach first as it's less invasive. It would look something like this

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 418c2cc531..49df70b75b 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -3585,6 +3585,23 @@ func (s *Store) processRequestQueue(ctx context.Context, rangeID roachpb.RangeID
 	var hadError bool
 	for i, info := range infos {
 		last := i == len(infos)-1
+		{
+			hackRepl, err := s.GetReplica(info.req.RangeID)
+			if err != nil {
+				// probably need to handle RangeNotFound by proceeding
+			}
+			// NB: need to do this whole block under a lock so the repl cant be created
+			// under a lower replicaID under us now, will be hard to do this correctly
+			hackRepl.mu.Lock()
+			replID := hackRepl.mu.replicaID
+			hackRepl.mu.Unlock()
+
+			if replID < info.req.ToReplica.ReplicaID {
+				// drop this message, queue for GC
+				s.replicaGCQueue.MaybeAddAsync(ctx, hackRepl, s.cfg.Clock.Now())
+				continue
+			}
+		}
 		pErr := s.withReplicaForRequest(ctx, info.req, func(ctx context.Context, r *Replica) *roachpb.Error {
 			// Save the last Replica we see, since we don't know in advance which
 			// requests will fail during Replica retrieval. We want this later

tbg · 2019-09-04T14:21:02Z

Now that I look at your post again, I think you weren't suggesting what I thought. Did you mix up the cases? What you're describing looks like the case in which a replica receives a message from a peer that is out of date (and you want to ignore that message and tell the peer it should consider a GC). In the situation we're describing, we are a peer that learns that it is out of date but does so in the same moment that it also learns that it's a peer again under a new ID.

ajwerner · 2019-09-04T15:03:28Z

What you're describing looks like the case in which a replica receives a message from a peer that is out of date (and you want to ignore that message and tell the peer it should consider a GC).

I did indeed mix up which replica we were looking at in tryGetOrCreateReplica. Nevertheless underneath withReplicaForRequest (as opposed to above it as you suggest) seems like a reasonable spot to detect the replica ID mismatch.

tbg · 2019-09-04T15:07:21Z

seems like a reasonable spot to detect the replica ID mismatch.

Yep, that's where it would have to sort of live in the end! I just wasn't able to prototype anything into that tangle. :)

Without the preceding commit, this fatals as seen in cockroachdb#39796: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split Note that cockroachdb#40459 will break these tests since it will not allow the replicaID to change without an intermittent GC cycle, but we now rely on catching up via the raft log across a removal, split, and then re-add. At that point, the only way in which a split trigger could execute while in non-member status is if a replica got removed but would still receive a later split trigger. This could happen as follows: The log is `[..., remove, split]` a) the follower receives `split` b) it learns that `split` is committed so that c) it will apply both `remove` and then `split`. a) seems possible; if the removed follower is trailing a bit, the leader may append both entries at the same time before it (the leader) applies the replication change that removes the follower from the leader's config. b) seems possible like above, at least once we do async command application, if said append carries a commit index encompassing the append itself. c) will then naturally happen in today's code. There's no way we can reasonably enact this scenario in our testing at this point, though. The suggestion for cockroachdb#40459 is to essentially revert this commit, so that we at least retain the coverage related to the RHS of this test. Release note (bug fix): Avoid crashes with message "replica descriptor of local store not found".

ajwerner · 2019-09-13T21:52:26Z

Superseded by #40751

ajwerner mentioned this pull request Sep 4, 2019

storage: crash with snapshot widens existing replica, but no replica exists for subsumed key #40257

Closed

ajwerner force-pushed the ajwerner/debug-subsume-fatal-after-second-snapshot branch from 6c5b159 to 2787eb0 Compare September 4, 2019 07:00

tbg mentioned this pull request Sep 4, 2019

roachtest: clearrange/checks=false failed #38772

Closed

tbg mentioned this pull request Sep 6, 2019

storage: must never reuse state from old replicaIDs #40367

Closed

ajwerner closed this Sep 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] storage: detect snapshots intended for a different replica ID #40459

[WIP] storage: detect snapshots intended for a different replica ID #40459

ajwerner commented Sep 4, 2019

cockroach-teamcity commented Sep 4, 2019

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019

tbg commented Sep 4, 2019 •

edited

Loading

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019 via email

ajwerner commented Sep 13, 2019

[WIP] storage: detect snapshots intended for a different replica ID #40459

[WIP] storage: detect snapshots intended for a different replica ID #40459

Conversation

ajwerner commented Sep 4, 2019

cockroach-teamcity commented Sep 4, 2019

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019

tbg commented Sep 4, 2019 • edited Loading

ajwerner commented Sep 4, 2019

tbg commented Sep 4, 2019 via email

ajwerner commented Sep 13, 2019

tbg commented Sep 4, 2019 •

edited

Loading