
storage: ensure Replica objects never change replicaID #40892

Merged

Conversation

ajwerner
Contributor

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

  • Once a Replica object has a non-zero Replica.mu.replicaID it will not
    change.
  • If a raft message or snapshot addressed to a higher replica ID is received
    the current replica will be removed completely.
  • If a replica sees a ChangeReplicasTrigger which removes it then it
    completely removes itself while applying that command.
  • Replica.mu.destroyStatus is used to meaningfully signify the removal state
    of a Replica. Replicas about to be synchronously removed are in
    destroyReasonRemovalPending.

This hopefully gives us some new invariants:

  • There is only ever at most 1 *Replica which IsAlive() for a range on a store
    at a time.
  • Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.
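
To make the first invariant concrete, here is a minimal, hypothetical Go sketch of a set-once replica ID; the names (replica, setReplicaID) are illustrative stand-ins, not the PR's actual code:

package main

import (
    "fmt"
    "sync"
)

type replicaID int32

// replica is a toy stand-in for storage.Replica; it only models the
// "a non-zero replica ID is assigned at most once" invariant.
type replica struct {
    mu struct {
        sync.Mutex
        replicaID replicaID
    }
}

// setReplicaID assigns the ID the first time and rejects any later change.
// In the real change, traffic addressed to a higher replica ID removes the
// current Replica object entirely instead of mutating it.
func (r *replica) setReplicaID(id replicaID) error {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.mu.replicaID == 0 {
        r.mu.replicaID = id
        return nil
    }
    if r.mu.replicaID != id {
        return fmt.Errorf("replica ID cannot change from %d to %d; the replica must be removed and recreated", r.mu.replicaID, id)
    }
    return nil
}

func main() {
    var r replica
    fmt.Println(r.setReplicaID(3)) // <nil>
    fmt.Println(r.setReplicaID(7)) // error: remove and recreate instead
}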

Fixes #40367.

The first commit is #40889.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

@cockroach-teamcity
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from c207675 to f579994 Compare September 18, 2019 23:35
Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/store.go, line 2368 at r2 (raw file):

		batch := rightRepl.Engine().NewWriteOnlyBatch()
		defer batch.Close()
		if err := clearRangeData(oldRightDesc, rightRepl.Engine(), batch, clearReplicatedOnly, false); err != nil {

This code needs to clear the truncated state too. In practice it should load the hard state, clear everything, and then write back a hard state with a 0 Commit.
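
For illustration, a toy sketch of that order of operations with stand-in types rather than the real engine and raft HardState APIs:

package main

import "fmt"

// hardState is a stand-in for raft's HardState.
type hardState struct {
    Term, Vote, Commit uint64
}

// rhsState is a toy stand-in for the removed right-hand side's on-disk state.
type rhsState struct {
    hardState hardState
    data      map[string]string // replicated data, truncated state, etc.
}

// clearRemovedRHS loads the hard state, clears everything for the range
// (including the truncated state), and writes the hard state back with
// Commit = 0, since the now-empty log has no committed entries.
func clearRemovedRHS(s *rhsState) {
    hs := s.hardState            // load the existing hard state
    s.data = map[string]string{} // clear all range data, truncated state included
    hs.Commit = 0                // nothing in the empty log is committed
    s.hardState = hs             // write the hard state back
}

func main() {
    s := &rhsState{
        hardState: hardState{Term: 5, Vote: 2, Commit: 10},
        data:      map[string]string{"raft-truncated-state": "..."},
    }
    clearRemovedRHS(s)
    fmt.Printf("%+v\n", *s)
}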

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch 4 times, most recently from 18d895a to b234edb Compare September 19, 2019 14:20
Contributor Author

@ajwerner ajwerner left a comment


This is RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/client_raft_test.go, line 4785 at r3 (raw file):

			// is sent first (we block raft snapshots from being sent while sending
			// the learner snapshot but not the other way around).
			testutils.SucceedsSoon(t, func() error {

It seems like #40435 should have dealt with it. Perhaps it's because the raft snapshot is occurring from the raft leader which isn't the leaseholder.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from b234edb to dbb0616 Compare September 19, 2019 17:46
Member

@nvanbenschoten nvanbenschoten left a comment


Getting pulled into an interview so I'll leave this here. Made it most of the way through non-testing changes.

Reviewed 12 of 27 files at r2, 6 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/client_test.go, line 1246 at r3 (raw file):

}

func (m *multiTestContext) waitForUnreplicated(rangeID roachpb.RangeID, dest int) error {

Give this method a comment.


pkg/storage/queue.go, line 670 at r3 (raw file):

		log.Infof(ctx, "adding: priority=%0.3f", priority)
	}

stray line


pkg/storage/replica.go, line 1056 at r3 (raw file):

		return r.mu.replicaID
	}
	return r.mu.minReplicaID

Could you rename minReplicaID to ensure that we're always using this when we should be?


pkg/storage/replica_application_result.go, line 300 at r3 (raw file):

) (changeRemovedReplica bool) {
	// If this command removes us then we need to go through the process of
	// removing our replica from the store. After this method returns the code

nit: add a comma after "returns"


pkg/storage/replica_application_result.go, line 316 at r3 (raw file):

		log.Infof(ctx, "removing replica due to ChangeReplicasTrigger: %v", chng)
	}
	if err := r.postDestroyRaftMuLocked(ctx, r.GetMVCCStats()); err != nil {

Just like in destroyRaftMuLocked, we need to make sure that the removal has been durably sync'ed before we get here, right? We can plumb this logic up and sync when we apply the application batch, but it would probably be easiest to just sync the RocksDB WAL directly right before calling postDestroyRaftMuLocked.


pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):

		return false
	}
	_, existsInChange := change.Desc.GetReplicaDescriptor(storeID)

I think this missed my comment from the other PR. This isn't backward compatible.


pkg/storage/replica_application_state_machine.go, line 595 at r3 (raw file):

		const mustClearRange = false
		if err := rhsRepl.preDestroyRaftMuLocked(
			ctx, b.batch, b.batch, roachpb.ReplicaID(math.MaxInt32), clearRangeIDLocalOnly, mustClearRange,

Can we pull math.MaxInt32 into a constant somewhere and use it across all calls?


pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):

	// We'll complete the removal when it commits. Later logic detects the
	// removal by inspecting the destroy status
	if change := res.ChangeReplicas; change != nil &&

Pull this up below the res.Merge case.


pkg/storage/replica_application_state_machine.go, line 645 at r3 (raw file):

		b.r.mu.Unlock()
		b.changeRemovesReplica = true
		const mustUseRangeDeletionTombstone = true

Why? If we don't have a good reason for this then set it to false.


pkg/storage/replica_application_state_machine.go, line 793 at r3 (raw file):

// have been applied as of this batch. It also records the Range's mvcc stats.
func (b *replicaAppBatch) addAppliedStateKeyToBatch(ctx context.Context) error {
	if b.changeRemovesReplica {

Give this a comment.


pkg/storage/replica_application_state_machine.go, line 933 at r3 (raw file):

		// and the first range has a lease that there will not be a later hard-state.
		if shouldAssert {
			sm.r.mu.Lock()

nit: move back


pkg/storage/replica_application_state_machine.go, line 1026 at r3 (raw file):

	shouldAssert = !rResult.Equal(storagepb.ReplicatedEvalResult{})
	if !shouldAssert {
		return false, sm.batch.changeRemovesReplica

return false, false

We shouldn't be referencing the sm.batch. It's only a member of the state machine to avoid allocations. Also, we know this is false because otherwise rResult.ChangeReplicas != nil


pkg/storage/replica_application_state_machine.go, line 1094 at r3 (raw file):

			return nil
		}
		// If we're removed then we know that must have happened due to this

Why is this ever possible? If we were removed then how would we be here calling maybeApplyConfChange? Can't we assert that this is false?


pkg/storage/replica_destroy.go, line 71 at r3 (raw file):

// RemovalPending returns whether the replica is removed or in the process of
// being removed.
func (s destroyStatus) RemovalPending() bool {

RemovingOrRemoved? This name implies that this is implemented as return s.reason == destroyReasonRemovalPending.
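
For what it's worth, a hypothetical sketch of why the broader name fits: the predicate is true for both the pending and the already-removed states (destroyReasonRemoved here is an illustrative placeholder, not necessarily the real constant):

package main

import "fmt"

type destroyReason int

const (
    destroyReasonAlive destroyReason = iota
    destroyReasonRemovalPending
    destroyReasonRemoved // illustrative stand-in for any "already removed" state
)

type destroyStatus struct {
    reason destroyReason
}

// RemovingOrRemoved reports whether the replica is being removed or has
// already been removed, which is why the broader name reads better than
// RemovalPending.
func (s destroyStatus) RemovingOrRemoved() bool {
    return s.reason == destroyReasonRemovalPending || s.reason == destroyReasonRemoved
}

func main() {
    fmt.Println(destroyStatus{destroyReasonRemovalPending}.RemovingOrRemoved()) // true
    fmt.Println(destroyStatus{destroyReasonAlive}.RemovingOrRemoved())          // false
}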


pkg/storage/replica_destroy.go, line 145 at r3 (raw file):

	// Clear the range ID local data including the hard state.
	// We don't know about any user data so we can't clear any user data.
	// See the comment on this method for why this is safe.

Make a note that this is the only real reason why this method is separate from Replica.destroyRaftMuLocked.


pkg/storage/replica_destroy.go, line 166 at r3 (raw file):

	}

	if r.raftMu.sideloaded != nil {

Is it worth adjusting postDestroyRaftMuLocked to not suggest a compaction when the stats are empty and then just calling that here?


pkg/storage/replica_gc_queue.go, line 290 at r3 (raw file):

		nextReplicaID := replyDesc.NextReplicaID
		if currentMember {
			// If we're a member of the current range descriptor then we must not put

Why don't we just always put a tombstone down at currentDesc.NextReplicaID? Depending on the answer to that question, should this instead be:

if currentMember {
    nextReplicaID--
}

pkg/storage/replica_gc_queue.go, line 292 at r3 (raw file):

			// If we're a member of the current range descriptor then we must not put
			// down a tombstone at replyDesc.NextReplicaID as that would prevent us
			// from getting a snapshot and becoming a part of the rang.e

"rang.e"


pkg/storage/replica_init.go, line 171 at r3 (raw file):

		log.Fatalf(ctx, "cannot set replica ID from anything other than 0, currently %d",
			r.mu.replicaID)
		// The common case: the replica ID is unchanged.

Outdated comment.


pkg/storage/replica_init.go, line 249 at r3 (raw file):

// not yet been initialized. If not, it is created and set to campaign
// if this replica is the most recent owner of the range lease.
func (r *Replica) maybeInitializeRaftGroup(ctx context.Context) {

Leave a comment about why we don't do anything if the replica is already destroyed. Something about how we will check this later on.


pkg/storage/replica_raftstorage.go, line 684 at r4 (raw file):

const (
	clearRangeIDLocalOnly clearRangeOption = iota

It would be helpful to have an example of why each of these would be used.


pkg/storage/replica_raftstorage.go, line 690 at r4 (raw file):

// clearRangeData clears the data associated with a range descriptor. If
// rangeIDLocalOnly is true, then only the range-id local keys are deleted.

This comment is now stale.


pkg/storage/store.go, line 2143 at r3 (raw file):

		} else if ok {
			log.Infof(ctx, "split trigger found right-hand side with tombstone %+v, "+
				"RHS must have been removed: %v", tombstone, rightRng)

How is this case handled? I thought we were supposed to fall into the same case as if rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID {, but it's not immediately obvious that it does. Could you call this out?


pkg/storage/store.go, line 2155 at r3 (raw file):

			}
		}
		// If we're in this case then we know that the RHS has since been removed

nit: move this comment into the if since it's about "this case".


pkg/storage/store.go, line 2278 at r3 (raw file):

// of a Raft command.
//
// The funky case is if rightDesc is non-nil. If so then we know that the right

s/rightDesc/oldRightDesc/

Also, is this the best name? Why is it "old"? Why does splitPostApply pass the descriptor from the SplitTrigger, but only sometimes?


pkg/storage/store.go, line 2279 at r3 (raw file):

//
// The funky case is if rightDesc is non-nil. If so then we know that the right
// replica has been removed and re-added before applying this split and we need

"re-added as an uninitialized replica"


pkg/storage/store.go, line 2381 at r3 (raw file):

			return errors.Wrap(err, "failed to set hard state with 0 commit index for removed rhs")
		}
		if err := batch.Commit(true); err != nil {

This is all in the splitPostApply, right? So what would happen if the node crashed before getting here but after its split application batch had made it to disk?


pkg/storage/store.go, line 4175 at r3 (raw file):

			// sender that they may no longer exist.
			err = roachpb.NewRangeNotFoundError(rangeID, s.StoreID())
		} else if replicaID != 0 && repl.mu.replicaID != replicaID {

I don't understand this case. Why is this unexpected? Because we would have caught this in handleToReplicaTooOld?

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from dbb0616 to 9cd9bee Compare September 19, 2019 18:42
Contributor Author

@ajwerner ajwerner left a comment


Only made it through half of your comments before heading out to the scavenger hunt. Will finish after.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/client_test.go, line 1246 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this method a comment.

Done.


pkg/storage/queue.go, line 670 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

stray line

Done.


pkg/storage/replica.go, line 1056 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you rename minReplicaID to ensure that we're always using this when we should be?

Renamed to tombstoneReplicaID


pkg/storage/replica_application_result.go, line 300 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add a comma after "returns"

Done.


pkg/storage/replica_application_result.go, line 316 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Just like in destroyRaftMuLocked, we need to make sure that the removal has been durably sync'ed before we get here, right? We can plumb this logic up and sync when we apply the application batch, but it would probably be easiest to just sync the RocksDB WAL directly right before calling postDestroyRaftMuLocked.

It was easy enough to move the sync to when we commit the batch.
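
As a rough sketch of what that ordering buys (with an illustrative batch type, not the real engine interface): the removal must be durable before the post-destroy cleanup runs.

package main

import "fmt"

// batch is an illustrative stand-in for the storage engine's write batch.
type batch struct{ synced bool }

// commit applies the batch; sync=true forces a durable WAL sync.
func (b *batch) commit(sync bool) error {
    b.synced = sync
    return nil
}

// applyRemoval commits the application batch with sync=true so the replica's
// removal is durable before post-destroy cleanup runs.
func applyRemoval(b *batch, postDestroy func() error) error {
    if err := b.commit(true /* sync */); err != nil {
        return err
    }
    return postDestroy()
}

func main() {
    b := &batch{}
    _ = applyRemoval(b, func() error {
        fmt.Println("post-destroy, synced =", b.synced)
        return nil
    })
}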


pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think this missed my comment from the other PR. This isn't backward compatible.

Done.


pkg/storage/replica_application_state_machine.go, line 595 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we pull math.MaxInt32 into a constant somewhere and use it across all calls?

Done.


pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Pull this up below the res.Merge case.

It feels like logically the truncated state change could write to the batch. Why just below merge?


pkg/storage/replica_application_state_machine.go, line 645 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why? If we don't have a good reason for this then set it to false.

Done.


pkg/storage/replica_application_state_machine.go, line 793 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment.

Lifted it instead and commented there.


pkg/storage/replica_application_state_machine.go, line 933 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move back

Done.


pkg/storage/replica_application_state_machine.go, line 1026 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

return false, false

We shouldn't be referencing the sm.batch. It's only a member of the state machine to avoid allocations. Also, we know this is false because otherwise rResult.ChangeReplicas != nil

Done.


pkg/storage/replica_application_state_machine.go, line 1094 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this ever possible? If we were removed then how would we be here calling maybeApplyConfChange? Can't we assert that this is false?

Comment is stale, there is an assertion below. Updated.


pkg/storage/replica_destroy.go, line 71 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

RemovingOrRemoved? This name implies that this is implemented as return s.reason == destroyReasonRemovalPending.

Done.


pkg/storage/replica_destroy.go, line 166 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it worth adjusting postDestroyRaftMuLocked to not suggest a compaction when the stats are empty and then just calling that here?

It probably is.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch 3 times, most recently from 52436cc to b11044b Compare September 20, 2019 01:51
Contributor Author

@ajwerner ajwerner left a comment


TFTR! I moved the nasty hacks that had been added to SplitRange into splitPreApply and I think it's better.

It's ready for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/replica_destroy.go, line 145 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Make a note that this is the only real reason why this method is separate from Replica.destroyRaftMuLocked.

I've folded the whole thing up into destroyReplicaRaftMuLocked, I think it's better but I could go either way.


pkg/storage/replica_gc_queue.go, line 290 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why don't we just always put a tombstone down at currentDesc.NextReplicaID? Depending on the answer to that question, should this instead be:

if currentMember {
    nextReplicaID--
}

My bad. This is stale; this can never be the case. Notice that in the conditional above we cannot be a current member. This is literally the not-current-member case.


pkg/storage/replica_gc_queue.go, line 292 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"rang.e"

Gone.


pkg/storage/replica_init.go, line 171 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Outdated comment.

Done.


pkg/storage/replica_init.go, line 249 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a comment about why we don't do anything if the replica is already destroyed. Something about how we will check this later on.

I added a comment but it could be improved.


pkg/storage/replica_raftstorage.go, line 684 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It would be helpful to have an example of why each of these would be used.

Removed.


pkg/storage/replica_raftstorage.go, line 690 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment is now stale.

not anymore 😉


pkg/storage/store.go, line 2143 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How is this case handled? I thought we were supposed to fall into the same case a if rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID {, but it's not immediately obvious that it does. Could you call this out?

It is, it's a subset of that case. I don't know why we're reading the tombstone anymore. I removed it.


pkg/storage/store.go, line 2278 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/right/Desc/oldRightDesc/

Also, is this the best name? Why is it "old"? Why does splitPostApply pass the descriptor from the SplitTrigger, but only sometimes?

Removed.


pkg/storage/store.go, line 2279 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"re-added as an uninitialized replica"

Removed.


pkg/storage/store.go, line 2381 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is all in the splitPostApply, right? So what would happen if the node crashed before getting here but after its split application batch had made it to disk?

oof that's totally right. I was sort of worried about such things myself. I moved this whole set of cases into splitPreApply so that they happen atomically with the batch. I think it generally improves the change.


pkg/storage/store.go, line 4175 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't understand this case. Why is this unexpected? Because we would have caught this in handleToReplicaTooOld?

Yes, that or the above case.

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 8 of 27 files at r2, 1 of 1 files at r4, 15 of 15 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)


pkg/storage/client_raft_test.go, line 3004 at r5 (raw file):

	// partition node 0 from the others.
	partRange, err := setupPartitionedRange(mtc, 1, 0)
	require.NoError(t, err)

nit: add a new line after this so that it's clear that this is still setup and not actually partitioning yet.


pkg/storage/client_raft_test.go, line 4618 at r5 (raw file):

split and it

make it a little clearer that these are two different cases


pkg/storage/client_raft_test.go, line 4748 at r5 (raw file):

			require.True(t, store0Exists)

			// Now we should be able to have the RHS in

missing part of this comment


pkg/storage/client_raft_test.go, line 4750 at r5 (raw file):

			// Now we should be able to have the RHS in
			require.NoError(t, changeReplicas(roachpb.REMOVE_REPLICA, keyB, 0))
			require.NotNil(t, changeReplicas(roachpb.ADD_REPLICA, keyB, 0))

Could you add a // unsuccessful here? The bottom few lines assume that the replica was not re-added here, but scanning through this test indicates the opposite.


pkg/storage/client_raft_test.go, line 4753 at r5 (raw file):

			// Read the HardState of the RHS to ensure that we've written a tombstone
			// with our old replica ID and then confirm that we've voted.

Why would we automatically vote? We didn't call an election or anything, right?


pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):

Previously, ajwerner wrote…

Done.

Think it's worth keeping the first case at all? This isn't performance-critical whatsoever.


pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):

Previously, ajwerner wrote…

It feels like logically the truncated state change could write to the batch. Why just below merge?

I was thinking we should keep the Range configuration triggers together. No strong opinion though.

Are you worried about a command that truncates AND removes the replica?


pkg/storage/replica_application_state_machine.go, line 609 at r5 (raw file):

		}
		const mustClearRange = false
		const clearRangeIDLocalOnly = true

nit: re-order with above


pkg/storage/replica_application_state_machine.go, line 760 at r5 (raw file):

	// the applied state is stored in this batch, ensure that if the batch ends
	// up not being durably committed then the entries in this batch will be
	// applied again upon startup. If we're removing the replica's data then we

"However, if we're removing ..."


pkg/storage/replica_destroy.go, line 145 at r3 (raw file):

Previously, ajwerner wrote…

I've folded the whole thing up into destroyReplicaRaftMuLocked, I think it's better but I could go either way.

Yeah, that's better.


pkg/storage/replica_destroy.go, line 82 at r5 (raw file):

// know new replicas can never be created so this value is used even if we
// don't know the current replica ID.
const mergedTombstoneReplicaID roachpb.ReplicaID = math.MaxInt32

We should also adopt part of the comment from #40814 (review) about this not being required for correctness.

Also, are you planning on merging that PR?


pkg/storage/replica_gc_queue.go, line 225 at r5 (raw file):

	// Maybe it was deleted "under us" by being moved.
	currentDesc, currentMember := replyDesc.GetReplicaDescriptor(repl.store.StoreID())
	if desc.RangeID == replyDesc.RangeID && currentMember {

nit: not your change, but pull desc.RangeID == replyDesc.RangeID into a variable so it's more clear that the second case is the sameRange && !currentMember case.


pkg/storage/replica_learner_test.go, line 439 at r5 (raw file):

	tc.RemoveReplicasOrFatal(t, scratchStartKey, tc.Target(1)) // remove learner
	// We need to wait until the LHS range has been GC'd on server 1 before the
	// RHS can be successfully added. Wait for the RHS to exist.

What are the LHS and RHS in this test? Are these supposed to be referring to the first and second replica on node 2?


pkg/storage/replica_learner_test.go, line 564 at r5 (raw file):

	// started.
	close(blockSnapshotsCh)
	if err := g.Wait(); !testutils.IsError(err, `descriptor changed|raft group deleted`) {

Why is this now possible?


pkg/storage/replica_proposal_buf.go, line 628 at r5 (raw file):

	})
	if isRemoved {
		err = rp.mu.destroyStatus.err

We already check this before calling withGroupLocked on all code paths (for correctness), so I think you can make this fatal.


pkg/storage/store.go, line 2118 at r5 (raw file):

	// Check to make sure we don't know that the RHS has already been removed
	// from this store at the replica ID implied by this split.
	if !hasRightDesc || rightRepl.minReplicaIDRLocked() <= rightDesc.ReplicaID {

Two questions:

  1. Is the rightRepl.minReplicaIDRLocked() < rightDesc.ReplicaID case even possible? How could we have ever seen a replica with a lower range ID than the one our store was originally assigned at the time of the split?
  2. In the case where rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID and the replica was already on this store and then removed, will we be violating its raft tombstone by allowing the creation of this replica? Is that ok?

pkg/storage/store.go, line 2154 at r5 (raw file):

	hs.Commit = 0
	const mustUseClearRange = false
	const rangeIDLocalOnly = false

nit: wrong order


pkg/storage/store.go, line 2186 at r5 (raw file):

		// return.
		rightDesc, hasRightDesc := split.RightDesc.GetReplicaDescriptor(r.StoreID())
		if hasRightDesc && rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID {

Mind calling out the !hasRightDesc case separately? It's pretty subtle.


pkg/storage/store_snapshot.go, line 741 at r5 (raw file):

func (s *Store) shouldAcceptSnapshotData(
	ctx context.Context, snapHeader *SnapshotRequest_Header,
) (err error) {

nit: why was this necessary?


pkg/storage/store_test.go, line 573 at r5 (raw file):

		DestroyData: true,
	}); err != nil {
		t.Fatalf("didn't expect error re-removing same range: %v", err)

Why did this behavior change?

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from b11044b to 3c642a7 Compare September 20, 2019 14:42
Contributor Author

@ajwerner ajwerner left a comment


Thanks for such a quick next pass! Should be RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/client_raft_test.go, line 3004 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add a new line after this so that it's clear that this is still setup and not actually partitioning yet.

Also added an argument to the setup call to control whether it starts out active.


pkg/storage/client_raft_test.go, line 4618 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
split and it

make it a little clearer that these are two different cases

Done. Added (1) and (2) to show how they correspond to the comment below.


pkg/storage/client_raft_test.go, line 4748 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

missing part of this comment

Added more commentary.


pkg/storage/client_raft_test.go, line 4750 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a // unsuccessful here? The bottom few lines assume that the replica was not re-added here, but scanning through this test indicates the opposite.

Done.


pkg/storage/client_raft_test.go, line 4753 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why would we automatically vote? We didn't call an election or anything, right?

Voted is the wrong language. Furthermore, we really won't ever have voted in today's implementation, because nobody would promote the learner without successfully sending a learner snapshot, but the code is trying to be safe even if we did promote that RHS. I think the comment was aspirational.

Perhaps there should be another version of this test which goes and promotes the RHS to a voter and then calls an election and validates that the votes survive a split. Thoughts?


pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Think it's worth keeping the first case at all? This isn't performance-critical whatsoever.

VOTER_OUTGOING is in the descriptor but ends up in removed below. Don't we want to wait until we're actually out of the descriptor?


pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are you worried about a command that truncates AND removes the replica?

Yes.


pkg/storage/replica_application_state_machine.go, line 609 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: re-order with above

Done.


pkg/storage/replica_application_state_machine.go, line 760 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"However, if we're removing ..."

Done.


pkg/storage/replica_destroy.go, line 82 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should also adopt part of the comment from #40814 (review) about this not being required for correctness.

Also, are you planning on merging that PR?

Yes, just bors'd it. Will rebase.


pkg/storage/replica_gc_queue.go, line 225 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: not your change, but pull desc.RangeID == replyDesc.RangeID into a variable so it's more clear that the second case is the sameRange && !currentMember case.

Done.


pkg/storage/replica_learner_test.go, line 439 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What are the LHS and RHS in this test? Are these supposed to be referring to the first and second replica on node 2?

This is garbage now and was always a garbage comment. Removed.


pkg/storage/replica_learner_test.go, line 564 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this now possible?

Added commentary.


pkg/storage/replica_proposal_buf.go, line 628 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We already check this before calling withGroupLocked on all code paths (for correctness), so I think you can make this fatal.

Done.


pkg/storage/store.go, line 2118 at r5 (raw file):

  1. Is the rightRepl.minReplicaIDRLocked() < rightDesc.ReplicaID case even possible? How could we have ever seen a replica with a lower range ID than the one our store was originally assigned at the time of the split?

No

  2. In the case where rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID and the replica was already on this store and then removed, will we be violating its raft tombstone by allowing the creation of this replica? Is that ok?

rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID will happen if we've received messages for the RHS at that replica ID but never laid down a tombstone. I think it's a happy case.

Nevertheless, I'm actually not convinced that it's doing the right thing. Even if we don't have a right range descriptor, if we have a tombstone we know that it must be for a replica ID above the split replica ID.

The bad case (which seems unlikely but perhaps possible) is if this follower is partitioned for long enough that the first raft message it receives for the RHS is addressed to the later replica ID. It won't know that it needs to write a tombstone, so it won't, but it will have a replica ID above the split replica ID. I guess the worst case here is if we then crash and forget about that higher replica ID. That was the case I was trying to deal with here. Technically that case would be safe, but we would move backwards in terms of replica ID.

I've rolled back this minReplicaIDRLocked() method and made the logic more explicit.

One thing we could do is write a tombstone at the current replica ID when setting the replica ID from zero. I'm adding testing for these cases.
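
For illustration, a toy sketch of the tombstone check under discussion (names are stand-ins for the real range tombstone key and replica-creation path); writing the tombstone as soon as a non-zero replica ID is learned is what would make a store's replica IDs monotonic across restarts:

package main

import "fmt"

type replicaID int32

// tombstone records the minimum replica ID that may still be created for a
// range on this store.
type tombstone struct {
    NextReplicaID replicaID
}

// canCreateReplica rejects creation of a replica below the tombstone.
func canCreateReplica(ts *tombstone, id replicaID) error {
    if ts != nil && id < ts.NextReplicaID {
        return fmt.Errorf("replica ID %d is below tombstone NextReplicaID %d", id, ts.NextReplicaID)
    }
    return nil
}

func main() {
    ts := &tombstone{NextReplicaID: 4}
    fmt.Println(canCreateReplica(ts, 3)) // rejected: below the tombstone
    fmt.Println(canCreateReplica(ts, 5)) // allowed
}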


pkg/storage/store.go, line 2154 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: wrong order

Done.


pkg/storage/store.go, line 2186 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mind calling out the !hasRightDesc case separately? It's pretty subtle.

I've moved that checking into a method which has more commentary.


pkg/storage/store_snapshot.go, line 741 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: why was this necessary?

Removed.


pkg/storage/store_test.go, line 573 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why did this behavior change?

It used to be the case that we'd only remove replicas via the queue. Now we'll remove replicas on different paths. We make removal idempotent with the destroyStatus.

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch 2 times, most recently from aacc284 to 65f9e93 Compare September 20, 2019 17:20
Contributor

@bdarnell bdarnell left a comment


Reviewed 1 of 6 files at r1, 10 of 27 files at r2, 2 of 7 files at r3, 9 of 15 files at r5, 8 of 9 files at r6, 5 of 5 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/client_raft_test.go, line 4634 at r6 (raw file):

//      HardState.
//
//   *  Before the r20 processes the split r21 is removed and re-added to

Side note about testing: Before atomic replica changes, this kind of remove and re-add scenario happened fairly frequently as the replica queue came to different conclusions on multiple iterations. Atomic replica changes (hopefully) make that less common, which also makes it harder to test for things like this. Maybe we should add an explicit replica-thrashing mode for testing purposes.


pkg/storage/client_raft_test.go, line 4708 at r6 (raw file):

			// Make the transport flaky for the range in question to encourage proposals
			// to be sent more times and ultimately traced more.

What do you mean by "traced more"?


pkg/storage/client_raft_test.go, line 4724 at r6 (raw file):

			// Now we partition a replica's LHS, perform a split, removes the RHS, add it
			// back and make sure it votes. Then we'll do the two different cases of

Where do we make sure it votes?


pkg/storage/client_raft_test.go, line 4749 at r6 (raw file):

			// Remove and re-add the RHS to create a new uninitialized replica at
			// a higher replica ID. This will lead to a tombstone being written.

How does the tombstone get written while the partition is still active?


pkg/storage/client_raft_test.go, line 4754 at r6 (raw file):

			// and will be rolled back. Nevertheless it will have learned that it
			// has been removed at the old replica ID.
			require.NotNil(t, changeReplicas(roachpb.ADD_REPLICA, keyB, 0))

Can we verify the specific expected error here?


pkg/storage/merge_queue.go, line 294 at r7 (raw file):

			return err
		}

This file only has whitespace changes now.


pkg/storage/replica_application_result.go, line 298 at r7 (raw file):

func (r *Replica) handleChangeReplicasResult(
	ctx context.Context, chng *storagepb.ChangeReplicas,
) (changeRemovedReplica bool) {

Looks like this return value actually means that the replica was not removed.


pkg/storage/replica_raft.go, line 1343 at r7 (raw file):

//
// If this Replica is in the process of being removed this method will return
// true and a nil error.

So true with a non-nil error is impossible? This seems like a special error value might have been a better design.
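
A brief sketch of the design being suggested, assuming a hypothetical sentinel error (errRemoved is an illustrative name, not necessarily what the PR adopted):

package main

import (
    "errors"
    "fmt"
)

// errRemoved is a hypothetical sentinel returned instead of (true, nil).
var errRemoved = errors.New("replica is being removed")

// stepRaftGroup is a stand-in for the real method: callers distinguish the
// removal case with errors.Is rather than a separate boolean return value.
func stepRaftGroup(beingRemoved bool) error {
    if beingRemoved {
        return errRemoved
    }
    return nil
}

func main() {
    err := stepRaftGroup(true)
    fmt.Println(errors.Is(err, errRemoved)) // true: the replica was removed mid-step
}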


pkg/storage/store.go, line 2105 at r7 (raw file):

	}
	// Below is just a sanity check.
	// The only time that this case can happen is if we are currently not

This comment is unclear (what part of the two if statements is "this case"?)


pkg/storage/store.go, line 2117 at r7 (raw file):

	}

	// Check to make sure we don't know that the RHS has already been removed

The negation here is a little confusing. Add a comment that this is the common case (and I think it might be more clear if they were switched so the uncommon case were the indented conditional).


pkg/storage/store.go, line 2140 at r7 (raw file):

	// We might however have already voted at a higher term. In general
	// this shouldn't happen because we add learners and then promote them
	// only after we snapshot but for the sake of mixed version clusters and

Can we add a TODO to remove this in the next release, or will we need to keep this around longer?


pkg/storage/store.go, line 2144 at r7 (raw file):

	// old RHS and have SplitRange clear its data.
	//
	// We need to clear the data which the RHS would have inherited.

Inherited from what? What exactly is this data? How does the split trigger's write of raft state interact with the write here?


pkg/storage/store.go, line 2666 at r7 (raw file):

	if !ds.RemovingOrRemoved() {
		log.Fatalf(ctx, "cannot remove uninitialized replica which is not removal pending: %v", ds)

Should this check for exactly "pending" instead of RemovingOrRemoved?


pkg/storage/store.go, line 3477 at r7 (raw file):

	if !drop {
		isRemoved, err := r.stepRaftGroup(req)
		if isRemoved {

It looks like all callers of stepRaftGroup do this; maybe it should just return this error when isRemoved is true?


pkg/storage/store_test.go, line 2958 at r7 (raw file):

// Test that we set proper tombstones for removed replicas and use the
// tombstone to reject attempts to create a replica with a lesser ID.
func TestRemovedReplicaTombstone(t *testing.T) {

Do we have another test that covers the tombstone functionality now?

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 20, 2019
Prior to this commit the partitioning tests worked by creating a 3 node cluster
and then expressed constraints over the three nodes. It then validates that
the cluster conforms to the constraints by querying data and examining the
trace to determine which node held the data.

This is problematic for one because it is susceptible to cockroachdb#40333. In rare
cases we'll down-replicate to the wrong single node (e.g. if the right one
is not live) and we won't ever fix it.

It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes
where each expectation in the config can now refer to a leaseholder_preference
rather than a constraint and we'll allocate the additional nodes to 3
datacenters.

This larger test creates dramatically more data movement and has been useful
when testing cockroachdb#40892.

Release justification: Only touches testing and is useful for testing a
release blocker.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from 65f9e93 to 7fd2f9a Compare September 20, 2019 19:55
Contributor Author

@ajwerner ajwerner left a comment


TFTR! Next set of revisions posted. I'm typing up more tests for the tombstone cases. I'm also curious about how y'all feel re: writing a tombstone in getOrCreateReplica when creating an uninitialized replica with a non-zero replica ID. That would provide the invariant that a store will always have a monotonically increasing set of non-zero replica IDs for a given range, even across restart.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/client_raft_test.go, line 4708 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What do you mean by "traced more"?

Sloppy remnant of code from which this was copied. Removed.


pkg/storage/client_raft_test.go, line 4724 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Where do we make sure it votes?

This was aspirational before I wrote the test. Removed.

This test might be better if it actually went, at a low level, through the process of adding the node to the group as a voter even if the snapshot failed (which it always will). There's code in this PR that makes such an addition safe.


pkg/storage/client_raft_test.go, line 4749 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How does the tombstone get written while the partition is still active?

The partition only applies to messages for the LHS. Added some more comments to make it clearer.


pkg/storage/client_raft_test.go, line 4754 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can we verify the specific expected error here?

Done.


pkg/storage/merge_queue.go, line 294 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This file only has whitespace changes now.

Done.


pkg/storage/replica_application_result.go, line 298 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Looks like this return value actually means that the replica was not removed.

I think it's returning the right thing. We set the destroyStatus to destroyReasonRemovalPending in runPreApplyTriggers so if it's anything other than that then we return. The reason we don't check for alive is we may be in destroyReasonMergePending in which case we can process more raft requests (though I'm not certain we should). I tried to make the comment clearer. I also updated the testing knob to ensure that we also don't mark the destroyStatus so we just continue to process the remove like before this change. This allowed me to roll back some changes to TestRemoveRangeWithoutGC.


pkg/storage/replica_raft.go, line 1343 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So true with a non-nil error is impossible? This seems like a special error value might have been a better design.

Done.


pkg/storage/store.go, line 2105 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This comment is unclear (what part of the two if statements is "this case"?)

Tried to clarify.


pkg/storage/store.go, line 2117 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The negation here is a little confusing. Add a comment that this is the common case (and I think it might be more clear if they were switched so the uncommon case were the indented conditional).

Done.


pkg/storage/store.go, line 2140 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can we add a TODO to remove this in the next release, or will we need to keep this around longer?

This code needs to exist forever I believe. It has nothing to do with pre-emptive snapshots. I guess the mixed version here was with regards to future versions. I tried to make it clearer, see how you feel about it now.


pkg/storage/store.go, line 2144 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Inherited from what? What exactly is this data? How does the split trigger's write of raft state interact with the write here?

The split trigger above raft synthesizes a number of keys for the RHS, the most problematic of which is the truncated state. It was easier to delete everything than to delete everything but the hard state. We could probably assert that the Commit is 0 rather than setting it. I've tried to improve the comments.


pkg/storage/store.go, line 2666 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Should this check for exactly "pending" instead of RemovingOrRemoved?

It effectively does because of the above conditional but I've made it explicit.


pkg/storage/store.go, line 3477 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It looks like all callers of stepRaftGroup do this; maybe it should just return this error when isRemoved is true?

Done.


pkg/storage/store_test.go, line 2958 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Do we have another test that covers the tombstone functionality now?

Not yet. I'm typing that now for the next revision. Just wanted to get the first set of updates up first.

Member

@tbg tbg left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)


pkg/storage/replica_destroy.go, line 38 at r21 (raw file):

	// destroy a replica (e.g. processing a ChangeReplicasTrigger removing the
	// range and receiving a raft message with a higher replica ID).
	destroyReasonRemovalPending

Very nice simplification.


pkg/storage/store.go, line 2718 at r23 (raw file):

}

// removeUninitialzedReplicaRaftMuLocked removes an uninitialzied replica.

Two typos


pkg/storage/store.go, line 3689 at r23 (raw file):

				// NB: This response may imply that this range has been added back to
				// this store at a higher replica ID. We set RangeNotFound error despite
				// the fact that this newer replica may have already been created

Stale comment

@tbg tbg requested a review from nvanbenschoten September 24, 2019 20:20
Member

@tbg tbg left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)


pkg/storage/store.go, line 2560 at r22 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't want to hold the review up on this, but why do we need to set the destroy status in runPreApplyTriggers? Just so we can detect it after the batch has been applied? Is that just a convenience that is causing this complexity here?

That seems like a valid point. It seems mildly awkward to detect the removal in the side effects since the "pre-apply" desc is gone, but there should be a way to detect this that's less leaky.

On the other hand, we do want to have the destroy status set before committing the batch. After all, lots of callers do things with the replica without the raftMu, and it seems like a bad idea to hand folks a replica whose in-mem state is basically bankrupt.

Contributor

@bdarnell bdarnell left a comment


Reviewed 6 of 14 files at r18, 1 of 5 files at r19, 3 of 5 files at r21, 9 of 10 files at r22, 1 of 1 files at r23.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/storage/client_raft_test.go, line 4679 at r18 (raw file):

Previously, ajwerner wrote…

Yes, clarified. This is describing the 4th case.

s/4rd/4th/


pkg/storage/client_raft_test.go, line 1871 at r23 (raw file):

		mtc.manualClock.Increment(int64(storage.ReplicaGCQueueInactivityThreshold + 1))
		mtc.stores[1].MustForceReplicaGCScanAndProcess()
		testutils.SucceedsSoon(t, func() error {

Most of the time when we need a retry loop/SucceedsSoon around a replica removal, the call to MustForceReplicaGCScanAndProcess needs to be inside the loop.


pkg/storage/replica.go, line 1088 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

When a new term is received (or when a first term is received) this should lead to a HardState being emitted which reflects the new term; there's no difference between learners and voters. A vote can be cast by a learner when it's asked to vote, but this implies that the sender of the message thought it was talking to a voter (and the vote will only matter if the active config says so).

Preemptive snapshots can cause a HardState.Term (and Commit?) to be written. You'll never have a Vote to write unless you were at some point a valid voter (you may not have known that you were a voter at the time you voted, but the node that asked you to vote knew).


pkg/storage/replica.go, line 1090 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

A learner can get asked to vote if it's already a voter and doesn't know it yet (because it hasn't caught up on the log yet). You shouldn't see a learner write down a HardState.Vote unless it has cast a vote, which means someone asked it to vote (and that someone must've thought it was a voter, and used an appropriate replicaID and term for the request).

The argument here sounds right: we'd have a problem if the RHS got promoted and cast a vote that we'd then erase. But for that to happen, it needs to accept a snapshot, which it can't because of the overlapping LHS. So all the RHS can put into its HardState is a Term (and no Commit, because there's no raft log). So we'll wipe out a persisted Term. I went back to the raft thesis to convince myself that the term needs to be made durable, but now I'm not so sure why. When a vote is cast, one definitely has to remember what term it belongs to, but other than that it seems safe to always take the term from the end of the log when trying to become leader; after all terms are not reusable thanks to the persistence of the <term,vote> pairs. I might be missing something, but if not, the argument above convinces me.

I also can't think of why the current term must be persisted if there is no vote or log entries, but it's explicit in etcd/raft.MustSync. Failure to durably persist a term might cause extra election cycles after restarts, but other than that I can't think of a problem with it.


pkg/storage/replica_application_result.go, line 114 at r23 (raw file):

// result will zero-out the struct to ensure that is has properly performed all
// of the implied side-effects.
func (r *Replica) prepareLocalResult(ctx context.Context, cmd *replicatedCmd) (isRemoved bool) {

I don't see any assignments to this isRemoved; the method always returns false.


pkg/storage/replica_application_result.go, line 330 at r23 (raw file):

		// the store will no-op when removing a replica which is already marked as removed
		// unless we set ignoreDestroyStatus to true.
		ignoreDestroyStatus: true,

This appears to be the only use of DestroyData: false (unless there are some that use this because it's the default). Should we just make DestroyData: false imply ignoreDestroyStatus, or could there be other uses of DestroyData:false that don't imply "already destroyed, ignore the flag"?

@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from a6d3c37 to b4b531f Compare September 24, 2019 21:01
Contributor Author

@ajwerner ajwerner left a comment


Thanks for sticking with me y'all. I'm going to try to merge this soon. I'm now hitting failures in stress that exist in master. I'll wait a bit this evening and then bors it. Thanks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)


pkg/storage/client_raft_test.go, line 4679 at r18 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/4rd/4th/

Done.


pkg/storage/client_raft_test.go, line 1871 at r23 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Most of the time when we need a retry loop/SucceedsSoon around a replica removal, the call to MustForceReplicaGCScanAndProcess needs to be inside the loop.

Done. After this change it should happen just due to ReplicaTooOldError, but I did it anyway to avoid confusion.


pkg/storage/replica_application_result.go, line 114 at r23 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't see any assignments to this isRemoved; the method always returns false.

Derp, I meant to remove the return value in the last pass. Done.


pkg/storage/replica_application_result.go, line 330 at r23 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This appears to be the only use of DestroyData: false (unless there are some that use this because it's the default). Should we just make DestroyData: false imply ignoreDestroyStatus, or could there be other uses of DestroyData:false that don't imply "already destroyed, ignore the flag"?

I don't have strong feelings here. I sort of like it to be more explicit.


pkg/storage/replica_destroy.go, line 38 at r21 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Very nice simplification.

Done.


pkg/storage/store.go, line 2560 at r22 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That seems like a valid point. It seems mildly awkward to detect the removal in the side effects since the "pre-apply" desc is gone, but there should be a way to detect this that's less leaky.

On the other hand, we do want to have the destroy status set before committing the batch. After all, lots of callers do things with the replica without the raftMu, and it seems like a bad idea to hand folks a replica whose in-mem state is basically bankrupt.

We definitely need to set the destroyStatus before we commit the batch. The destroyStatus controls whether we process requests.


pkg/storage/store.go, line 2718 at r23 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Two typos

Done.


pkg/storage/store.go, line 3689 at r23 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Stale comment

Done.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID it never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corollary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
@ajwerner ajwerner force-pushed the ajwerner/prevent-replica-id-changes-pr branch from b4b531f to 2020115 Compare September 24, 2019 21:44
Contributor

@bdarnell bdarnell left a comment


Reviewed 3 of 3 files at r24.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)

@ajwerner
Contributor Author

TFTR!

bors r+

craig bot pushed a commit that referenced this pull request Sep 24, 2019
40892:  storage: ensure Replica objects never change replicaID r=ajwerner a=ajwerner

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes #40367.

The first commit is #40889.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

Co-authored-by: Andrew Werner <[email protected]>
@craig
Contributor

craig bot commented Sep 24, 2019

Build succeeded

@craig craig bot merged commit 2020115 into cockroachdb:master Sep 24, 2019
craig bot pushed a commit that referenced this pull request Sep 25, 2019
40907: storage: clean up preemptive snapshot when receiving replica ID as learner r=ajwerner a=ajwerner

This commit adds an annotation to raft request messages to
indicate that the sender believes the current replica is a
learner. If the current replica on the recipient was created as a
preemptive snapshot (it's initialized but not in the range) then
we should remove that replica immediately.

The first commit is #40892.

Release Justification: Release blocker, required for backwards compatibility.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
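
A hedged sketch of the check the commit above describes; the `ToIsLearner` field, `isPreemptiveSnapshot`, and the other names are assumptions made for this example, not the real request schema.

```go
package main

import "fmt"

type raftRequest struct {
	RangeID     int64
	ToReplicaID int32
	// ToIsLearner mirrors the annotation described above: the sender believes
	// the recipient replica is (or should be) a learner.
	ToIsLearner bool
}

type replica struct {
	initialized bool
	inRange     bool // whether the replica appears in its range descriptor
}

// isPreemptiveSnapshot captures "initialized but not in the range".
func isPreemptiveSnapshot(r *replica) bool {
	return r.initialized && !r.inRange
}

func handleRequest(r *replica, req raftRequest) {
	if req.ToIsLearner && isPreemptiveSnapshot(r) {
		// The sender is adding us as a learner, so state left over from a
		// preemptive snapshot is stale and must be cleared first.
		fmt.Println("removing preemptive-snapshot replica before accepting learner state")
		*r = replica{}
		return
	}
	fmt.Println("processing raft request for replica", req.ToReplicaID)
}

func main() {
	r := &replica{initialized: true, inRange: false}
	handleRequest(r, raftRequest{RangeID: 1, ToReplicaID: 4, ToIsLearner: true})
}
```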
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 26, 2019
…atedNextReplicaID

Fixes cockroachdb#41145.

This bug was introduced in cockroachdb#40892.

This may force us to pick a new SHA for the beta. Any ChangeReplicas
Raft entry from 19.1 or before is going to crash a node without it.

Release justification: fixes a crash in mixed version clusters.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 27, 2019
…atedNextReplicaID

Fixes cockroachdb#41145.

This bug was introduced in cockroachdb#40892.

This may force us to pick a new SHA for the beta. Any ChangeReplicas
Raft entry from 19.1 or before is going to crash a node without it.

Release justification: fixes a crash in mixed version clusters.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 27, 2019
41148: storage: don't crash when applying ChangeReplicas trigger with DeprecatedNextReplicaID r=nvanbenschoten a=nvanbenschoten

Fixes #41145.

This bug was introduced in #40892.

This may force us to pick a new SHA for the beta. Any ChangeReplicas
Raft entry from 19.1 or before is going to crash a node without it.

Release justification: fixes a crash in mixed version clusters.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
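
For illustration, a simplified sketch of the mixed-version fallback implied by the fix above: a 19.1-era trigger may carry only the deprecated field, so the apply path has to tolerate it instead of crashing. The struct and field shapes here are stand-ins, and the exact fix may differ.

```go
package main

import "fmt"

type rangeDescriptor struct {
	NextReplicaID int32
}

type changeReplicasTrigger struct {
	Desc                    *rangeDescriptor // populated by newer nodes
	DeprecatedNextReplicaID int32            // what an older (19.1) node proposes
}

// nextReplicaID prefers the new representation but falls back to the
// deprecated field rather than treating its presence as fatal.
func nextReplicaID(t changeReplicasTrigger) (int32, error) {
	if t.Desc != nil {
		return t.Desc.NextReplicaID, nil
	}
	if t.DeprecatedNextReplicaID != 0 {
		return t.DeprecatedNextReplicaID, nil
	}
	return 0, fmt.Errorf("trigger carries no next replica ID")
}

func main() {
	oldEntry := changeReplicasTrigger{DeprecatedNextReplicaID: 7}
	id, err := nextReplicaID(oldEntry)
	fmt.Println(id, err) // 7 <nil>
}
```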
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Oct 1, 2019
Prior to this commit the partitioning tests worked by creating a 3-node cluster
and then expressing constraints over the three nodes. They then validated that
the cluster conformed to the constraints by querying data and examining the
trace to determine which node held the data.

This is problematic, for one, because it is susceptible to cockroachdb#40333. In rare
cases we'll down-replicate to the wrong single node (e.g. if the right one
is not live) and we won't ever fix it.

It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes,
where each expectation in the config can now refer to a leaseholder_preference
rather than a constraint, and the additional nodes are allocated across 3
datacenters.

This larger test creates dramatically more data movement and has been useful
when testing cockroachdb#40892.

The PR also adds a flag to control how many of these subtests to run.

Release justification: Only touches testing and is useful for testing a
release blocker.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Nov 25, 2019
Prior to this commit the partitioning tests worked by creating a 3-node cluster
and then expressing constraints over the three nodes. They then validated that
the cluster conformed to the constraints by querying data and examining the
trace to determine which node held the data.

This is problematic, for one, because it is susceptible to cockroachdb#40333. In rare
cases we'll down-replicate to the wrong single node (e.g. if the right one
is not live) and we won't ever fix it.

It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes,
where each expectation in the config can now refer to a leaseholder_preference
rather than a constraint, and the additional nodes are allocated across 3
datacenters.

This larger test creates dramatically more data movement and has been useful
when testing cockroachdb#40892.

The PR also adds a flag to control how many of these subtests to run.

Release justification: Only touches testing and is useful for testing a
release blocker.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 29, 2021
This is an invariant that was established in [cockroachdb#40892]. We now check
it more aggressively. There is an awkward exception thanks to
uninitialized replicas.

[cockroachdb#40892]: cockroachdb#40892

Release note: None

fixupassert
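
A minimal sketch, assuming the assertion has roughly this shape; the helper name and the exact form of the uninitialized-replica exception mentioned above are guesses for illustration.

```go
package main

import "fmt"

type replicaState struct {
	replicaID   int32
	initialized bool
}

// assertReplicaIDUnchanged enforces "once non-zero, the replica ID never
// changes", while tolerating uninitialized replicas that are still learning
// their ID.
func assertReplicaIDUnchanged(prev, next replicaState) error {
	if prev.replicaID == 0 && !prev.initialized {
		return nil // the awkward exception: an uninitialized replica may acquire an ID
	}
	if prev.replicaID != next.replicaID {
		return fmt.Errorf("replica ID changed from %d to %d", prev.replicaID, next.replicaID)
	}
	return nil
}

func main() {
	fmt.Println(assertReplicaIDUnchanged(
		replicaState{replicaID: 3, initialized: true},
		replicaState{replicaID: 5, initialized: true},
	)) // error: replica ID changed from 3 to 5
}
```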
tbg added a commit to tbg/cockroach that referenced this pull request Oct 29, 2021
This is an invariant that was established in [cockroachdb#40892]. We now check
it more aggressively. There is an awkward exception thanks to
uninitialized replicas.

[cockroachdb#40892]: cockroachdb#40892

Release note: None

fixupassert
tbg added a commit to tbg/cockroach that referenced this pull request Oct 29, 2021
This is an invariant that was established in [cockroachdb#40892]. We now check
it more aggressively. There is an awkward exception thanks to
uninitialized replicas.

[cockroachdb#40892]: cockroachdb#40892

Release note: None

fixupassert