storage: ensure Replica objects never change replicaID #40892
Conversation
c207675 to f579994
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/store.go, line 2368 at r2 (raw file):
batch := rightRepl.Engine().NewWriteOnlyBatch()
defer batch.Close()
if err := clearRangeData(oldRightDesc, rightRepl.Engine(), batch, clearReplicatedOnly, false); err != nil {
This code needs to clear the truncated state too. In practice it should load the hard state, clear everything, and then write back a hard state with a 0 Commit.
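For concreteness, here is a minimal sketch of that suggestion, assuming helpers with roughly the signatures quoted elsewhere in this thread. The function name clearRemovedRHS and the clearAll option are hypothetical stand-ins, not the PR's final code:

```go
package storage

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/storage/stateloader"
)

// clearRemovedRHS loads the existing HardState, clears all of the range's
// data (including the truncated state), and writes the HardState back with a
// zero Commit index, since the cleared log no longer contains any committed
// entries.
func clearRemovedRHS(
	ctx context.Context, eng engine.Engine, desc *roachpb.RangeDescriptor,
) error {
	sl := stateloader.Make(desc.RangeID)
	hs, err := sl.LoadHardState(ctx, eng)
	if err != nil {
		return err
	}
	batch := eng.NewWriteOnlyBatch()
	defer batch.Close()
	// clearAll is a stand-in for whichever clearRangeOption clears both the
	// replicated and range-ID local keyspaces.
	if err := clearRangeData(desc, eng, batch, clearAll, false /* mustClearRange */); err != nil {
		return err
	}
	hs.Commit = 0
	if err := sl.SetHardState(ctx, batch, hs); err != nil {
		return err
	}
	return batch.Commit(true /* sync */)
}
```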
18d895a to b234edb
This is RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/client_raft_test.go, line 4785 at r3 (raw file):
// is sent first (we block raft snapshots from being sent while sending
// the learner snapshot but not the other way around).
testutils.SucceedsSoon(t, func() error {
It seems like #40435 should have dealt with it. Perhaps it's because the raft snapshot is occurring from the raft leader which isn't the leaseholder.
b234edb to dbb0616
Getting pulled into an interview so I'll leave this here. Made it most of the way through non-testing changes.
Reviewed 12 of 27 files at r2, 6 of 7 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/client_test.go, line 1246 at r3 (raw file):
}
func (m *multiTestContext) waitForUnreplicated(rangeID roachpb.RangeID, dest int) error {
Give this method a comment.
pkg/storage/queue.go, line 670 at r3 (raw file):
log.Infof(ctx, "adding: priority=%0.3f", priority) }
stray line
pkg/storage/replica.go, line 1056 at r3 (raw file):
return r.mu.replicaID
}
return r.mu.minReplicaID
Could you rename minReplicaID to ensure that we're always using this when we should be?
pkg/storage/replica_application_result.go, line 300 at r3 (raw file):
) (changeRemovedReplica bool) {
// If this command removes us then we need to go through the process of
// removing our replica from the store. After this method returns the code
nit: add a comma after "returns"
pkg/storage/replica_application_result.go, line 316 at r3 (raw file):
log.Infof(ctx, "removing replica due to ChangeReplicasTrigger: %v", chng) } if err := r.postDestroyRaftMuLocked(ctx, r.GetMVCCStats()); err != nil {
Just like in destroyRaftMuLocked, we need to make sure that the removal has been durably sync'ed before we get here, right? We can plumb this logic up and sync when we apply the application batch, but it would probably be easiest to just sync the RocksDB WAL directly right before calling postDestroyRaftMuLocked.
pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):
return false
}
_, existsInChange := change.Desc.GetReplicaDescriptor(storeID)
I think this missed my comment from the other PR. This isn't backward compatible.
pkg/storage/replica_application_state_machine.go, line 595 at r3 (raw file):
const mustClearRange = false
if err := rhsRepl.preDestroyRaftMuLocked(
	ctx, b.batch, b.batch, roachpb.ReplicaID(math.MaxInt32), clearRangeIDLocalOnly, mustClearRange,
Can we pull math.MaxInt32 into a constant somewhere and use it across all calls?
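For reference, this is the constant it became; the declaration appears verbatim later in this thread in replica_destroy.go:

```go
// mergedTombstoneReplicaID is written as the next replica ID of a tombstone
// when a range is merged away: no new replica of a merged range can ever be
// created, so any value above all assigned replica IDs works, and MaxInt32
// makes that explicit.
const mergedTombstoneReplicaID roachpb.ReplicaID = math.MaxInt32
```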
pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):
// We'll complete the removal when it commits. Later logic detects the
// removal by inspecting the destroy status
if change := res.ChangeReplicas; change != nil &&
Pull this up below the res.Merge case.
pkg/storage/replica_application_state_machine.go, line 645 at r3 (raw file):
b.r.mu.Unlock()
b.changeRemovesReplica = true
const mustUseRangeDeletionTombstone = true
Why? If we don't have a good reason for this then set it to false.
pkg/storage/replica_application_state_machine.go, line 793 at r3 (raw file):
// have been applied as of this batch. It also records the Range's mvcc stats.
func (b *replicaAppBatch) addAppliedStateKeyToBatch(ctx context.Context) error {
	if b.changeRemovesReplica {
Give this a comment.
pkg/storage/replica_application_state_machine.go, line 933 at r3 (raw file):
// and the first range has a lease that there will not be a later hard-state.
if shouldAssert {
	sm.r.mu.Lock()
nit: move back
pkg/storage/replica_application_state_machine.go, line 1026 at r3 (raw file):
shouldAssert = !rResult.Equal(storagepb.ReplicatedEvalResult{})
if !shouldAssert {
	return false, sm.batch.changeRemovesReplica
return false, false
We shouldn't be referencing the sm.batch. It's only a member of the state machine to avoid allocations. Also, we know this is false because otherwise rResult.ChangeReplicas != nil.
pkg/storage/replica_application_state_machine.go, line 1094 at r3 (raw file):
return nil
}
// If we're removed then we know that must have happened due to this
Why is this ever possible? If we were removed then how would we be here calling maybeApplyConfChange? Can't we assert that this is false?
pkg/storage/replica_destroy.go, line 71 at r3 (raw file):
// RemovalPending returns whether the replica is removed or in the process of
// being removed.
func (s destroyStatus) RemovalPending() bool {
RemovingOrRemoved? This name implies that this is implemented as return s.reason == destroyReasonRemovalPending.
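A sketch of the suggested rename; destroyReasonRemoved is an assumption here, since the thread only names destroyReasonRemovalPending explicitly:

```go
// RemovingOrRemoved reports whether the replica is being removed or has
// already been removed, rather than implying a single-reason check.
func (s destroyStatus) RemovingOrRemoved() bool {
	return s.reason == destroyReasonRemovalPending || s.reason == destroyReasonRemoved
}
```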
pkg/storage/replica_destroy.go, line 145 at r3 (raw file):
// Clear the range ID local data including the hard state.
// We don't know about any user data so we can't clear any user data.
// See the comment on this method for why this is safe.
Make a note that this is the only real reason why this method is separate from Replica.destroyRaftMuLocked.
pkg/storage/replica_destroy.go, line 166 at r3 (raw file):
}
if r.raftMu.sideloaded != nil {
Is it worth adjusting postDestroyRaftMuLocked to not suggest a compaction when the stats are empty and then just calling that here?
pkg/storage/replica_gc_queue.go, line 290 at r3 (raw file):
nextReplicaID := replyDesc.NextReplicaID
if currentMember {
// If we're a member of the current range descriptor then we must not put
Why don't we just always put a tombstone down at currentDesc.NextReplicaID? Depending on the answer to that question, should this instead be:
if currentMember {
nextReplicaID--
}
pkg/storage/replica_gc_queue.go, line 292 at r3 (raw file):
// If we're a member of the current range descriptor then we must not put
// down a tombstone at replyDesc.NextReplicaID as that would prevent us
// from getting a snapshot and becoming a part of the rang.e
"rang.e"
pkg/storage/replica_init.go, line 171 at r3 (raw file):
log.Fatalf(ctx, "cannot set replica ID from anything other than 0, currently %d", r.mu.replicaID) // The common case: the replica ID is unchanged.
Outdated comment.
pkg/storage/replica_init.go, line 249 at r3 (raw file):
// not yet been initialized. If not, it is created and set to campaign
// if this replica is the most recent owner of the range lease.
func (r *Replica) maybeInitializeRaftGroup(ctx context.Context) {
Leave a comment about why we don't do anything if the replica is already destroyed. Something about how we will check this later on.
pkg/storage/replica_raftstorage.go, line 684 at r4 (raw file):
const (
	clearRangeIDLocalOnly clearRangeOption = iota
It would be helpful to have an example of why each of these would be used.
pkg/storage/replica_raftstorage.go, line 690 at r4 (raw file):
// clearRangeData clears the data associated with a range descriptor. If
// rangeIDLocalOnly is true, then only the range-id local keys are deleted.
This comment is now stale.
pkg/storage/store.go, line 2143 at r3 (raw file):
} else if ok {
log.Infof(ctx, "split trigger found right-hand side with tombstone %+v, "+
	"RHS must have been removed: %v", tombstone, rightRng)
How is this case handled? I thought we were supposed to fall into the same case as if rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID, but it's not immediately obvious that it does. Could you call this out?
pkg/storage/store.go, line 2155 at r3 (raw file):
}
}
// If we're in this case then we know that the RHS has since been removed
nit: move this comment into the if since it's about "this case".
pkg/storage/store.go, line 2278 at r3 (raw file):
// of a Raft command.
//
// The funky case is if rightDesc is non-nil. If so then we know that the right
s/rightDesc/oldRightDesc/
Also, is this the best name? Why is it "old"? Why does splitPostApply pass the descriptor from the SplitTrigger, but only sometimes?
pkg/storage/store.go, line 2279 at r3 (raw file):
//
// The funky case is if rightDesc is non-nil. If so then we know that the right
// replica has been removed and re-added before applying this split and we need
"re-added as an uninitialized replica"
pkg/storage/store.go, line 2381 at r3 (raw file):
return errors.Wrap(err, "failed to set hard state with 0 commit index for removed rhs") } if err := batch.Commit(true); err != nil {
This is all in the splitPostApply, right? So what would happen if the node crashed before getting here but after its split application batch had made it to disk?
pkg/storage/store.go, line 4175 at r3 (raw file):
// sender that they may no longer exist.
err = roachpb.NewRangeNotFoundError(rangeID, s.StoreID())
} else if replicaID != 0 && repl.mu.replicaID != replicaID {
I don't understand this case. Why is this unexpected? Because we would have caught this in handleToReplicaTooOld?
dbb0616 to 9cd9bee
Only made it through half of your comments before heading out to the scavenger hunt. Will finish after.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/client_test.go, line 1246 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this method a comment.
Done.
pkg/storage/queue.go, line 670 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
stray line
Done.
pkg/storage/replica.go, line 1056 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you rename minReplicaID to ensure that we're always using this when we should be?
Renamed to tombstoneReplicaID
pkg/storage/replica_application_result.go, line 300 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: add a comma after "returns"
Done.
pkg/storage/replica_application_result.go, line 316 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Just like in destroyRaftMuLocked, we need to make sure that the removal has been durably sync'ed before we get here, right? We can plumb this logic up and sync when we apply the application batch, but it would probably be easiest to just sync the RocksDB WAL directly right before calling postDestroyRaftMuLocked.
It was easy enough to move the sync to when we commit the batch.
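A sketch of that resolution, reusing names from the thread (b.batch, changeRemovesReplica); the surrounding control flow is an assumption:

```go
// Commit the application batch, syncing when the batch removes our replica so
// that the removal is durable before postDestroyRaftMuLocked runs.
sync := b.changeRemovesReplica
if err := b.batch.Commit(sync); err != nil {
	return errors.Wrap(err, "unable to commit Raft application batch")
}
```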
pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think this missed my comment from the other PR. This isn't backward compatible.
Done.
pkg/storage/replica_application_state_machine.go, line 595 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we pull math.MaxInt32 into a constant somewhere and use it across all calls?
Done.
pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Pull this up below the res.Merge case.
It feels like logically the truncated state change could write to the batch. Why just below merge?
pkg/storage/replica_application_state_machine.go, line 645 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why? If we don't have a good reason for this then set it to false.
Done.
pkg/storage/replica_application_state_machine.go, line 793 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment.
Lifted it instead and commented there.
pkg/storage/replica_application_state_machine.go, line 933 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move back
Done.
pkg/storage/replica_application_state_machine.go, line 1026 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
return false, false
We shouldn't be referencing the sm.batch. It's only a member of the state machine to avoid allocations. Also, we know this is false because otherwise rResult.ChangeReplicas != nil.
Done.
pkg/storage/replica_application_state_machine.go, line 1094 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this ever possible? If we were removed then how would we be here calling maybeApplyConfChange? Can't we assert that this is false?
Comment is stale, there is an assertion below. Updated.
pkg/storage/replica_destroy.go, line 71 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
RemovingOrRemoved? This name implies that this is implemented as return s.reason == destroyReasonRemovalPending.
Done.
pkg/storage/replica_destroy.go, line 166 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it worth adjusting postDestroyRaftMuLocked to not suggest a compaction when the stats are empty and then just calling that here?
It probably is.
52436cc to b11044b
TFTR! I moved the nasty hacks that had been added to SplitRange into splitPreApply and I think it's better.
It's ready for another look.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/replica_destroy.go, line 145 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Make a note that this is the only real reason why this method is separate from Replica.destroyRaftMuLocked.
I've folded the whole thing up into destroyReplicaRaftMuLocked; I think it's better but I could go either way.
pkg/storage/replica_gc_queue.go, line 290 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why don't we just always put a tombstone down at currentDesc.NextReplicaID? Depending on the answer to that question, should this instead be:
if currentMember {
	nextReplicaID--
}
My bad, this is stale; this can never be the case. Notice in the conditional above that we cannot be a current member. This is literally the not-current-member case.
pkg/storage/replica_gc_queue.go, line 292 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"rang.e"
Gone.
pkg/storage/replica_init.go, line 171 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Outdated comment.
Done.
pkg/storage/replica_init.go, line 249 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Leave a comment about why we don't do anything if the replica is already destroyed. Something about how we will check this later on.
I added a comment but it could be improved.
pkg/storage/replica_raftstorage.go, line 684 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It would be helpful to have an example of why each of these would be used.
Removed.
pkg/storage/replica_raftstorage.go, line 690 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This comment is now stale.
not anymore 😉
pkg/storage/store.go, line 2143 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How is this case handled? I thought we were supposed to fall into the same case as if rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID, but it's not immediately obvious that it does. Could you call this out?
It is, it's a subset of that case. I don't know why we're reading the tombstone anymore. I removed it.
pkg/storage/store.go, line 2278 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/rightDesc/oldRightDesc/
Also, is this the best name? Why is it "old"? Why does splitPostApply pass the descriptor from the SplitTrigger, but only sometimes?
Removed.
pkg/storage/store.go, line 2279 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"re-added as an uninitialized replica"
Removed.
pkg/storage/store.go, line 2381 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is all in the splitPostApply, right? So what would happen if the node crashed before getting here but after its split application batch had made it to disk?
Oof, that's totally right. I was sort of worried about such things myself. I moved this whole set of cases into splitPreApply so that they happen atomically with the batch. I think it generally improves the change.
pkg/storage/store.go, line 4175 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't understand this case. Why is this unexpected? Because we would have caught this in handleToReplicaTooOld?
Yes, that or the above case.
Reviewed 8 of 27 files at r2, 1 of 1 files at r4, 15 of 15 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)
pkg/storage/client_raft_test.go, line 3004 at r5 (raw file):
// partition node 0 from the others.
partRange, err := setupPartitionedRange(mtc, 1, 0)
require.NoError(t, err)
nit: add a new line after this so that it's clear that this is still setup and not actually partitioning yet.
pkg/storage/client_raft_test.go, line 4618 at r5 (raw file):
"split and it": make it a little clearer that these are two different cases.
pkg/storage/client_raft_test.go, line 4748 at r5 (raw file):
require.True(t, store0Exists)
// Now we should be able to have the RHS in
missing part of this comment
pkg/storage/client_raft_test.go, line 4750 at r5 (raw file):
// Now we should be able to have the RHS in
require.NoError(t, changeReplicas(roachpb.REMOVE_REPLICA, keyB, 0))
require.NotNil(t, changeReplicas(roachpb.ADD_REPLICA, keyB, 0))
Could you add a // unsuccessful here? The bottom few lines assume that the replica was not re-added here, but scanning through this test indicates the opposite.
pkg/storage/client_raft_test.go, line 4753 at r5 (raw file):
// Read the HardState of the RHS to ensure that we've written a tombstone
// with our old replica ID and then confirm that we've voted.
Why would we automatically vote? We didn't call an election or anything, right?
pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):
Previously, ajwerner wrote…
Done.
Think it's worth keeping the first case at all? This isn't performance-critical whatsoever.
pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):
Previously, ajwerner wrote…
It feels like logically the truncated state change could write to the batch. Why just below merge?
I was thinking we should keep the Range configuration triggers together. No strong opinion though.
Are you worried about a command that truncates AND removes the replica?
pkg/storage/replica_application_state_machine.go, line 609 at r5 (raw file):
}
const mustClearRange = false
const clearRangeIDLocalOnly = true
nit: re-order with above
pkg/storage/replica_application_state_machine.go, line 760 at r5 (raw file):
// the applied state is stored in this batch, ensure that if the batch ends
// up not being durably committed then the entries in this batch will be
// applied again upon startup. If we're removing the replica's data then we
"However, if we're removing ..."
pkg/storage/replica_destroy.go, line 145 at r3 (raw file):
Previously, ajwerner wrote…
I've folded the whole thing up into
destroyReplicaRaftMuLocked
, I think it's better but I could go either way.
Yeah, that's better.
pkg/storage/replica_destroy.go, line 82 at r5 (raw file):
// know new replicas can never be created so this value is used even if we
// don't know the current replica ID.
const mergedTombstoneReplicaID roachpb.ReplicaID = math.MaxInt32
We should also adopt part of the comment from #40814 (review) about this not being required for correctness.
Also, are you planning on merging that PR?
pkg/storage/replica_gc_queue.go, line 225 at r5 (raw file):
// Maybe it was deleted "under us" by being moved. currentDesc, currentMember := replyDesc.GetReplicaDescriptor(repl.store.StoreID()) if desc.RangeID == replyDesc.RangeID && currentMember {
nit: not your change, but pull desc.RangeID == replyDesc.RangeID into a variable so it's more clear that the second case is the sameRange && !currentMember case.
pkg/storage/replica_learner_test.go, line 439 at r5 (raw file):
tc.RemoveReplicasOrFatal(t, scratchStartKey, tc.Target(1)) // remove learner
// We need to wait until the LHS range has been GC'd on server 1 before the
// RHS can be successfully added. Wait for the RHS to exist.
What are the LHS and RHS in this test? Are these supposed to be referring to the first and second replica on node 2?
pkg/storage/replica_learner_test.go, line 564 at r5 (raw file):
// started.
close(blockSnapshotsCh)
if err := g.Wait(); !testutils.IsError(err, `descriptor changed|raft group deleted`) {
Why is this now possible?
pkg/storage/replica_proposal_buf.go, line 628 at r5 (raw file):
})
if isRemoved {
	err = rp.mu.destroyStatus.err
We already check this before calling withGroupLocked on all code paths (for correctness), so I think you can make this fatal.
pkg/storage/store.go, line 2118 at r5 (raw file):
// Check to make sure we don't know that the RHS has already been removed
// from this store at the replica ID implied by this split.
if !hasRightDesc || rightRepl.minReplicaIDRLocked() <= rightDesc.ReplicaID {
Two questions:
1. Is the rightRepl.minReplicaIDRLocked() < rightDesc.ReplicaID case even possible? How could we have ever seen a replica with a lower replica ID than the one our store was originally assigned at the time of the split?
2. In the case where rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID and the replica was already on this store and then removed, will we be violating its raft tombstone by allowing the creation of this replica? Is that ok?
pkg/storage/store.go, line 2154 at r5 (raw file):
hs.Commit = 0
const mustUseClearRange = false
const rangeIDLocalOnly = false
nit: wrong order
pkg/storage/store.go, line 2186 at r5 (raw file):
// return.
rightDesc, hasRightDesc := split.RightDesc.GetReplicaDescriptor(r.StoreID())
if hasRightDesc && rightRng.minReplicaIDRLocked() > rightDesc.ReplicaID {
Mind calling out the !hasRightDesc case separately? It's pretty subtle.
pkg/storage/store_snapshot.go, line 741 at r5 (raw file):
func (s *Store) shouldAcceptSnapshotData(
	ctx context.Context, snapHeader *SnapshotRequest_Header,
) (err error) {
nit: why was this necessary?
pkg/storage/store_test.go, line 573 at r5 (raw file):
DestroyData: true,
}); err != nil {
t.Fatalf("didn't expect error re-removing same range: %v", err)
Why did this behavior change?
b11044b to 3c642a7
Thanks for such a quick next pass! Should be RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/client_raft_test.go, line 3004 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: add a new line after this so that it's clear that this is still setup and not actually partitioning yet.
Also added an argument to the setup call to control whether it starts out active.
pkg/storage/client_raft_test.go, line 4618 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"split and it": make it a little clearer that these are two different cases.
Done. Added (1) and (2) to show how they correspond to the comment below.
pkg/storage/client_raft_test.go, line 4748 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
missing part of this comment
Added more commentary.
pkg/storage/client_raft_test.go, line 4750 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a // unsuccessful here? The bottom few lines assume that the replica was not re-added here, but scanning through this test indicates the opposite.
Done.
pkg/storage/client_raft_test.go, line 4753 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why would we automatically vote? We didn't call an election or anything, right?
Voted is the wrong language. Furthermore, we really won't ever have voted in today's implementation, because nobody would promote the learner without successfully sending a learner snapshot, but the code is trying to be safe even if we did promote that RHS. I think the comment was aspirational.
Perhaps there should be another version of this test which goes and promotes the RHS to a voter and then calls an election and validates that the votes survive a split. Thoughts?
pkg/storage/replica_application_state_machine.go, line 531 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Think it's worth keeping the first case at all? This isn't performance-critical whatsoever.
VOTER_OUTGOING is in the descriptor but ends up in removed below. Don't we want to wait until we're actually out of the descriptor?
pkg/storage/replica_application_state_machine.go, line 629 at r3 (raw file):
Are you worried about a command that truncates AND removes the replica?
Yes.
pkg/storage/replica_application_state_machine.go, line 609 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: re-order with above
Done.
pkg/storage/replica_application_state_machine.go, line 760 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"However, if we're removing ..."
Done.
pkg/storage/replica_destroy.go, line 82 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should also adopt part of the comment from #40814 (review) about this not being required for correctness.
Also, are you planning on merging that PR?
Yes, just bors'd it. Will rebase.
pkg/storage/replica_gc_queue.go, line 225 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: not your change, but pull desc.RangeID == replyDesc.RangeID into a variable so it's more clear that the second case is the sameRange && !currentMember case.
Done.
pkg/storage/replica_learner_test.go, line 439 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What are the LHS and RHS in this test? Are these supposed to be referring to the first and second replica on node 2?
This is garbage now and was always a garbage comment. Removed.
pkg/storage/replica_learner_test.go, line 564 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this now possible?
Added commentary.
pkg/storage/replica_proposal_buf.go, line 628 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We already check this before calling withGroupLocked on all code paths (for correctness), so I think you can make this fatal.
Done.
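A sketch of what making this fatal could look like; the log message and surrounding names are assumptions:

```go
// Finding the replica removed here is a programming error: every path into
// withGroupLocked has already checked the destroy status.
if isRemoved {
	log.Fatalf(ctx, "proposal buffer flushing into removed replica: %v", rp.mu.destroyStatus)
}
```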
pkg/storage/store.go, line 2118 at r5 (raw file):
1. Is the rightRepl.minReplicaIDRLocked() < rightDesc.ReplicaID case even possible? How could we have ever seen a replica with a lower replica ID than the one our store was originally assigned at the time of the split?

No.

2. In the case where rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID and the replica was already on this store and then removed, will we be violating its raft tombstone by allowing the creation of this replica? Is that ok?

rightRepl.minReplicaIDRLocked() == rightDesc.ReplicaID will happen if we've received messages for the RHS at that replica ID but never laid down a tombstone. I think it's a happy case.

Nevertheless, I'm actually not convinced that it's doing the right thing. Even if we don't have a right range descriptor, if we have a tombstone we know that it must be for a replica ID above the split replica ID.

The bad case (which seems unlikely but perhaps possible) is if this follower is partitioned for long enough that the first raft message it receives for the RHS is addressed to the later replica. It won't know that it needs to write a tombstone, so it won't, but it will have a replica ID above the split replica ID. I guess the worst case here is if we then crash and forget about that higher replica ID. That was the case I was trying to deal with here. Technically that case would be safe, but we would move backwards in terms of replica ID.
I've rolled back this minReplicaIDRLocked() method and made the logic more explicit.
One thing we could do is write a tombstone at the current replica ID when setting the replica ID from zero. I'm adding testing for these cases.
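A minimal sketch of that idea, assuming the raft tombstone helpers of this era of the codebase (keys.RaftTombstoneKey, roachpb.RaftTombstone); writeTombstone itself is a hypothetical helper, not the PR's code:

```go
package storage

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// writeTombstone persists a tombstone at nextReplicaID so that, even across
// a crash, this store never accepts the same range at a lower replica ID.
func writeTombstone(
	ctx context.Context,
	eng engine.ReadWriter,
	rangeID roachpb.RangeID,
	nextReplicaID roachpb.ReplicaID,
) error {
	tombstoneKey := keys.RaftTombstoneKey(rangeID)
	tombstone := &roachpb.RaftTombstone{NextReplicaID: nextReplicaID}
	return engine.MVCCPutProto(
		ctx, eng, nil /* ms */, tombstoneKey, hlc.Timestamp{}, nil /* txn */, tombstone,
	)
}
```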
pkg/storage/store.go, line 2154 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: wrong order
Done.
pkg/storage/store.go, line 2186 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mind calling out the !hasRightDesc case separately? It's pretty subtle.
I've moved that checking into a method which has more commentary.
pkg/storage/store_snapshot.go, line 741 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: why was this necessary?
Removed.
pkg/storage/store_test.go, line 573 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why did this behavior change?
It used to be the case that we'd only remove replicas via the queue. Now we'll remove replicas on different paths. We make removal idempotent with the destroyStatus.
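A sketch of that idempotency, reusing names from the thread (destroyStatus, ignoreDestroyStatus, destroyReasonRemovalPending); the exact control flow is an assumption:

```go
// Removal paths first consult the destroy status under the mutex and no-op
// if the replica is already being removed, unless the caller explicitly opts
// out via ignoreDestroyStatus.
r.mu.Lock()
if r.mu.destroyStatus.RemovingOrRemoved() && !opts.ignoreDestroyStatus {
	r.mu.Unlock()
	return nil // already removed; removal is idempotent
}
r.mu.destroyStatus.Set(
	roachpb.NewRangeNotFoundError(r.RangeID, r.StoreID()),
	destroyReasonRemovalPending,
)
r.mu.Unlock()
```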
aacc284 to 65f9e93
Reviewed 1 of 6 files at r1, 10 of 27 files at r2, 2 of 7 files at r3, 9 of 15 files at r5, 8 of 9 files at r6, 5 of 5 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/client_raft_test.go, line 4634 at r6 (raw file):
// HardState.
//
// * Before the r20 processes the split r21 is removed and re-added to
Side note about testing: Before atomic replica changes, this kind of remove and re-add scenario happened fairly frequently as the replica queue came to different conclusions on multiple iterations. Atomic replica changes (hopefully) make that less common, which also makes it harder to test for things like this. Maybe we should add an explicit replica-thrashing mode for testing purposes.
pkg/storage/client_raft_test.go, line 4708 at r6 (raw file):
// Make the transport flaky for the range in question to encourage proposals
// to be sent more times and ultimately traced more.
What do you mean by "traced more"?
pkg/storage/client_raft_test.go, line 4724 at r6 (raw file):
// Now we partition a replica's LHS, perform a split, removes the RHS, add it
// back and make sure it votes. Then we'll do the two different cases of
Where do we make sure it votes?
pkg/storage/client_raft_test.go, line 4749 at r6 (raw file):
// Remove and re-add the RHS to create a new uninitialized replica at
// a higher replica ID. This will lead to a tombstone being written.
How does the tombstone get written while the partition is still active?
pkg/storage/client_raft_test.go, line 4754 at r6 (raw file):
// and will be rolled back. Nevertheless it will have learned that it
// has been removed at the old replica ID.
require.NotNil(t, changeReplicas(roachpb.ADD_REPLICA, keyB, 0))
Can we verify the specific expected error here?
pkg/storage/merge_queue.go, line 294 at r7 (raw file):
return err
}
This file only has whitespace changes now.
pkg/storage/replica_application_result.go, line 298 at r7 (raw file):
func (r *Replica) handleChangeReplicasResult(
	ctx context.Context, chng *storagepb.ChangeReplicas,
) (changeRemovedReplica bool) {
Looks like this return value actually means that the replica was not removed.
pkg/storage/replica_raft.go, line 1343 at r7 (raw file):
//
// If this Replica is in the process of being removed this method will return
// true and a nil error.
So true with a non-nil error is impossible? This seems like a special error value might have been a better design.
pkg/storage/store.go, line 2105 at r7 (raw file):
}
// Below is just a sanity check.
// The only time that this case can happen is if we are currently not
This comment is unclear (what part of the two if statements is "this case"?)
pkg/storage/store.go, line 2117 at r7 (raw file):
}
// Check to make sure we don't know that the RHS has already been removed
The negation here is a little confusing. Add a comment that this is the common case (and I think it might be more clear if they were switched so the uncommon case were the indented conditional).
pkg/storage/store.go, line 2140 at r7 (raw file):
// We might however have already voted at a higher term. In general
// this shouldn't happen because we add learners and then promote them
// only after we snapshot but for the sake of mixed version clusters and
Can we add a TODO to remove this in the next release, or will we need to keep this around longer?
pkg/storage/store.go, line 2144 at r7 (raw file):
// old RHS and have SplitRange clear its data.
//
// We need to clear the data which the RHS would have inherited.
Inherited from what? What exactly is this data? How does the split trigger's write of raft state interact with the write here?
pkg/storage/store.go, line 2666 at r7 (raw file):
if !ds.RemovingOrRemoved() {
log.Fatalf(ctx, "cannot remove uninitialized replica which is not removal pending: %v", ds)
Should this check for exactly "pending" instead of RemovingOrRemoved?
pkg/storage/store.go, line 3477 at r7 (raw file):
if !drop {
isRemoved, err := r.stepRaftGroup(req)
if isRemoved {
It looks like all callers of stepRaftGroup do this; maybe it should just return this error when isRemoved is true?
pkg/storage/store_test.go, line 2958 at r7 (raw file):
// Test that we set proper tombstones for removed replicas and use the
// tombstone to reject attempts to create a replica with a lesser ID.
func TestRemovedReplicaTombstone(t *testing.T) {
Do we have another test that covers the tombstone functionality now?
Prior to this commit the partitioning tests worked by creating a 3 node cluster and then expressed constraints over the three nodes. It then validates that the cluster conforms to the constraints by querying data and examining the trace to determine which node held the data.

This is problematic for one because it is susceptible to cockroachdb#40333. In rare cases we'll down-replicate to the wrong single node (e.g. if the right one is not live) and we won't ever fix it. It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes where each expectation in the config can now refer to a leaseholder_preference rather than a constraint and we'll allocate the additional nodes to 3 datacenters. This larger test creates dramatically more data movement and has been useful when testing cockroachdb#40892.

Release justification: Only touches testing and is useful for testing a release blocker.

Release note: None
65f9e93 to 7fd2f9a
TFTR! Next set of revisions posted. I'm typing up more testing for the tombstone cases. I'm also curious about how y'all feel re: writing a tombstone in getOrCreateReplica when creating an uninitialized replica with a non-zero replica ID. That would provide the invariant that a store will always have a monotonically increasing set of non-zero replica IDs for a given range, even across restart.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/client_raft_test.go, line 4708 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What do you mean by "traced more"?
Sloppy remnant of code from which this was copied. Removed.
pkg/storage/client_raft_test.go, line 4724 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Where do we make sure it votes?
This was aspirational before I wrote the test. Removed.
This test might be better if it actually went, at a low level, through the process of adding the node to the group as a voter even if the snapshot failed (which it always will). There's code in this PR that makes such an addition safe.
pkg/storage/client_raft_test.go, line 4749 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
How does the tombstone get written while the partition is still active?
The partition only applies to messages for the LHS. Added some more comments to make it clearer.
pkg/storage/client_raft_test.go, line 4754 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Can we verify the specific expected error here?
Done.
pkg/storage/merge_queue.go, line 294 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This file only has whitespace changes now.
Done.
pkg/storage/replica_application_result.go, line 298 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Looks like this return value actually means that the replica was not removed.
I think it's returning the right thing. We set the destroyStatus to destroyReasonRemovalPending in runPreApplyTriggers so if it's anything other than that then we return. The reason we don't check for alive is we may be in destroyReasonMergePending, in which case we can process more raft requests (though I'm not certain we should). I tried to make the comment clearer. I also updated the testing knob to ensure that we also don't mark the destroyStatus so we just continue to process the remove like before this change. This allowed me to roll back some changes to TestRemoveRangeWithoutGC.
pkg/storage/replica_raft.go, line 1343 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
So true with a non-nil error is impossible? This seems like a special error value might have been a better design.
Done.
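The adopted design, sketched as a sentinel error; treat the exact name and placement as assumptions, though a sentinel of this shape did land around this change:

```go
// errRemoved is a sentinel returned by raft-group operations once the replica
// has been removed; callers compare against it instead of threading a
// separate isRemoved boolean through every return path.
var errRemoved = errors.New("replica removed")
```

Callers can then collapse the separate boolean check into an ordinary error comparison on the return of stepRaftGroup and friends.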
pkg/storage/store.go, line 2105 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This comment is unclear (what part of the two if statements is "this case"?)
Tried to clarify.
pkg/storage/store.go, line 2117 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The negation here is a little confusing. Add a comment that this is the common case (and I think it might be more clear if they were switched so the uncommon case were the indented conditional).
Done.
pkg/storage/store.go, line 2140 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Can we add a TODO to remove this in the next release, or will we need to keep this around longer?
This code needs to exist forever I believe. It has nothing to do with pre-emptive snapshots. I guess the mixed version here was with regards to future versions. I tried to make it clearer, see how you feel about it now.
pkg/storage/store.go, line 2144 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Inherited from what? What exactly is this data? How does the split trigger's write of raft state interact with the write here?
The split trigger above raft synthesizes a number of keys for the RHS, the most problematic of which is the truncated state. It was easier to delete everything than to delete everything but the hard state. We could probably assert that the Commit is 0 rather than setting it. I've tried to improve the comments.
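The assertion floated here, sketched; hs stands for the HardState synthesized for the RHS, and the message text is invented:

```go
// The split trigger never writes raft log entries for the RHS, so the
// HardState it synthesizes should already carry a zero Commit index.
if hs.Commit != 0 {
	log.Fatalf(ctx, "split's RHS has unexpected nonzero commit index %d", hs.Commit)
}
```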
pkg/storage/store.go, line 2666 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Should this check for exactly "pending" instead of RemovingOrRemoved?
It effectively does because of the above conditional but I've made it explicit.
pkg/storage/store.go, line 3477 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
It looks like all callers of stepRaftGroup do this; maybe it should just return this error when isRemoved is true?
Done.
pkg/storage/store_test.go, line 2958 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Do we have another test that covers the tombstone functionality now?
Not yet. Will be typing that up now for the next revision. Just wanted to get the first set of updates up first.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @bdarnell)
pkg/storage/replica_destroy.go, line 38 at r21 (raw file):
// destroy a replica (e.g. processing a ChangeReplicasTrigger removing the // range and receiving a raft message with a higher replica ID). destroyReasonRemovalPending
Very nice simplification.
pkg/storage/store.go, line 2718 at r23 (raw file):
}
// removeUninitialzedReplicaRaftMuLocked removes an uninitialzied replica.
Two typos
pkg/storage/store.go, line 3689 at r23 (raw file):
// NB: This response may imply that this range has been added back to // this store at a higher replica ID. We set RangeNotFound error despite // the fact that this newer replica may have already been created
Stale comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @nvanbenschoten)
pkg/storage/store.go, line 2560 at r22 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't want to hold the review up on this, but why do we need to set the destroy status in runPreApplyTriggers? Just so we can detect it after the batch has been applied? Is that just a convenience that is causing this complexity here?
That seems like a valid point. It seems mildly awkward to detect the removal in the side effects since the "pre-apply" desc is gone, but there should be a way to detect this that's less leaky.
On the other hand, we do want to have the destroy status set before committing the batch. After all, lots of callers do things with the replica without the raftMu, and it seems like a bad idea to hand folks a replica whose in-mem state is basically bankrupt.
Reviewed 6 of 14 files at r18, 1 of 5 files at r19, 3 of 5 files at r21, 9 of 10 files at r22, 1 of 1 files at r23.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/storage/client_raft_test.go, line 4679 at r18 (raw file):
Previously, ajwerner wrote…
Yes, clarified. This is describing the 4th case.
s/4rd/4th/
pkg/storage/client_raft_test.go, line 1871 at r23 (raw file):
mtc.manualClock.Increment(int64(storage.ReplicaGCQueueInactivityThreshold + 1))
mtc.stores[1].MustForceReplicaGCScanAndProcess()
testutils.SucceedsSoon(t, func() error {
Most of the time when we need a retry loop/SucceedsSoon around a replica removal, the call to MustForceReplicaGCScanAndProcess needs to be inside the loop.
pkg/storage/replica.go, line 1088 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
When a new term is received (or when a first term is received) this should lead to a HardState being emitted which reflects the new term; there's no difference between learners and voters. A vote can be cast by a learner when it's asked to vote, but this implies that the sender of the message thought it was talking to a voter (and the vote will only matter if the active config says so).
Preemptive snapshots can cause a HardState.Term (and Commit?) to be written. You'll never have a Vote to write unless you were at some point a valid voter (you may not have known that you were a voter at the time you voted, but the node that asked you to vote knew).
pkg/storage/replica.go, line 1090 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
A learner can get asked to vote if it's already a voter and doesn't know it yet (because it hasn't caught up on the log yet). You shouldn't see a learner write down a HardState.Vote unless it has cast a vote, which means someone asked it to vote (and that someone must've thought it was a voter, and used an appropriate replicaID and term for the request). The argument here sounds right: we'd have a problem if the RHS got promoted and cast a vote that we'd then erase. But for that to happen, it needs to accept a snapshot, which it can't because of the overlapping LHS. So all the RHS can put into its HardState is a Term (and no Commit, because there's no raft log). So we'll wipe out a persisted Term. I went back to the raft thesis to convince myself that the term needs to be made durable, but now I'm not so sure why. When a vote is cast, one definitely has to remember what term it belongs to, but other than that it seems safe to always take the term from the end of the log when trying to become leader; after all terms are not reusable thanks to the persistence of the <term,vote> pairs. I might be missing something, but if not, the argument above convinces me.
I also can't think of why the current term must be persisted if there is no vote or log entries, but it's explicit in etcd/raft.MustSync. Failure to durably persist a term might cause extra election cycles after restarts, but other than that I can't think of a problem with it.
pkg/storage/replica_application_result.go, line 114 at r23 (raw file):
// result will zero-out the struct to ensure that is has properly performed all // of the implied side-effects. func (r *Replica) prepareLocalResult(ctx context.Context, cmd *replicatedCmd) (isRemoved bool) {
I don't see any assignments to this isRemoved; the method always returns false.
pkg/storage/replica_application_result.go, line 330 at r23 (raw file):
// the store will no-op when removing a replica which is already marked as removed
// unless we set ignoreDestroyStatus to true.
ignoreDestroyStatus: true,
This appears to be the only use of DestroyData: false (unless there are some that use this because it's the default). Should we just make DestroyData: false imply ignoreDestroyStatus, or could there be other uses of DestroyData: false that don't imply "already destroyed, ignore the flag"?
a6d3c37 to b4b531f
Thanks for sticking with me y'all. I'm going to try to merge this soon. I'm now hitting failures in stress that exist in master. I'll wait a bit this evening and then bors it. Thanks.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)
pkg/storage/client_raft_test.go, line 4679 at r18 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/4rd/4th/
Done.
pkg/storage/client_raft_test.go, line 1871 at r23 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Most of the time when we need a retry loop/SucceedsSoon around a replica removal, the call to MustForceReplicaGCScanAndProcess needs to be inside the loop.
Done. After this change it should happen just due to ReplicaTooOldError, but I did it anyway to avoid confusion.
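The resulting pattern, sketched from the two snippets quoted in this thread; rangeID and the store index are assumptions taken from the surrounding test:

```go
// Re-trigger the GC scan on every attempt: a single scan can race with the
// removal, so force it inside the retry loop until the replica is gone.
testutils.SucceedsSoon(t, func() error {
	mtc.manualClock.Increment(int64(storage.ReplicaGCQueueInactivityThreshold + 1))
	mtc.stores[1].MustForceReplicaGCScanAndProcess()
	if _, err := mtc.stores[1].GetReplica(rangeID); err == nil {
		return errors.Errorf("replica still exists on store 1")
	}
	return nil
})
```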
pkg/storage/replica_application_result.go, line 114 at r23 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't see any assignments to this isRemoved; the method always returns false.
Derp, I meant to remove the return value in the last pass. Done.
pkg/storage/replica_application_result.go, line 330 at r23 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This appears to be the only use of DestroyData: false (unless there are some that use this because it's the default). Should we just make DestroyData: false imply ignoreDestroyStatus, or could there be other uses of DestroyData: false that don't imply "already destroyed, ignore the flag"?
I don't have strong feelings here. I sort of like it to be more explicit.
pkg/storage/replica_destroy.go, line 38 at r21 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Very nice simplification.
Done.
pkg/storage/store.go, line 2560 at r22 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That seems like a valid point. It seems mildly awkward to detect the removal in the side effects since the "pre-apply" desc is gone, but there should be a way to detect this that's less leaky.
On the other hand, we do want to have the destroy status set before committing the batch. After all, lots of callers do things with the replica without the raftMu, and it seems like a bad idea to hand folks a replica whose in-mem state is basically bankrupt.
We definitely need to set the destroyStatus before we commit the batch. The destroyStatus controls whether we process requests.
pkg/storage/store.go, line 2718 at r23 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Two typos
Done.
pkg/storage/store.go, line 3689 at r23 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Stale comment
Done.
We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occurring by actively removing replicas when we determine that they must have been removed.

Here's a high level overview of the change:

* Once a Replica object has a non-zero Replica.mu.replicaID it will not change.
  * In this commit however, if a node crashes it may forget that it learned about a replica ID.
* If a raft message or snapshot addressed to a higher replica ID is received, the current replica will be removed completely.
* If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command.
* Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending.

This hopefully gives us some new invariants:

* There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time.
* Once a Replica has a non-zero ReplicaID it never changes.
  * This applies only to the in-memory object, not the store itself.
* Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range.
  * Corollary: a Replica created as a learner will only ever apply commands while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible.

Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range.

Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested.

Release justification: This commit is safe for 19.2 because it fixes release blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:
cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
cockroachdb#40470 "split trigger found right-hand side with tombstone"
cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
b4b531f to 2020115
Reviewed 3 of 3 files at r24.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)
TFTR! bors r+
40892: storage: ensure Replica objects never change replicaID r=ajwerner a=ajwerner

We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occurring by actively removing replicas when we determine that they must have been removed.

Here's a high level overview of the change:

* Once a Replica object has a non-zero Replica.mu.replicaID it will not change.
* If a raft message or snapshot addressed to a higher replica ID is received, the current replica will be removed completely.
* If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command.
* Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending.

This hopefully gives us some new invariants:

* There is only ever at most 1 *Replica which IsAlive() for a range on a store at a time.
* Once a *Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed.

Fixes #40367. The first commit is #40889.

Release justification: This commit is safe for 19.2 because it fixes release blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

Co-authored-by: Andrew Werner <[email protected]>
Build succeeded
40907: storage: clean up preemptive snapshot when receiving replica ID as learner r=ajwerner a=ajwerner

This commit adds an annotation to raft request messages to indicate that the sender believes the current replica is a learner. If the current replica on the recipient was created as a preemptive snapshot (it's initialized but not in the range) then we should remove that replica immediately. The first commit is #40892.

Release Justification: Release blocker, required for backwards compatibility.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
…atedNextReplicaID Fixes cockroachdb#41145. This bug was introduced in cockroachdb#40892. This may force us to pick a new SHA for the beta. Any ChangeReplicas Raft entry from 19.1 or before is going to crash a node without it. Release justification: fixes a crash in mixed version clusters. Release note: None
41148: storage: don't crash when applying ChangeReplicas trigger with DeprecatedNextReplicaID r=nvanbenschoten a=nvanbenschoten Fixes #41145. This bug was introduced in #40892. This may force us to pick a new SHA for the beta. Any ChangeReplicas Raft entry from 19.1 or before is going to crash a node without it. Release justification: fixes a crash in mixed version clusters. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Prior to this commit the partitioning tests worked by creating a 3 node cluster and then expressed constraints over the three nodes. It then validates that the cluster conforms to the constraints by querying data and examining the trace to determine which node held the data. This is problematic for one because it is susceptible to cockroachdb#40333. In rare cases we'll down-replicate to the wrong single node (e.g. if the right one is not live) and we won't ever fix it. It also doesn't exercise leaseholder preferences. This PR adds functionality to configure clusters with larger numbers of nodes where each expectation in the config can now refer to a leaseholder_preference rather than a constraint and we'll allocate the additional nodes to 3 datacenters. This larger test creates dramatically more data movement and has been useful when testing cockroachdb#40892. The PR also adds a flag to control how many of these subtests to run. Release justification: Only touches testing and is useful for testing a release blocker. Release note: None
This is an invariant that was established in [cockroachdb#40892]. We now check it more aggressively. There is an awkward exception thanks to uninitialized replicas. [cockroachdb#40892]: cockroachdb#40892 Release note: None
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

* Once a Replica object has a non-zero Replica.mu.replicaID it will not change.
* If a raft message or snapshot addressed to a higher replica ID is received, the current replica will be removed completely.
* If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command.
* Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending.

This hopefully gives us some new invariants:

* There is only ever at most 1 Replica which IsAlive() for a range on a store at a time.
* Once a Replica has a non-zero ReplicaID it never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.
Fixes #40367.
The first commit is #40889.
Release justification: This commit is safe for 19.2 because it fixes release
blockers.
Release note (bug fix): Fix crashes by preventing replica ID change.