Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: must never reuse state from old replicaIDs #40367

Closed
tbg opened this issue Aug 30, 2019 · 4 comments · Fixed by #40892
Closed

storage: must never reuse state from old replicaIDs #40367

tbg opened this issue Aug 30, 2019 · 4 comments · Fixed by #40892
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Aug 30, 2019

I was looking at the range merge tests and in particular
TestStoreRangeMergeAddReplicaRace (at its time of introduction; it has since
changed due to learners and lost a lot of its context), see

ce93650

The worry is roughly that when a snapshot that does not contain a merge is
sitting on some store (which is not part of the range) and this store is then
added to the range, the replica will likely blow up while applying the merge
trigger, is the subsumed replica is very unlikely to be present - this colocation
is only achieved for the duration of the merge txn and for the stores that
are members of the range.

At the time of the below comment, the way such a situation would come to be is
because preemptive snapshots were sent before starting the replica change txn.
So you'd send a snapshot, then do a merge, and (before we introduced the range
desc generation for that purpose) the upreplication could succeed, boom.

However, this sequence of events might unfold just the same without any preemptive snaps:

  1. add replica on store
  2. remove replica from store; replicaGC is delayed
  3. run a merge
  4. readd store

Here, in 4 I don't mean "re-add as a learner and promote to voter"! It's enough to
readd it as a learner. Raft will start catching up the old snapshot and boom, we
hit the problem above.

This actually seems like a likely problem to occur in practice if replicas are
removed and re-added frequently while merges occur, though the fact that we still
send explicit learner snapshots could mask it somewhat (that snapshot will postdate
the merge, so it can hide the problem by fast-forwarding the old replica past the
merge).

Fixing this is probably not easy. We want to "not reuse old snapshots", i.e. when the
replicaID changes we want to delete all our state and start anew. This is far from a
trivial change, but it's in the spirit of changes that we want to make anyway to lower
the complexity in the storage package. Anyway the first step here is reproducing the
problem.

cc @danhhz @nvanbenschoten

// TestStoreRangeMergeAddReplicaRace verifies that an add replica request that
// occurs concurrently with a merge is aborted.
//
// To see why aborting the add replica request is necessary, consider two
// adjacent and collocated ranges, Q and R. Say the replicate queue decides to
// rebalance Q onto store S4. It will initiate a ChangeReplicas command that
// will send S4 a preemptive snapshot, then launch a replica-change transaction
// to update R's range descriptor with the new replica. Now say the merge queue
// decides to merge Q and R after the preemptive snapshot of Q has been sent to
// S4 but before the replica-change transaction has started. The merge can
// succeed because the ranges are still collocated. (The new replica of Q is
// only considered added once the replica-change transaction commits.) If the
// replica-change transaction were to commit, the new replica of Q on S4 would
// have a snapshot of Q that predated the merge. In order to catch up, it would
// need to apply the merge trigger, but the merge trigger will explode because
// S4 does not have a replica of R.
//
// To avoid this scenario, ChangeReplicas commands will abort if they discover
// the range descriptor has changed between when the snapshot is sent and when
// the replica-change transaction starts.
//
// There is a particularly diabolical edge case here. Consider the same
// situation as above, except that Q and R merge together and then split at
// exactly the same key, all before the replica-change transaction starts. Q's
// range descriptor will have the same start key, end key, and next replica ID
// that it did when the preemptive snapshot started. That is, it will look
// unchanged! To protect against this, range descriptors contain a generation
// counter, which is incremented on every split or merge. The presence of this
// counter means that ChangeReplicas commands can detect and abort if any merges
// have occurred since the preemptive snapshot, even if the sequence of splits
// or merges left the keyspan of the range unchanged. This diabolical edge case
// is tested here.
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. labels Aug 30, 2019
@tbg tbg self-assigned this Aug 30, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor
that they use as a basis for a CPut to make sure the operations mutating
the range serialize. This is great for correctness but generally
unfortunate for usability since on a mismatch, the caller usually wanted
to do the thing they were trying to do anyway, using the new descriptor.
The fact that every split (replication change, merge) basically needs a
retry loop is constant trickle of test flakes and UX papercuts.

It became more pressing to do something against this as we are routinely
using joint configs when atomic replication changes are enabled. A joint
configuration is transitioned out of opportunistically whenever it is
encountered, but this frequently causes a race in which actor A finds
a joint config, begins a transaction out of it but is raced by actor B
getting there first. The end result is that what actor A wanted to
achieve has been achieved, though by someone else, and the result
is a spurious error.

This commit fixes that behavior in the targeted case of wanting to leave
a joint configuration: actor A will get a successful result.

Before this change,

make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal

would fail immediately when `kv.atomic_replication_changes.enabled` was
true because the splits this test carries out would run into the joint
configuration changes of the initial upreplication, and would race the
replicate queue to transition out of them, which at least one split
would typically lose. This still happens, but now it's not an error any
more.

I do think that it makes sense to use a similar strategy in general
(fail replication changes only if the *replicas* changed, allow all
splits except when the split key moves out of the current descriptor,
etc) but in the process of coding this up I got reminded of all of
the problems relating to range merges and also found what I think is
a long-standing pretty fatal bug, cockroachdb#40367, so I don't want to do anything
until the release is out of the door. But I'm basically convinced that
if we did it, it wouldn't cause a new "bug" because any replication
change carried out in that way is just one that could be triggered
just the same by a user under the old checks.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Sep 3, 2019
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor
that they use as a basis for a CPut to make sure the operations mutating
the range serialize. This is great for correctness but generally
unfortunate for usability since on a mismatch, the caller usually wanted
to do the thing they were trying to do anyway, using the new descriptor.
The fact that every split (replication change, merge) basically needs a
retry loop is constant trickle of test flakes and UX papercuts.

It became more pressing to do something against this as we are routinely
using joint configs when atomic replication changes are enabled. A joint
configuration is transitioned out of opportunistically whenever it is
encountered, but this frequently causes a race in which actor A finds
a joint config, begins a transaction out of it but is raced by actor B
getting there first. The end result is that what actor A wanted to
achieve has been achieved, though by someone else, and the result
is a spurious error.

This commit fixes that behavior in the targeted case of wanting to leave
a joint configuration: actor A will get a successful result.

Before this change,

make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal

would fail immediately when `kv.atomic_replication_changes.enabled` was
true because the splits this test carries out would run into the joint
configuration changes of the initial upreplication, and would race the
replicate queue to transition out of them, which at least one split
would typically lose. This still happens, but now it's not an error any
more.

I do think that it makes sense to use a similar strategy in general
(fail replication changes only if the *replicas* changed, allow all
splits except when the split key moves out of the current descriptor,
etc) but in the process of coding this up I got reminded of all of
the problems relating to range merges and also found what I think is
a long-standing pretty fatal bug, cockroachdb#40367, so I don't want to do anything
until the release is out of the door. But I'm basically convinced that
if we did it, it wouldn't cause a new "bug" because any replication
change carried out in that way is just one that could be triggered
just the same by a user under the old checks.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Sep 4, 2019
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor
that they use as a basis for a CPut to make sure the operations mutating
the range serialize. This is great for correctness but generally
unfortunate for usability since on a mismatch, the caller usually wanted
to do the thing they were trying to do anyway, using the new descriptor.
The fact that every split (replication change, merge) basically needs a
retry loop is constant trickle of test flakes and UX papercuts.

It became more pressing to do something against this as we are routinely
using joint configs when atomic replication changes are enabled. A joint
configuration is transitioned out of opportunistically whenever it is
encountered, but this frequently causes a race in which actor A finds
a joint config, begins a transaction out of it but is raced by actor B
getting there first. The end result is that what actor A wanted to
achieve has been achieved, though by someone else, and the result
is a spurious error.

This commit fixes that behavior in the targeted case of wanting to leave
a joint configuration: actor A will get a successful result.

Before this change,

make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal

would fail immediately when `kv.atomic_replication_changes.enabled` was
true because the splits this test carries out would run into the joint
configuration changes of the initial upreplication, and would race the
replicate queue to transition out of them, which at least one split
would typically lose. This still happens, but now it's not an error any
more.

I do think that it makes sense to use a similar strategy in general
(fail replication changes only if the *replicas* changed, allow all
splits except when the split key moves out of the current descriptor,
etc) but in the process of coding this up I got reminded of all of
the problems relating to range merges and also found what I think is
a long-standing pretty fatal bug, cockroachdb#40367, so I don't want to do anything
until the release is out of the door. But I'm basically convinced that
if we did it, it wouldn't cause a new "bug" because any replication
change carried out in that way is just one that could be triggered
just the same by a user under the old checks.

Release note: None
@tbg tbg changed the title storage: must never reuse snapshots from old replicaIDs storage: must never reuse state from old replicaIDs Sep 6, 2019
@tbg
Copy link
Member Author

tbg commented Sep 6, 2019

Another aspect of this came up in https://github.com/cockroachdb/cockroach/pull/39796/files. (see commit message of the testing commit). If a peer is removed, it seems possible that it would be caught up past the replication change that removes it. At the time of writing, this happens routinely when peers are removed and re-added without undergoing replicaGC in the meantime, but it's always an awkward situation to handle. With #40459 it will become the exception, so we want to make sure that it's not possible at all.

@ajwerner this is closely related to #40459. There, I think your latest approach drops incoming raft messages until the state has been GC'ed. What we want to do here is basically that when a replica sees itself get removed in a ChangeReplicasTrigger log entry, it needs to replicaGC itself. There's not anything to "drop" really, so I don't think we can use the strategy of "stalling" until the replicaGCQueue comes along. We sort of need the replica to just stop existing in-place, I think. Curious if you had any thoughts on this.

tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change,
namely the fact that there might be a tombstone for the right
hand side. This continues to be tracked in cockroachdb#40470.

See also
cockroachdb#40367 (comment)
for more work we need to do to establish sanity.

Release note: None
@ajwerner
Copy link
Contributor

ajwerner commented Sep 6, 2019

We sort of need the replica to just stop existing in-place, I think. Curious if you had any thoughts on this.

Just to check my understanding, the issue is that we need this replica to stop applying raft entries as soon as it notices that it has been removed.

func (r *Replica) handleChangeReplicasResult(ctx context.Context, chng *storagepb.ChangeReplicas) {
storeID := r.store.StoreID()
var found bool
for _, rDesc := range chng.Replicas() {
if rDesc.StoreID == storeID {
found = true
break
}
}
if !found {
// This wants to run as late as possible, maximizing the chances
// that the other nodes have finished this command as well (since
// processing the removal from the queue looks up the Range at the
// lease holder, being too early here turns this into a no-op).
r.store.replicaGCQueue.AddAsync(ctx, r, replicaGCPriorityRemoved)
}
}

It seems like if the replica were to set its destroyStatus and then find a way to bail out of entry application at that point then we'd be okay waiting for replicaGC.

Additionally we should probably ignore ticks for destroyed replicas.

Another concern I have is the tight loop spinning waiting for replicaGC in tryGetOrCreateReplica():

} else if ds := repl.mu.destroyStatus; ds.reason == destroyReasonRemoved {

@ajwerner
Copy link
Contributor

ajwerner commented Sep 6, 2019

It seems like we're ultimately trying to eliminate setReplicaID(). Is there any reason why we want that method to continue to exist after we have learners and we prevent processing entries across a remove and re-add?

@tbg
Copy link
Member Author

tbg commented Sep 6, 2019

Just to check my understanding,

Yep, exactly. We don't want it to advance on the log past its removal. In fact we'd do well to not even process the removal itself (or it will be harder to manage the case in which a node restarts immediately after).

It seems like if the replica were to set its destroyStatus and then find a way to bail out of entry application at that point then we'd be okay waiting for replicaGC.

Yeah, that's the part I wasn't sure we could achieve reasonably well. It seems conceptually OK (propagate a "stop" flag when we see a raft command that would remove the replica from the desc and handle that in the right place to set the destroy status, and make it such that replicas which are destroyed never handle readies)

Also cc @danhhz because we spent many a merry 1:1 discussing how good it would be if we had this invariant about replicas always being in their range descriptor, and now it seems that we might be getting it earlier than we thought it possible.

tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change,
namely the fact that there might be a tombstone for the right
hand side. This continues to be tracked in cockroachdb#40470.

See also
cockroachdb#40367 (comment)
for more work we need to do to establish sanity.

Release note: None
@ajwerner ajwerner assigned ajwerner and unassigned tbg Sep 12, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 13, 2019
WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.

On the plus side it does not seem to reproduce any crashes in hours with the
`partitionccl.TestRepartitioning` which readily produces crashes on master
when run under roachprod stress within ~20 minutes.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.
 * The queues are now replica ID aware. If a replica was added to the queue
   and the replica found when trying to pop are not the same and we knew the
   replica ID of replica when we added it then we should not process it.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 13, 2019
WARNING: this change needs more testing to target the changes it makes
and at least some of the disabled tests should be reworked. This is a big
and scary change at this point in the cycle so I'm getting it out before
I'm really happy with it. There are some known TODOs.

On the plus side it does not seem to reproduce any crashes in hours with the
`partitionccl.TestRepartitioning` which readily produces crashes on master
when run under roachprod stress within ~20 minutes.

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.
 * The queues are now replica ID aware. If a replica was added to the queue
   and the replica found when trying to pop are not the same and we knew the
   replica ID of replica when we added it then we should not process it.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
craig bot pushed a commit that referenced this issue Sep 24, 2019
40892:  storage: ensure Replica objects never change replicaID r=ajwerner a=ajwerner

We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 *Replica which IsAlive() for a range on a store
   at a time.
 * Once a *Replica has a non-zero ReplicaID is never changes.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes #40367.

The first commit is #40889.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 2020115 Sep 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
2 participants