kv: write replica tombstone on replica creation instead of removal #64966

Closed
nvanbenschoten opened this issue May 10, 2021 · 1 comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.

Comments


nvanbenschoten commented May 10, 2021

In the following TODO, we detail a hazard that can allow the replica ID for a range on a node to regress:

// TODO(ajwerner): Ideally, if this store had ever learned that the replica
// created by the split was removed, it would not forget that fact.
// There exists one edge case where the store may learn that it should house
// a replica of the same range with a higher replica ID and then forget.
// If the first raft message this store ever receives for this range
// contains a replica ID higher than the replica ID in the split trigger,
// then an in-memory replica at that higher replica ID will be created and
// no tombstone at a lower replica ID will be written. If the server then
// crashes, it will forget that it had ever been the higher replica ID. The
// server may then proceed to process the split and initialize a replica at
// the replica ID implied by the split. This is potentially problematic: the
// replica may have voted as the higher replica ID, and when it rediscovers
// the higher replica ID it will delete all of the state corresponding to the
// older replica ID, including its hard state, which may have been synthesized
// with votes cast as the newer replica ID. This case tends to be handled
// safely in practice because the replica should only be receiving messages
// as the newer replica ID after it has been added to the range. Prior to
// learner replicas, we would only add a store to a range after we had
// successfully applied a preemptive snapshot. If the store were to split
// between the preemptive snapshot and the addition, the addition would fail
// due to the conditional-put logic. With learner replicas we're still okay,
// because we won't promote a learner unless we succeed in sending it a
// learner snapshot. If we fail to send the snapshot, the replica never
// becomes a voter, so its votes don't matter and are safe to discard.
//
// Despite the safety provided by the change-replicas protocol explained
// above, it would be good to know for sure that the replica ID for a range
// on a store is always monotonically increasing, even across restarts.

The solution to this is described in the TODO and also in the discussion on #40892, where @ajwerner suggested:

I'm also curious about how y'all feel re: writing a tombstone in getOrCreateReplica when creating an uninitialized replica with a non-zero replica ID. That would provide the invariant that a store will always have a monotonically increasing set of non-zero replica IDs for a given range even across restart.

@bdarnell and @nvanbenschoten both liked the idea but felt it should be deferred to a later change. The time for that change has come.

The plan here is to write a tombstone (see setTombstoneKey) that boxes out replicas with IDs lower than a given replica's ID when that replica is created, rather than (or in addition to?) when it is destroyed. This ensures that even if the replica is never initialized before the node restarts, a replica with a smaller ID (which is known to be stale) cannot be created on the node after the restart.
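
For concreteness, here is a minimal sketch of what the creation-time write could look like, modeled on the tombstone check in the snippet quoted further down. The writeTombstoneOnCreate name, the MVCCPutProto call shape, and the keys.RangeTombstoneKey helper are assumptions for illustration, not a reference to the shipped implementation:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/keys"
    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// writeTombstoneOnCreate persists a tombstone that boxes out all replica IDs
// below the one being created, so that the "replica IDs are monotonic per
// range per store" invariant survives a restart even if the new replica is
// never initialized.
func writeTombstoneOnCreate(
    ctx context.Context, eng storage.Engine,
    rangeID roachpb.RangeID, replicaID roachpb.ReplicaID,
) error {
    tombstoneKey := keys.RangeTombstoneKey(rangeID)
    // Read the existing tombstone (if any) so we never regress it.
    var ts roachpb.RangeTombstone
    if _, err := storage.MVCCGetProto(
        ctx, eng, tombstoneKey, hlc.Timestamp{}, &ts, storage.MVCCGetOptions{},
    ); err != nil {
        return err
    }
    if ts.NextReplicaID >= replicaID {
        return nil // an equal or newer tombstone is already durable
    }
    // NextReplicaID = replicaID rejects all IDs strictly below the one being
    // created; the destroy path would instead write removedID+1.
    ts.NextReplicaID = replicaID
    return storage.MVCCPutProto(
        ctx, eng, nil /* ms */, tombstoneKey, hlc.Timestamp{}, nil /* txn */, &ts,
    )
}

Note the asymmetry with the destroy path: destruction writes removedID+1 so the removed replica itself is also boxed out, while creation only needs to reject strictly older IDs.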

Fixing this will cause TestProcessSplitAfterRightHandSideHasBeenRemoved/(4)_initial_replica_RHS_partition,_with_restart to break, because the split will then refuse to create an RHS replica with an ID below the ID in the tombstone.

Jira issue: CRDB-7367

@nvanbenschoten nvanbenschoten added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels May 10, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels May 31, 2022
@erikgrinaker (Contributor) commented:

This should be fixed by the RaftReplicaID work in #75761:

// Initialize the Replica with the replicaID.
if err := func() error {
    // Check for a tombstone again now that we've inserted into the Range
    // map. This double-checked locking ensures that we avoid a race where a
    // replica is created and destroyed between the initial unsynchronized
    // tombstone check and the Range map linearization point. By checking
    // again now, we make sure to synchronize with any goroutine that wrote
    // a tombstone and then removed an old replica from the Range map.
    if ok, err := storage.MVCCGetProto(
        ctx, s.Engine(), tombstoneKey, hlc.Timestamp{}, &tombstone, storage.MVCCGetOptions{},
    ); err != nil {
        return err
    } else if ok && replicaID < tombstone.NextReplicaID {
        return &roachpb.RaftGroupDeletedError{}
    }
    // An uninitialized replica should have an empty HardState.Commit at
    // all times. Failure to maintain this invariant indicates corruption.
    // And yet, we have observed this in the wild. See #40213.
    if hs, err := repl.mu.stateLoader.LoadHardState(ctx, s.Engine()); err != nil {
        return err
    } else if hs.Commit != 0 {
        log.Fatalf(ctx, "found non-zero HardState.Commit on uninitialized replica %s. HS=%+v", repl, hs)
    }
    // Write the RaftReplicaID for this replica. This is the only place in
    // the CockroachDB code where we create a new *uninitialized* replica.
    // Note that it is possible that we have already created the HardState
    // for an uninitialized replica, then crashed, and on recovery are
    // receiving a raft message for the same or a later replica.
    // - Same replica: we are overwriting the RaftReplicaID with the same
    //   value, which is harmless.
    // - Later replica: there may be an existing HardState for the older
    //   uninitialized replica with Commit=0 and non-zero Term and Vote.
    //   Using the Term and Vote values of that older replica in the context
    //   of this newer replica is harmless since it just limits the votes for
    //   this replica.
    //
    // Compatibility:
    // - v21.2 and v22.1: v22.1 unilaterally introduces RaftReplicaID (an
    //   unreplicated range-id local key). If a v22.1 binary is rolled back
    //   at a node, the fact that RaftReplicaID was written is harmless to a
    //   v21.2 node since it does not read it. When a v21.2 node drops an
    //   initialized range, the RaftReplicaID will also be deleted because
    //   the whole range-ID local key space is deleted.
    //
    // - v22.2: we will start relying on the presence of RaftReplicaID, and
    //   remove any uninitialized replicas that have a HardState but no
    //   RaftReplicaID. This removal will happen in ReplicasStorage.Init and
    //   allow us to tighten invariants. Additionally, knowing the ReplicaID
    //   for an uninitialized range could allow a node to somehow contact the
    //   raft group (say by broadcasting to all nodes in the cluster), and if
    //   the ReplicaID is stale, would allow the node to remove the HardState
    //   and RaftReplicaID. See
    //   https://github.com/cockroachdb/cockroach/issues/75740.
    //
    // There is a concern that some replica could survive from v21.2 to
    // v22.1 to v22.2 in an uninitialized state and be incorrectly removed
    // in ReplicasStorage.Init, causing the loss of HardState.{Term,Vote}
    // and leading to "split-brain" wrt leader election.
    //
    // Even though this seems theoretically possible, it is considered
    // practically impossible, and not just because a replica's vote is
    // unlikely to stay relevant across 2 upgrades. For one, we're always
    // going through learners and don't promote until caught up, so
    // uninitialized replicas generally never get to vote. Second, even if
    // their vote somehow mattered (perhaps we sent a learner a snap which
    // was not durably persisted - which we also know is impossible, but
    // let's assume it - and then promoted the node and it immediately
    // power-cycled, losing the snapshot) the fire-and-forget way in which
    // raft votes are requested (in the same raft cycle) makes it extremely
    // unlikely that the restarted node would then receive it.
    if err := repl.mu.stateLoader.SetRaftReplicaID(ctx, s.Engine(), replicaID); err != nil {
        return err
    }
    return repl.loadRaftMuLockedReplicaMuLocked(uninitializedDesc)
}(); err != nil {