72527: docs: RFC loss of quorum recovery r=aliher1911 a=aliher1911

This RFC describes a proposed approach to recovering data ranges that have lost consensus in their Raft groups because
of the permanent loss of too many replicas.

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
craig[bot] and aliher1911 committed Apr 4, 2022
2 parents ca927f0 + 4179a8f commit 5b4a7b0
docs/RFCS/20211103_loss_of_quorum_recovery.md (237 additions, 0 deletions)
- Feature Name: Inconsistent Recovery from Permanent Loss of Quorum
- Status: accepted
- Start Date: 2021-09-23
- Authors: Tobias Grieger, Oleg Afanasyev
- RFC PR: [#72527](https://github.com/cockroachdb/cockroach/pull/72527)
- Cockroach Issue: https://github.com/cockroachdb/cockroach/issues/17186

# Motivation and Summary

Our story around recovering from loss of quorum is officially nonexistent (“restore from a backup into a new cluster” is
the official guidance), and yet internal tools exist and are used with some regularity. Our customers (internal and
external) are understandably asking for official tools and guidelines. Even if it may be hard, at least in the 22.1
timeframe, to improve our tooling to be suitable for end users, significant improvements are possible.

We propose a plan to productionize our tooling for recovery from loss of quorum for use in situations in which operators
want to restore service and allow as much data as possible to be recovered, without regard for its consistency. The plan revolves
around extensions of `cockroach debug unsafe-remove-dead-replicas` that primarily improve the UX, detect (& avoid doing
harm in) situations in which the tool cannot be used, and give an indication of the internal state of the cluster
expected after the recovery operation.

We will have an official, documented, well-tested, and ergonomic tool that both internal and external customers can
use. This will fortify our story around high availability. The tools will form a solid foundation for improved
functionality, such as making the process “push-button” against the running cluster. There is also a possibility to
reuse parts of the suggested machinery to build a tool to roll back a production cluster to a known consistent state.

# Technical design

## Status Quo

A quick reminder about the status quo. The recovery protocol is:

* Determine the dead nodes A, B, …
* Stop all nodes in the cluster
* Invoke `./cockroach debug unsafe-remove-dead-replicas --dead-store-ids=A,B,...` on each store in the system
* Restart the cluster.

`unsafe-remove-dead-replicas` iterates through all Range Descriptors on the store. It then makes the (unjustified, but
often enough correct) assumptions that for any Range needing recovery,

* **(AllLive)** Any Replica found on any store is live (i.e. not waiting for replicaGC)
* **(Covering)** There is a replica for each range that needs recovery, i.e. we can cover the entire keyspace.
* **(ConsistentRangeDescriptor)** All Replicas of a given Range (across multiple stores) have the same view of the range
descriptor.
* **(RaftLog)** the surviving replicas’ applied state is not “missing” any committed operations that affect the
RangeDescriptor. (Such a missing operation could be in the follower’s log but be unapplied, or the follower could not
even have learned about it).

On each store, based on **(ConsistentRangeDescriptor)**, the process can deterministically (across all stores) determine
from the RangeDescriptor (and the input set of lost nodes) the ranges that need recovery and choose per-range designated
survivors. At present, it chooses, among the surviving Replicas, the voter with the highest StoreID. On this
Replica, it rewrites the Range Descriptor in-place to contain only that voter as its sole member. When that store then
comes back online, and the system ranges of the cluster are functional, that range can thus make progress. It will
up-replicate and in doing so fix up the meta2 entries (which initially reflect the outdated range descriptor). Fixing up
the meta2 record will cause any additional non-chosen survivors to become eligible for replicaGC.
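
To make the mechanism concrete, here is a minimal, self-contained sketch of the descriptor rewrite using simplified
stand-in types (the real tool operates on `roachpb.RangeDescriptor` records read from the store, and persisting the
rewritten descriptor is elided here):

```go
package main

import "fmt"

// Simplified stand-ins for the real descriptor types; the actual tool works on
// roachpb.RangeDescriptor records read from the store's local keyspace.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// designateSurvivor mimics the status-quo heuristic: among replicas that do not
// live on a dead store, pick the voter with the highest StoreID and rewrite the
// descriptor so that it is the sole member.
func designateSurvivor(desc RangeDescriptor, deadStores map[int]bool) (RangeDescriptor, bool) {
	var survivor *ReplicaDescriptor
	for i := range desc.Replicas {
		r := &desc.Replicas[i]
		if deadStores[r.StoreID] {
			continue
		}
		if survivor == nil || r.StoreID > survivor.StoreID {
			survivor = r
		}
	}
	if survivor == nil {
		return desc, false // no survivors: this range cannot be recovered this way
	}
	desc.Replicas = []ReplicaDescriptor{*survivor}
	return desc, true
}

func main() {
	desc := RangeDescriptor{
		RangeID: 7,
		Replicas: []ReplicaDescriptor{
			{NodeID: 1, StoreID: 1, ReplicaID: 1},
			{NodeID: 2, StoreID: 2, ReplicaID: 2},
			{NodeID: 3, StoreID: 3, ReplicaID: 3},
		},
	}
	rewritten, ok := designateSurvivor(desc, map[int]bool{2: true, 3: true})
	fmt.Println(ok, rewritten.Replicas) // true [{1 1 1}]
}
```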

The above process is how the tool was originally conceived of. We realized that it can also be used in a rolling-restart
fashion. The upshot of that strategy is that it avoids downtime of the entire cluster in cases where only a few ranges
need recovery. In that case, the strategy is as follows:

* Decide which range to recover first. This doesn’t matter much if all ranges are user ranges, but if one of them is a
critical system range, that one needs to come first - as availability of system ranges is a prerequisite for rejoining
the cluster. (For example, if the liveness range is down, the cluster is effectively down and restarted nodes won’t
get past the boot sequence; recovering the liveness range is going to be the most important action)
* Take down the node which is going to be the designated survivor for the chosen range (i.e. voter with highest storeID)
and run the tool on it.
* Restart that node. The chosen range should now be available again and upreplicate.
* Repeat the process until all ranges are recovered.

A caveat observed with this approach has been that if there are multiple replicas remaining for a range (as is likely
the case with system ranges, which prefer 5x replication), bringing the designated survivor back does not necessarily
restore availability for that range. This is because the non-designated survivors are still “around” initially and are
present both in the meta2 records and cached in the DistSenders. Any requests routed to them will hang (or, once we
have https://github.com/cockroachdb/cockroach/issues/33007, fail-fast), which requires targeted additional restarts of
these non-designated survivors to unblock service. In short, the rolling-restart strategy works, but it’s very manual
and in the worst case causes longer downtime than the stop-the-world approach.

## Step 1: Improved Offline Approach

We first fortify the stop-the-world approach for safety and improved choice of designated survivors.

### Data Collection

We pointed out above that in the status quo, many assumptions are made about the state being consistent across the
various stores, which allows for local decision-making. These assumptions are obviously not going to hold in general,
and the possible problems arising from this make us wary of wider use of the tool. We propose to add an
explicit distributed data collection step that allows picking the “best” designated survivor and that also catches
classes of situations in which the tool could cause undesired behavior. When such a situation is detected, the tool
would refuse to work, and an override would have to be provided; one would be in truly uncharted territory in this
case, with no guarantees given whatsoever. Over time, we can then evolve the tool to handle more corner cases.

The distributed data collection step is initially an offline command, i.e. a
command `./cockroach debug recover collect-info -store=<store>` is run directly against the store directories. Much of
the current code in `unsafe-remove-dead-replicas` is reused: an iteration over all range descriptors takes place, but
now information about all of the replicas is written to a file or printed to stdout.

This includes:
- Current StoreID
- RangeDescriptor
- RangeAppliedIndex
- RaftCommittedIndex
- Whether the unapplied Raft log contains any changes to the range descriptor.
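
A rough sketch of the per-replica record, with illustrative field names (the actual implementation serializes
protobufs and the exact names and shapes may differ):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicaInfo is an illustrative shape for the per-replica record emitted by
// `collect-info`; the real implementation serializes protobufs and the exact
// field names may differ.
type ReplicaInfo struct {
	StoreID            int    `json:"storeID"`
	RangeID            int    `json:"rangeID"`
	Descriptor         []byte `json:"descriptor"` // serialized RangeDescriptor as found on disk
	RangeAppliedIndex  uint64 `json:"rangeAppliedIndex"`
	RaftCommittedIndex uint64 `json:"raftCommittedIndex"`
	// True if the unapplied tail of the Raft log contains an entry that changes
	// the range descriptor (which would violate the (RaftLog) assumption).
	HasUnappliedDescriptorChange bool `json:"hasUnappliedDescriptorChange"`
}

func main() {
	rec := ReplicaInfo{StoreID: 2, RangeID: 17, RangeAppliedIndex: 120, RaftCommittedIndex: 123}
	out, _ := json.MarshalIndent(rec, "", "  ") // one record per surviving replica
	fmt.Println(string(out))
}
```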

These files are collected in a central location, and we proceed to the next step:

### Making a Plan

`./cockroach debug recover make-plan <files>` is invoked. This

* Prunes the set of input replicas to be generationally consistent. For example, if we “only” have a replica r1=[a,z) at
generation 5 and a replica r2=[c,z) at generation 7, we have to discard r1 (since r1@gen5 is an old replica; it must
have split, perhaps multiple times, to become a left hand side that touches r2). This leaves a hole [a,c) in the
keyspace, i.e. it would fail the next step.
Note that this step also eliminates duplicate survivors whose range descriptors do not agree. For example, if we have
a replica r1=[a,z)@gen5 and r1=[a,z)@gen7, we’ll eliminate the first. Note that this is an example of a general class
of scenarios in which a split happened that the survivors don’t know about; in this case the right-hand side might
exist in the cluster (and may not even need recovery). We thus detect this and will fail below, due to a hole in the
keyspace.
This step ensures that **(AllLive)** and **(ConsistentRangeDescriptor)** hold for the remaining set of replicas.
* For all ranges that have multiple surviving replicas, pick the one with the highest RaftCommittedIndex; ties are
resolved in favor of the highest StoreID, as sketched in the code after this list. (See “Follow-up Work” for a comment on this)
* Verifies that the remaining set of replicas covers the entire keyspace. Due to the generational consistency, there
will be no partial overlap, i.e. we have a partition if there are no holes. This verifies **(Covering)**.
* If any of the remaining replicas indicates an unapplied change to the range descriptor in the log, complain. This
verifies **(RaftLog)**. (We can likely do something more advanced here, but let’s defer that for later).
* Write out this resulting partition to a file; also separately list the replicas that were *not* chosen. This is the
recovery plan. If there were any problems above, don’t write the plan unless a `--force` flag is specified.
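
As a minimal sketch of the survivor selection rule referenced above (simplified types, not the actual planner code):

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a trimmed-down view of a surviving replica, sufficient to show
// the selection rule (the real planner works on the full collected records).
type candidate struct {
	StoreID            int
	RaftCommittedIndex uint64
}

// pickSurvivor chooses the designated survivor for one range: the highest
// RaftCommittedIndex wins, and ties are broken in favor of the highest StoreID.
func pickSurvivor(cands []candidate) candidate {
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].RaftCommittedIndex != cands[j].RaftCommittedIndex {
			return cands[i].RaftCommittedIndex > cands[j].RaftCommittedIndex
		}
		return cands[i].StoreID > cands[j].StoreID
	})
	return cands[0]
}

func main() {
	survivors := []candidate{
		{StoreID: 1, RaftCommittedIndex: 120},
		{StoreID: 5, RaftCommittedIndex: 118},
		{StoreID: 3, RaftCommittedIndex: 120},
	}
	fmt.Println(pickSurvivor(survivors)) // {3 120}: highest committed index, ties by StoreID
}
```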

Note that choosing the highest CommittedIndex isn’t necessarily the best option. If we are to determine with full
accuracy the most up-to-date replica, we need to either perform an offline version of Raft log reconciliation (which
requires O(len-unapplied-raft-log) of space in the data collection in the worst case, i.e. a non-starter) or we need to
resuscitate multiple survivors per range if available (but this raises questions about nondeterminism imparted by the
in-place rewriting of the RangeDescriptor). Conceptually it would be desirable to replace the rewrite step with an
“append” to the RaftLog.

### Rewriting

Now it’s time to rewrite range descriptors. The recovery plan is distributed to each node, and on each
store `./cockroach debug recover apply-plan --store=<store> <plan-file>` is invoked. This first creates an LSM checkpoint,
just in case. Then, for all replicas in the plan for which the local StoreID is chosen as the designated survivor, the
rewrite of the range descriptor is carried out to have the designated survivor as the single voter. All other replicas
of ranges that need recovery but for which the local store’s replica was not chosen are replicaGC’ed; this is done to
prevent these replicas from trying to handle requests once the store rejoins the cluster. In addition to the above step,
we also mark each store as having undergone a recovery procedure, for example by writing a store-local marker key, or by
copying the recovery plan into the store’s auxiliary data directory. This can then be included in debug.zip and metrics
to indicate to our Support team that inconsistent recovery was applied to the cluster should it cause problems in the
future.
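
A rough sketch of the per-store apply step; the helpers (`checkpointLSM`, `rewriteToSoleVoter`, `replicaGC`,
`writeRecoveryMarker`) are hypothetical placeholders for engine-level operations, not actual CockroachDB APIs:

```go
package main

import "fmt"

// planEntry is a trimmed-down view of one entry in the recovery plan.
type planEntry struct {
	RangeID            int
	DesignatedSurvivor int // StoreID chosen by `make-plan` for this range
}

// applyPlan sketches what `apply-plan` does on a single store: rewrite the
// descriptor where this store holds the designated survivor, replicaGC other
// local survivors of recovered ranges, and leave a marker behind.
func applyPlan(localStoreID int, plan []planEntry, localReplicas map[int]bool) {
	checkpointLSM() // take an LSM checkpoint first, just in case
	for _, e := range plan {
		if !localReplicas[e.RangeID] {
			continue // this store holds no replica of the range
		}
		if e.DesignatedSurvivor == localStoreID {
			// Rewrite the range descriptor so the local replica is the sole voter.
			rewriteToSoleVoter(e.RangeID, localStoreID)
		} else {
			// Remove non-chosen survivors so they cannot serve requests after restart.
			replicaGC(e.RangeID)
		}
	}
	// Record that inconsistent recovery was applied (e.g. a store-local marker key).
	writeRecoveryMarker(plan)
}

// Hypothetical stubs so the sketch compiles; real implementations would operate
// on the store's storage engine while the node is offline.
func checkpointLSM()                          {}
func rewriteToSoleVoter(rangeID, storeID int) {}
func replicaGC(rangeID int)                   {}
func writeRecoveryMarker(plan []planEntry)    {}

func main() {
	plan := []planEntry{{RangeID: 7, DesignatedSurvivor: 1}, {RangeID: 9, DesignatedSurvivor: 4}}
	applyPlan(1, plan, map[int]bool{7: true, 9: true})
	fmt.Println("recovery plan applied to store s1")
}
```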

## Step 2: Half-Online approach

The stop-the-world approach can be unnecessarily intrusive.

The cluster doesn't have to be stopped for the full duration of all the recovery stages. It only needs to be stopped to
apply changes to the stores once unavailable replicas are identified and recovery operations are planned. Moreover, since
ranges that lost quorum can't make progress anyway, the update could be performed in a rolling fashion.

With this in mind, the recovery procedure could be automated to:

* Create a recovery plan by pointing the CLI at a single node. The node or the CLI itself would do a fan-out to collect
the data; see the sketch after this list. The data is similar to what the Ranges endpoint provides and might even be
sufficient for replica info collection purposes.
  * Note that the collection phase should only use gossip and the Ranges RPC (or the like) so that it still works when
KV is unavailable due to system ranges losing quorum.
* Copy the plan to each node.
* Operator performs rolling restart of nodes the usual way.
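
A minimal sketch of the fan-out collection referenced in the first bullet; the node address list and the
`fetchReplicaInfo` call are hypothetical stand-ins (in practice the addresses would come from gossip and the payload
would resemble what the Ranges endpoint returns):

```go
package main

import (
	"fmt"
	"sync"
)

// replicaRecord stands in for the per-replica data returned by each node.
type replicaRecord struct {
	NodeID  int
	RangeID int
}

// collectFromNodes sketches the fan-out phase: ask every reachable node for its
// local replica records in parallel and merge the results. fetchReplicaInfo is
// a hypothetical stand-in for a status RPC that must not depend on KV
// availability (gossip-based routing, Ranges-like payload).
func collectFromNodes(nodeAddrs []string) []replicaRecord {
	var (
		mu      sync.Mutex
		records []replicaRecord
		wg      sync.WaitGroup
	)
	for _, addr := range nodeAddrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			recs := fetchReplicaInfo(addr)
			mu.Lock()
			records = append(records, recs...)
			mu.Unlock()
		}(addr)
	}
	wg.Wait()
	return records
}

// Hypothetical stub; a real implementation would issue an RPC to the node.
func fetchReplicaInfo(addr string) []replicaRecord {
	return []replicaRecord{{NodeID: 1, RangeID: 7}}
}

func main() {
	recs := collectFromNodes([]string{"n1:26257", "n2:26257", "n3:26257"})
	fmt.Println("collected", len(recs), "replica records") // collected 3 replica records
}
```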

This approach avoids all the error-prone manual steps to collect and distribute data, while also avoiding the “harder”
engineering problem of applying a recovery plan to a running cluster.

We keep the fully-offline step since it’s a good fallback; nodes may not even start or connect gossip when things are
sufficiently down.

## Drawbacks

Once the unsafe recovery tool is readily available and proves successful, operators could start using it even when it is
not appropriate. Since data inconsistency will not always manifest, this could give a false sense of safety.

## Follow-up work

In both the Offline and Half-Online approaches we could provide a more detailed report of what data is potentially
inconsistent after the recovery. The report would contain data about which tables, indexes, and key ranges are affected
(e.g. table foo.bar with keys in range ['a', 'b')) and provide recommendations on how to restore data consistency where
possible (e.g. secondary indexes affected by range recovery could be dropped and recreated).
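
Purely as an illustration, a report entry might look something like the following (hypothetical shape; no such report
exists today):

```go
package main

import "fmt"

// recoveryReportEntry is a hypothetical shape for one entry of such a report;
// nothing like it exists in the tooling today.
type recoveryReportEntry struct {
	Table     string // e.g. "foo.bar"
	Index     string // empty for the primary index
	StartKey  string
	EndKey    string
	Suggested string // remediation hint
}

func main() {
	e := recoveryReportEntry{
		Table:     "foo.bar",
		Index:     "bar_secondary_idx",
		StartKey:  "a",
		EndKey:    "b",
		Suggested: "secondary index overlaps a recovered range; drop and recreate it",
	}
	fmt.Printf("%+v\n", e)
}
```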

As a next step, we can make the tool fully-online, i.e. automate the data collection and recovery step behind a cli/API
endpoint, thus providing a "panic button" that will try to get the cluster back to an available (albeit possibly
userdata-inconsistent) state from which a new cluster can be seeded. This must avoid most KV primitives (so as not to
rely on any parts of the system that are unavailable); instead we need to operate at the level of Gossip node
membership (for routing) and special RPC endpoints. The range report (Ranges endpoint) should be sufficient for the data
collection phase with some hardening and possibly the inclusion of some additional pieces of information. The active
recovery step can be carried out by a store-addressed recovery RPC that instructs the store to eject the in-memory
incarnations of the designated-survivor replicas from the Store, replacing them with placeholders owned by the recovery
RPC, which thus has free rein over the portion of the engines owned by these replicas (and can thus carry out
the same operations unsafe-remove-dead-replicas would).
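
As a very rough illustration, such a store-addressed recovery RPC might carry something like the following
(hypothetical message shapes, not existing CockroachDB RPC definitions):

```go
package main

import "fmt"

// Hypothetical message shapes for a store-addressed recovery RPC; these are not
// existing CockroachDB RPC definitions.
type RecoverReplicaRequest struct {
	StoreID int
	RangeID int
	// Serialized rewritten RangeDescriptor that makes the local replica the sole voter.
	NewDescriptor []byte
}

type RecoverReplicaResponse struct {
	Applied bool
	Error   string
}

func main() {
	req := RecoverReplicaRequest{StoreID: 3, RangeID: 42}
	// A handler would eject the in-memory replica, install a placeholder, and
	// rewrite the on-disk descriptor, much like unsafe-remove-dead-replicas does.
	fmt.Printf("recover r%d on s%d\n", req.RangeID, req.StoreID)
}
```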

We can also explore situations in which stages one and two refuse to carry out the recovery and provide solutions to at
least some of them. However, due to the nature of data loss, almost any internal invariant that KV or the SQL subsystem
maintains can be violated, so a "provably safe" recovery protocol is not the goal.

### Picking the optimal survivor

In the current proposal, in the presence of multiple surviving followers for a range we pick the one with the largest
committed index as the designated survivor. This isn’t always the best option - followers have a “lagging” view of what
is committed, so a more up-to-date follower could have a smaller committed index than another, less up-to-date follower.
Determining the optimal choice requires performing Raft log reconciliation, which requires adding the indexes of term
changes to the output of the data collection step. It seems awkward to duplicate parts of the Raft logic, and so we
caution against taking this step. The CommittedIndex approach will generally deliver near-optimal results. We could
consider picking the largest LastIndex instead. This too has the (theoretical) potential to be the wrong choice: if a
new leader takes over and aborts a large uncommitted log tail, we may pick the follower that has not learned about the
existence of this new leader yet, and would thus pick the divergent log tail, possibly not only missing writes but
outright committing a number of writes that never should have taken effect.

### Best effort consistent recovery to the state in the past

Stage four is a confluence of consistent and inconsistent recovery. By including resolved timestamp information in the
data collection step, we can stage a stop-the-world restart of the cluster which bootstraps a new cluster in-place while
preserving the user data and range split boundaries. (Certain exceptions apply, for example non-MVCC operations may
cause anomalies, and of course each part of the keyspace needs to have at least one replica remaining). This tool also
paves, in principle, a way for backing up and restoring a cluster without any data movement, using LSM snapshots
(suitably enhanced to atomically capture some extra state).

# Explain it to folk outside of your team

CockroachDB, being an ACID-compliant database, always prioritizes consistency of the data over availability. Internally
we rely on the Raft consensus protocol between replicas of the data to achieve low-level consistency. This means that if
consensus can't be achieved due to the permanent loss of nodes, the database cannot proceed any further. While this is a
desirable property, some clients prefer to relax it for cases where downtime is unacceptable or inconsistency could
be tolerated temporarily.

This proposal describes a possible solution based on the existing internal approach, and ways to make it generally
usable and less error-prone by adding automation and safeguards.

# Unresolved questions

What is the best way to collect information about replicas in the half-online and online approaches: should the CLI fan
out to all nodes as instructed, or should it delegate the fan-out to a single node? The tradeoff is that CLI fan-out
requires the operator to provide all node addresses to guarantee complete collection, but it works in all cases. Maybe
we can support both approaches, with CLI fan-out being a fallback if we can't collect from all nodes.
