storage: use raft learners in replica addition, defaulted off
A learner is a participant in a raft group that accepts messages but doesn't vote. This means it doesn't affect raft quorum and thus doesn't affect the fragility of the range, even if it's very far behind or many learners are down.

At the time of writing, learners are used in CockroachDB as an interim state while adding a replica. A learner replica is added to the range via raft ConfChange, a raft snapshot (of type LEARNER) is sent to catch it up, and then a second ConfChange promotes it to a full replica. This means that learners are currently always expected to have a short lifetime, approximately the time it takes to send a snapshot. Ideas have been kicked around to use learners with follower reads, which could be a cheap way to allow many geographies to have local reads without affecting write latencies. If implemented, these learners would have long lifetimes.

For simplicity, CockroachDB treats learner replicas the same as voter replicas as much as possible, but there are a few exceptions:

- Learner replicas are not considered when calculating quorum size, and thus do not affect the computation of which ranges are under-replicated for upreplication/alerting/debug/etc purposes. Ditto for over-replicated.
- Learner replicas cannot become raft leaders, so we also don't allow them to become leaseholders. As a result, DistSender and the various oracles don't try to send them traffic.
- The raft snapshot queue does not send snapshots to learners for reasons described below.
- Merges won't run while a learner replica is present.

Replicas are now added in two ConfChange transactions. The first creates the learner and the second promotes it to a voter. If the node that is coordinating this dies in the middle, we're left with an orphaned learner. For this reason, the replicate queue always first removes any learners it sees before doing anything else. We could instead try to finish off the learner snapshot and promotion, but this is more complicated and it's not yet clear the efficiency win is worth it. This introduces some rare races between the replicate queue and AdminChangeReplicas or if a range's lease is moved to a new owner while the old leaseholder is still processing it in the replicate queue. These races are handled by retrying if a learner disappears during the snapshot/promotion.

If the coordinator otherwise encounters an error while sending the learner snapshot or promoting it (which can happen for a number of reasons, including the node getting the learner going away), it tries to clean up after itself by rolling back the addition of the learner.

There is another race between the learner snapshot being sent and the raft snapshot queue happening to check the replica at the same time, also sending it a snapshot. This is safe but wasteful, so the raft snapshot queue won't try to send snapshots to learners.

Merges are blocked if either side has a learner (to avoid working out the edge cases) but it's historically turned out to be a bad idea to get in the way of splits, so we allow them even when some of the replicas are learners. This orphans a learner on each side of the split (the original coordinator will not be able to finish either of them), but the replication queue will eventually clean them up.

Learner replicas don't affect quorum but they do affect the system in other ways. The most obvious way is that the leader sends them the raft traffic it would send to any follower, consuming resources. More surprising is that once the learner has received a snapshot, it's considered by the quota pool that prevents the raft leader from getting too far ahead of the followers. This is because a learner (especially one that already has a snapshot) is expected to very soon be a voter, so we treat it like one. However, it means a slow learner can slow down regular traffic, which is possibly counterintuitive.
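
As an illustration only, the two-step addition described above can be sketched in a few lines of Go. This is a toy, single-threaded model and not the code in this commit: `Range`, `Replica`, `AddReplica`, `RemoveLearners`, `Quorum` and `ErrLearnerGone` are hypothetical names, and the `sendSnapshot` callback stands in for the LEARNER snapshot. It shows learners being ignored for quorum, the rollback on a failed snapshot, and the retryable error when the learner disappears before promotion.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaType distinguishes voting replicas from non-voting learners.
type ReplicaType int

const (
	Voter ReplicaType = iota
	Learner
)

// Replica and Range are hypothetical, heavily simplified descriptors.
type Replica struct {
	NodeID int
	Type   ReplicaType
}

type Range struct {
	Replicas []Replica
}

// ErrLearnerGone signals that the learner vanished before promotion,
// e.g. because something else removed it; the caller may retry.
var ErrLearnerGone = errors.New("learner disappeared before promotion")

// Quorum counts voters only, so adding a learner never changes it.
func (r *Range) Quorum() int {
	voters := 0
	for _, rep := range r.Replicas {
		if rep.Type == Voter {
			voters++
		}
	}
	return voters/2 + 1
}

// filter keeps only the replicas for which keep returns true.
func (r *Range) filter(keep func(Replica) bool) {
	kept := r.Replicas[:0]
	for _, rep := range r.Replicas {
		if keep(rep) {
			kept = append(kept, rep)
		}
	}
	r.Replicas = kept
}

// RemoveLearners drops every learner, modeling the cleanup of orphans
// that the replicate queue performs before doing anything else.
func (r *Range) RemoveLearners() {
	r.filter(func(rep Replica) bool { return rep.Type != Learner })
}

// AddReplica models the two changes: add a learner, catch it up with a
// snapshot, then promote it to a voter. A failed snapshot rolls back.
func (r *Range) AddReplica(nodeID int, sendSnapshot func(nodeID int) error) error {
	// First change: the new replica joins as a learner.
	r.Replicas = append(r.Replicas, Replica{NodeID: nodeID, Type: Learner})

	if err := sendSnapshot(nodeID); err != nil {
		// Roll back the learner we just added.
		r.filter(func(rep Replica) bool {
			return !(rep.NodeID == nodeID && rep.Type == Learner)
		})
		return fmt.Errorf("snapshot to n%d failed, learner rolled back: %w", nodeID, err)
	}

	// Second change: promote the learner, if it is still there.
	for i := range r.Replicas {
		if r.Replicas[i].NodeID == nodeID && r.Replicas[i].Type == Learner {
			r.Replicas[i].Type = Voter
			return nil
		}
	}
	return ErrLearnerGone
}
```

Calling `AddReplica` on a three-voter range leaves `Quorum()` at 2 while the snapshot is in flight and raises it to 3 only after promotion. A real implementation would perform each step as a replicated configuration change rather than mutating a slice.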
Release note (general change): Replicas are now added using a raft learner and going through the normal raft snapshot process to catch them up, eliminating technical debt. No user facing changes are expected.
Showing 27 changed files with 1,189 additions and 105 deletions.