raft: fix auto-transitioning out of joint config #11046
When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index. This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2). See etcd-io/etcd#11046 Release note: None
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for and ultimately allow carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations during lateral data movement. See #12768 for details; this is a problem we want to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which has made a bug in our code related to learner snapshots much more likely; the offending upstream commit is patched out of the vendored etcd bump in this PR at the time of writing.

An antichronological listing of the individual commits follows. They should be reviewed individually, though it may be helpful to look at the overall diff for overall context. A modest amount of churn may exist between the commits, though a good deal of effort went into avoiding this.

storage: allow atomic replication changes in ChangeReplicas

They default to OFF. This needs a lot more tests which will be added separately in the course of switching the default to ON and will focus on the interactions of joint states with everything else in the system. We'll also need another audit of consumers of the replica descriptors to make sure nothing was missed in the first pass.

Release note: None

storage: fix replicaGCQueue addition on removal trigger

Once we enter joint changes, the replica to be removed will show up in `crt.Removed()` when the joint state is entered, but it only becomes eligible for actual removal when we leave the joint state later. The new code triggers at the right time, namely when the replica is no longer in the descriptor.

Release note: None

storage: let execChangeReplicasTxn construct the descriptor

Prior to this commit, the method took both an old and a new desc *plus* slices of added and removed replicas. This had grown organically, wasn't an easily understood interface, led to repetitive and tricky code at the callers, and most importantly isn't adequate any more in a world with atomic replication changes, where execChangeReplicasTxn in constructing the ChangeReplicasTrigger is essentially deciding whether a joint configuration needs to be entered (which in turn determines what the descriptor needs to look like in the first place). To start solving this, let execChangeReplicasTxn create (and on success return) the new descriptor. Callers instead pass in what they want to be done, which is accomplished via an []internalReplicationChange slice.

Release note: None

roachpb: auto-assign ReplicaID during AddReplica

This is a cleanup leading up to a larger refactor of the contract around `execChangeReplicasTxn`.

Release note: None

storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

This prepares the trigger -> raft translation code to properly handle atomic replication changes. This carries out a lot of validation to give us confidence that any unusual transitions would be caught quickly.

This change also establishes more clearly which added and removed replicas are to be passed into the trigger when transitioning into a joint configuration. For example, when adding a voter, one technically replaces a Learner with a VoterIncoming and so the question is which type the replica in the `added` slice should have. Picking the Learner would give the trigger the most power to validate the input, but it's annoying to have divergent descriptors floating around, so by convention we say that it is always the updated version of the descriptor (i.e. for fully removed replicas, just whatever it was before it disappeared). I spent more time on this than I'm willing to admit, in particular looking into removing the redundancy here, but it made things more awkward than was worth it.

Release note: None

storage: push replication change unrolling into ChangeReplicas

There are various callers to ChangeReplicas, so it makes more sense to unroll at that level. The code was updated to - in principle - do the right thing when atomic replication changes are requested, except that they are still unimplemented and a fatal error will serve as a reminder of that. Of course nothing issues them yet.

Release note: None

storage: skip ApplyConfChange on rejected entry

When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index. This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2). See etcd-io/etcd#11046

Release note: None

storage: un-embed decodedConfChange

I ate a number of NPEs during development because nullable embedded fields are tricky; they hide the pointer derefs that often need a nil check. We'll embed the fields of decodedConfChange instead which works out better. This commit also adds the unmarshaling code necessary for ConfChangeV2 needed once we issue atomic replication changes.

Release note: None

storage: add learners one by one

Doing more than one change at once is going to force us into an atomic replication change. This isn't crazy, but seems unnecessary at this point, so just add the learners one by one.

Release note: None

storage: add fatals where atomic conf changes are unsupported

These will be upgraded with proper handling when atomic replication changes are actually introduced, but for now it's convenient to stub out some code that will need to handle them and to make sure we won't forget to do so later.

Release note: None

storage: add atomic replication changes cluster setting

This defaults to false, and won't have an effect unless the newly introduced cluster version is also active.

Release note: None

roachpb: support zero-change ChangeReplicasTrigger

We will use a ChangeReplicasTrigger without additions and removals when transitioning out of a joint configuration, so make sure it supports this properly.

Release note: None

roachpb: return "desired" voters from ReplicaDescriptors.Voters

Previous commits introduced (yet unused) voter types to encode joint consensus configurations which occur during atomic replication changes. Access to the slice of replicas is unfortunately common, though at least it's compartmentalized via the getters Voters() and Learners().

The main problem solved in this commit is figuring out what should be returned from Voters(): is it all VoterX types, or only voters in one of the two majority configs part of a joint quorum? The useful answer is returning the set of voters corresponding to what the config will be once the joint state is exited; this happens to be what most callers care about. Incoming and full voters are really the same thing in our code; we just need to distinguish them from outgoing voters to correctly maintain the quorum sizes.

Of course there are some callers that do care about quorum sizes, and a number of cleanups were made for them. This commit also adds a ReplicaDescriptors.ConfState helper which is then used in all of the places that were previously cobbling together a ConfState manually.

Release note: None

roachpb: add ReplicaType_Voter{Incoming,Outgoing}

These are required for atomic replication changes to describe joint configurations, i.e. configurations consisting of two sets of replicas which both need to reach quorum to make replication decisions. An audit of existing consumers of this enum will follow.

Release note: None

roachpb: rename ReplicaType variants

The current naming is idiomatic for proto enums, but atypical for its usage in Go code. There is no `(gogoproto.customname)` that can fix this, and we're about to add more replica types that would require awkward names such as `roachpb.ReplicaType_VOTER_OUTGOING`. Switch to a Go-friendly naming scheme instead.

Release note: None

batcheval: generalize checkNotLearnerReplica

This now errors out whenever the replica is not a voter, which is more robust as new replica types are introduced (which generally should not automatically become eligible to receive leases).

Release note: None

roachpb: improve RangeDescriptor.Validate

Make sure there isn't more than one replica per store.

Release note: None

roachpb: generalize ReplicaDescriptor.String()

The new code will generalize to new replica types.

Release note: None

[dnm] vendor: bump raft

This picks up upstream fixes related to atomic membership changes. I had to smuggle in a small hack because we're picking up etcd-io/etcd#11037 which makes a race between the snapshot queue and the proactive learner snapshot much more likely, and this in turn makes tests quite flaky because it turns out that if the learner snap loses, it can actually error out.

Release note: None

storage: avoid fatal error from splitPostApply

This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for multi-node cockroach demo clusters.
Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Rohan Yadav <[email protected]>
raft/raft.go
@@ -1607,16 +1606,21 @@ func (r *raft) abortLeaderTransfer() {
// If the new entries would exceed the limit, the method returns false. If not,
// the increase in uncommitted entry size is recorded and the method returns
// true.
// Configuration changes are never refused.
Instead of concerning ourselves with the size of the payload and trying to enforce this by allowing zero-byte payloads, could we just check the entry's type directly? We may need to change this condition to "Configuration changes that are alone are never refused".
Or is it the case that we also want to allow the empty entry that new leaders propose and this seems like the cleanest way to do it? If that's the case then "Configuration changes are never refused" isn't accurate, right? What we actually want to say is that "entries without a payload are never refused".
Yes, in both cases it's the empty payload. I clarified the comment.
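For readers following the thread, here is a minimal sketch of the rule being discussed: an append whose total payload size is zero is never refused, which covers both a new leader's empty entry and an empty conf change. The `entry` and `tracker` types are hypothetical stand-ins and not the library's own types; only the name `increaseUncommittedSize` comes from the discussion above.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a Raft log entry; only the payload
// matters for this sketch.
type entry struct {
	data []byte
}

// tracker is a hypothetical stand-in for the bookkeeping that limits the
// size of the uncommitted tail of the log.
type tracker struct {
	uncommittedSize    uint64
	maxUncommittedSize uint64
}

// increaseUncommittedSize reports whether ents may be appended. Appends with
// a total payload size of zero are always accepted, as is any append while
// the uncommitted tail is empty.
func (t *tracker) increaseUncommittedSize(ents []entry) bool {
	var s uint64
	for _, e := range ents {
		s += uint64(len(e.data))
	}
	if t.uncommittedSize > 0 && s > 0 && t.uncommittedSize+s > t.maxUncommittedSize {
		return false
	}
	t.uncommittedSize += s
	return true
}

func main() {
	t := &tracker{maxUncommittedSize: 8}
	// Uncommitted tail is empty, so this is accepted and fills the budget.
	fmt.Println(t.increaseUncommittedSize([]entry{{data: make([]byte, 8)}}))
	// One more byte would exceed the limit, so this is refused.
	fmt.Println(t.increaseUncommittedSize([]entry{{data: []byte("x")}}))
	// An entry without a payload has size zero and is accepted.
	fmt.Println(t.increaseUncommittedSize([]entry{{data: nil}}))
}
```

Keying the exemption on the payload size rather than on the entry type is what lets one condition cover both cases raised in the review.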
Shall we push this PR forward? The issue mentioned in the PR needs to be fixed.
@xiang90 yep and sorry for the radio silence. It will happen.
The test was supposed to end earlier, but some old copy pasta survived.
The code doing so was undertested and buggy: it would launch multiple attempts to transition out when the conf change was not the last element in the log. This commit fixes the problem and adds a regression test. It also reworks the code to handle a formerly untested edge case, in which the auto-transition append is refused. This can't happen any more with the current version of the code because this proposal has size zero and is special-cased in increaseUncommittedSize. Last but not least, the auto-leave proposal now also bumps pendingConfIndex, which was not done previously due to an oversight.
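A toy model of the behavior described in this commit message, under stated assumptions: the `node` type and all of its fields other than `pendingConfIndex` are hypothetical, and the guard shown is one plausible way to get "propose exactly once", not a copy of the actual patch. The leave entry is proposed only when the applied index crosses `pendingConfIndex`, and the proposal bumps `pendingConfIndex` so later applies do not trigger it again.

```go
package main

import "fmt"

// node is a hypothetical, heavily simplified stand-in for a Raft leader.
type node struct {
	applied          uint64 // highest applied log index
	lastIndex        uint64 // index of the last entry in the log
	pendingConfIndex uint64 // index of the most recent conf change entry
	autoLeave        bool   // the joint config should be left automatically
	proposals        int    // auto-leave entries appended so far (demo only)
}

// appliedTo advances the applied index. The auto-leave entry is appended
// only by the call that crosses pendingConfIndex, and appending it moves
// pendingConfIndex to the new entry.
func (n *node) appliedTo(index uint64) {
	oldApplied := n.applied
	n.applied = index
	if n.autoLeave && oldApplied < n.pendingConfIndex && index >= n.pendingConfIndex {
		n.lastIndex++ // the empty leave entry goes at the end of the log
		n.pendingConfIndex = n.lastIndex
		n.proposals++
	}
}

// applyLeave models the app applying the leave entry, after which the
// config is no longer joint. This step is an assumption of the model.
func (n *node) applyLeave() {
	n.autoLeave = false
}

func main() {
	// The conf change sits at index 5 and is followed by entries 6 and 7,
	// so it is not the last element in the log.
	n := &node{applied: 4, lastIndex: 7, pendingConfIndex: 5, autoLeave: true}
	for _, idx := range []uint64{5, 6, 7} {
		n.appliedTo(idx)
	}
	fmt.Println(n.proposals, n.pendingConfIndex)

	// Applying the leave entry itself does not trigger another proposal.
	n.applyLeave()
	n.appliedTo(8)
	fmt.Println(n.proposals)
}
```

In this model, a guard that only checked `index >= pendingConfIndex` without bumping `pendingConfIndex` would append a leave entry on each of the three applies, which matches the symptom the commit message describes.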
Apps typically maintain the raft configuration as part of the state machine. As a result, they want to be able to reject configuration change entries at apply time based on the state on which the entry is supposed to be applied. When this happens, the app should not call ApplyConfChange, but the comments did not make this clear. As a result, it was tempting to pass an empty pb.ConfChange or its V2 version instead of not calling ApplyConfChange. However, neither an empty V1 nor an empty V2 proto is a noop when the configuration is joint: an empty V1 change is treated internally as a single configuration change for NodeID zero and will cause a panic when applied in a joint state. An empty V2 proto is treated as a signal to leave a joint state, which means that the app's config and raft's would diverge. The comments updated in this commit now ask users to not call ApplyConfChange when they reject a conf change. Apps that never use joint consensus can keep their old behavior since the distinction only matters when in a joint state, but we don't want to encourage that.
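To make the divergence concrete, here is a small self-contained model. `confChange`, `fakeRaft` and both reject helpers are hypothetical stand-ins that encode only the one behavior described above (an empty V2 change applied in a joint state means "leave joint"); they are not the library's API.

```go
package main

import "fmt"

// confChange is a hypothetical stand-in for a V2 conf change. The zero
// value (no changes) models the empty proto.
type confChange struct {
	changes []string
}

// fakeRaft models a single piece of behavior: in a joint state, an empty
// conf change is taken as the signal to leave the joint state.
type fakeRaft struct {
	joint bool
}

func (r *fakeRaft) applyConfChange(cc confChange) {
	if r.joint && len(cc.changes) == 0 {
		r.joint = false
	}
	// Handling of non-empty changes is omitted from this sketch.
}

// rejectWithEmpty is the tempting anti-pattern: report a rejected conf
// change to raft by applying an empty one.
func rejectWithEmpty(r *fakeRaft) {
	r.applyConfChange(confChange{})
}

// rejectBySkipping is the recommended handling: do not call
// applyConfChange at all for a rejected entry.
func rejectBySkipping(r *fakeRaft) {}

func main() {
	appJoint := true // the app rejected the change, so its config stays joint

	r1 := &fakeRaft{joint: true}
	rejectWithEmpty(r1)
	fmt.Println("in sync after empty change:", appJoint == r1.joint)

	r2 := &fakeRaft{joint: true}
	rejectBySkipping(r2)
	fmt.Println("in sync after skipping:", appJoint == r2.joint)
}
```

The first case ends with raft out of the joint state while the app still believes it is joint; the second keeps both views aligned.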
dd5843d to 0544f33
Codecov Report
@@ Coverage Diff @@
## master #11046 +/- ##
==========================================
- Coverage 66.65% 65.78% -0.87%
==========================================
Files 401 401
Lines 36629 36627 -2
==========================================
- Hits 24414 24096 -318
- Misses 10715 11030 +315
- Partials 1500 1501 +1