Commit
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for and ultimately allow carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations during lateral data movement. See #12768 for details; this is a problem we want to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which has made a bug in our code related to learner snapshots much more likely; the offending upstream commit is patched out of the vendored etcd bump in this PR at the time of writing.

An antichronological listing of the individual commits follows. They should be reviewed individually, though it may be helpful to look at the overall diff for context. A modest amount of churn may exist between the commits, though a good deal of effort went into avoiding this.

storage: allow atomic replication changes in ChangeReplicas

They default to OFF.

This needs a lot more tests, which will be added separately in the course of switching the default to ON and will focus on the interactions of joint states with everything else in the system.

We'll also need another audit of consumers of the replica descriptors to make sure nothing was missed in the first pass.

Release note: None

storage: fix replicaGCQueue addition on removal trigger

Once we enter joint changes, the replica to be removed will show up in `crt.Removed()` when the joint state is entered, but it only becomes eligible for actual removal when we leave the joint state later. The new code triggers at the right time, namely when the replica is no longer in the descriptor.

Release note: None

storage: let execChangeReplicasTxn construct the descriptor

Prior to this commit, the method took both an old and a new desc *plus* slices of added and removed replicas.
This had grown organically, wasn't an easily understood interface, led to repetitive and tricky code at the callers, and most importantly isn't adequate any more in a world with atomic replication changes, where execChangeReplicasTxn in constructing the ChangeReplicasTrigger is essentially deciding whether a joint configuration needs to be entered (which in turn determines what the descriptor needs to look like in the first place).

To start solving this, let execChangeReplicasTxn create (and on success return) the new descriptor. Callers instead pass in what they want to be done, which is accomplished via an []internalReplicationChange slice.

Release note: None

roachpb: auto-assign ReplicaID during AddReplica

This is a cleanup leading up to a larger refactor of the contract around `execChangeReplicasTxn`.

Release note: None

storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

This prepares the trigger -> raft translation code to properly handle atomic replication changes.

This carries out a lot of validation to give us confidence that any unusual transitions would be caught quickly.

This change also establishes more clearly which added and removed replicas are to be passed into the trigger when transitioning into a joint configuration. For example, when adding a voter, one technically replaces a Learner with a VoterIncoming and so the question is which type the replica in the `added` slice should have.

Picking the Learner would give the trigger the most power to validate the input, but it's annoying to have divergent descriptors floating around, so by convention we say that it is always the updated version of the descriptor (i.e. for fully removed replicas, just whatever it was before it disappeared). I spent more time on this than I'm willing to admit, in particular looking into removing the redundancy here, but it made things more awkward than was worth it.

Release note: None

storage: push replication change unrolling into ChangeReplicas

There are various callers to ChangeReplicas, so it makes more sense to unroll at that level. The code was updated to, in principle, do the right thing when atomic replication changes are requested, except that they are still unimplemented and a fatal error will serve as a reminder of that. Of course nothing issues them yet.

Release note: None

storage: skip ApplyConfChange on rejected entry

When in a joint configuration, passing an empty conf change to ApplyConfChange doesn't do the right thing any more: it tells Raft that we're leaving the joint config. It's not a good idea to try to tell Raft anything about a ConfChange that got rejected. Raft internally knows that we handled it because it knows the applied index.

This also adds a case match for ConfChangeV2 which is necessary to route atomic replication changes (ConfChangeV2).

See etcd-io/etcd#11046

Release note: None

storage: un-embed decodedConfChange

I ate a number of NPEs during development because nullable embedded fields are tricky; they hide the pointer derefs that often need a nil check. We'll embed the fields of decodedConfChange instead which works out better. This commit also adds the unmarshaling code necessary for ConfChangeV2 needed once we issue atomic replication changes.

Release note: None

storage: add learners one by one

Doing more than one change at once is going to force us into an atomic replication change. This isn't crazy, but seems unnecessary at this point, so just add the learners one by one.

Release note: None

storage: add fatals where atomic conf changes are unsupported

These will be upgraded with proper handling when atomic replication changes are actually introduced, but for now it's convenient to stub out some code that will need to handle them and to make sure we won't forget to do so later.

Release note: None

storage: add atomic replication changes cluster setting

This defaults to false, and won't have an effect unless the newly introduced cluster version is also active.

Release note: None

roachpb: support zero-change ChangeReplicasTrigger

We will use a ChangeReplicasTrigger without additions and removals when transitioning out of a joint configuration, so make sure it supports this properly.

Release note: None

roachpb: return "desired" voters from ReplicaDescriptors.Voters

Previous commits introduced (as yet unused) voter types to encode joint consensus configurations which occur during atomic replication changes.

Access to the slice of replicas is unfortunately common, though at least it's compartmentalized via the getters Voters() and Learners().

The main problem solved in this commit is figuring out what should be returned from Voters(): is it all VoterX types, or only voters in one of the two majority configs that are part of a joint quorum?

The useful answer is returning the set of voters corresponding to what the config will be once the joint state is exited; this happens to be what most callers care about. Incoming and full voters are really the same thing in our code; we just need to distinguish them from outgoing voters to correctly maintain the quorum sizes.

Of course there are some callers that do care about quorum sizes, and a number of cleanups were made for them.

This commit also adds a ReplicaDescriptors.ConfState helper which is then used in all of the places that were previously cobbling together a ConfState manually.

Release note: None

roachpb: add ReplicaType_Voter{Incoming,Outgoing}

These are required for atomic replication changes to describe joint configurations, i.e. configurations consisting of two sets of replicas which both need to reach quorum to make replication decisions. An audit of existing consumers of this enum will follow.

Release note: None

roachpb: rename ReplicaType variants

The current naming is idiomatic for proto enums, but atypical for its usage in Go code. There is no `(gogoproto.customname)` that can fix this, and we're about to add more replica types that would require awkward names such as `roachpb.ReplicaType_VOTER_OUTGOING`.

Switch to a Go-friendly naming scheme instead.

Release note: None

batcheval: generalize checkNotLearnerReplica

This now errors out whenever the replica is not a voter, which is more robust as new replica types are introduced (which generally should not automatically become eligible to receive leases).

Release note: None

roachpb: improve RangeDescriptor.Validate

Make sure there isn't more than one replica per store.

Release note: None

roachpb: generalize ReplicaDescriptor.String()

The new code will generalize to new replica types.

Release note: None

[dnm] vendor: bump raft

This picks up upstream fixes related to atomic membership changes.

I had to smuggle in a small hack because we're picking up etcd-io/etcd#11037 which makes a race between the snapshot queue and the proactive learner snapshot much more likely, and this in turn makes tests quite flaky because it turns out that if the learner snap loses, it can actually error out.

Release note: None

storage: avoid fatal error from splitPostApply

This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for multi-node cockroach demo clusters.

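Illustration (not part of the changes listed above): the voter semantics described under `roachpb: return "desired" voters from ReplicaDescriptors.Voters` and `roachpb: add ReplicaType_Voter{Incoming,Outgoing}` can be sketched roughly as follows. Everything here is a hypothetical stand-in written for this note: the `ReplicaType` constants, `Replica`, `DesiredVoters`, `OutgoingConfigVoters`, and `JointQuorum` are assumed names, not the actual roachpb or storage API. The sketch only assumes what the messages state: incoming and full voters make up the configuration after the joint state is exited, and in a joint configuration both sets of voters need to reach quorum.

```go
package main

import "fmt"

// ReplicaType is a hypothetical stand-in for the replica types named in the
// commit messages; it is not the real roachpb enum.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	Learner
)

// Replica is a hypothetical, minimal replica descriptor.
type Replica struct {
	StoreID int
	Type    ReplicaType
}

// filter returns the replicas whose type is one of the given types.
func filter(rs []Replica, types ...ReplicaType) []Replica {
	var out []Replica
	for _, r := range rs {
		for _, t := range types {
			if r.Type == t {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

// DesiredVoters returns the voters of the configuration that will be in
// effect once the joint state is exited: full and incoming voters.
func DesiredVoters(rs []Replica) []Replica {
	return filter(rs, VoterFull, VoterIncoming)
}

// OutgoingConfigVoters returns the voters of the configuration that is being
// left: full and outgoing voters.
func OutgoingConfigVoters(rs []Replica) []Replica {
	return filter(rs, VoterFull, VoterOutgoing)
}

// majority reports whether the acknowledging stores form a strict majority
// of the given voters.
func majority(voters []Replica, acks map[int]bool) bool {
	n := 0
	for _, v := range voters {
		if acks[v.StoreID] {
			n++
		}
	}
	return n > len(voters)/2
}

// JointQuorum reports whether both voter sets reach a majority. Outside of a
// joint configuration the two sets coincide, so this is a plain majority.
func JointQuorum(rs []Replica, acks map[int]bool) bool {
	return majority(DesiredVoters(rs), acks) &&
		majority(OutgoingConfigVoters(rs), acks)
}

func main() {
	rs := []Replica{
		{1, VoterFull}, {2, VoterFull},
		{3, VoterOutgoing}, {4, VoterIncoming}, {5, Learner},
	}
	fmt.Println(len(DesiredVoters(rs)))                         // 3
	fmt.Println(JointQuorum(rs, map[int]bool{1: true, 2: true})) // true
	fmt.Println(JointQuorum(rs, map[int]bool{1: true, 4: true})) // false
}
```

In the last call, stores 1 and 4 are a majority of the desired voters (1, 2, 4) but not of the outgoing configuration (1, 2, 3), which is why a caller that cares about quorum sizes cannot rely on the desired voters alone.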
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>