Merge #39936 #40221
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for and ultimately
allow carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations
during lateral data movement. See #12768 for details; this is a problem we want
to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which
has made a bug in our code related to learner snapshots much more likely; the
offending upstream commit is patched out of the vendored etcd bump in this PR
at the time of writing.

A reverse-chronological listing of the individual commits follows. They should
be reviewed individually, though it may be helpful to look at the combined diff
for context. A modest amount of churn may exist between the commits, though a
good deal of effort went into avoiding this.

    storage: allow atomic replication changes in ChangeReplicas

    They default to OFF.

    This needs a lot more tests which will be added separately in the course of
    switching the default to ON and will focus on the interactions of joint
    states with everything else in the system.

    We'll also need another audit of consumers of the replica descriptors to
    make sure nothing was missed in the first pass.

    Release note: None

    storage: fix replicaGCQueue addition on removal trigger

    Once we enter joint changes, the replica to be removed will show up in
    `crt.Removed()` when the joint state is entered, but it only becomes
    eligible for actual removal when we leave the joint state later. The new
    code triggers at the right time, namely when the replica is no longer in
    the descriptor.

    Release note: None

    storage: let execChangeReplicasTxn construct the descriptor

    Prior to this commit, the method took both an old and a new desc *plus*
    slices of added and removed replicas. This had grown organically, wasn't an
    easily understood interface, led to repetitive and tricky code at the
    callers, and most importantly isn't adequate any more in a world with
    atomic replication changes, where execChangeReplicasTxn in constructing the
    ChangeReplicasTrigger is essentially deciding whether a joint configuration
    needs to be entered (which in turn determines what the descriptor needs to
    look like in the first place). To start solving this, let
    execChangeReplicasTxn create (and on success return) the new descriptor.
    Callers instead pass in what they want to be done, which is accomplished
    via an []internalReplicationChange slice.

    Release note: None
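The new contract can be sketched as follows. All types and names below are simplified stand-ins for illustration, not the actual `pkg/storage` API: callers describe the changes they want, and the txn helper constructs (and on success returns) the new descriptor.

```go
package main

import "fmt"

// Simplified stand-in for []internalReplicationChange: the caller says
// what should happen, not what the resulting descriptor looks like.
type changeType int

const (
	addLearner changeType = iota
	removeReplica
)

type replicationChange struct {
	store string
	typ   changeType
}

// execChangeReplicasTxn-style helper: old descriptor in, new descriptor out.
func applyChanges(desc []string, chgs []replicationChange) []string {
	out := append([]string(nil), desc...)
	for _, c := range chgs {
		switch c.typ {
		case addLearner:
			out = append(out, c.store)
		case removeReplica:
			for i, r := range out {
				if r == c.store {
					out = append(out[:i], out[i+1:]...)
					break
				}
			}
		}
	}
	return out
}

func main() {
	desc := []string{"s1", "s2"}
	chgs := []replicationChange{
		{store: "s3", typ: addLearner},
		{store: "s2", typ: removeReplica},
	}
	fmt.Println(applyChanges(desc, chgs)) // [s1 s3]
}
```

Centralizing descriptor construction here matters because, with atomic changes, this is exactly the place that decides whether a joint configuration must be entered.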

    roachpb: auto-assign ReplicaID during AddReplica

    This is a cleanup leading up to a larger refactor of the contract around
    `execChangeReplicasTxn`.

    Release note: None

    storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

    This prepares the trigger -> raft translation code to properly handle
    atomic replication changes.

    This carries out a lot of validation to give us confidence that any unusual
    transitions would be caught quickly.

    This change also establishes more clearly which added and removed replicas
    are to be passed into the trigger when transitioning into a joint
    configuration. For example, when adding a voter, one technically replaces a
    Learner with a VoterIncoming and so the question is which type the replica
    in the `added` slice should have.  Picking the Learner would give the
    trigger the most power to validate the input, but it's annoying to have
    divergent descriptors floating around, so by convention we say that it is
    always the updated version of the descriptor (i.e. for fully removed
    replicas, just whatever it was before it disappeared). I spent more time on
    this than I'm willing to admit, in particular looking at removing the
    redundancy here, but it made things more awkward than was worth it.

    Release note: None
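A rough sketch of the decision this translation makes (heavily simplified; the real code validates far more): a single addition or removal can be expressed as a legacy conf change, while anything larger, such as an add-plus-remove rebalance, must enter a joint configuration via ConfChangeV2.

```go
package main

import "fmt"

// Illustrative only: the real trigger -> raft translation inspects the
// replica types in the descriptor, not just change counts.
func needsJointConfig(added, removed int) bool {
	return added+removed > 1
}

func main() {
	fmt.Println(needsJointConfig(1, 0)) // simple addition: no joint config
	fmt.Println(needsJointConfig(1, 1)) // rebalance (add+remove): joint config
}
```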

    storage: push replication change unrolling into ChangeReplicas

    There are various callers to ChangeReplicas, so it makes more sense to
    unroll at that level. The code was updated to - in principle - do the right
    thing when atomic replication changes are requested, except that they are
    still unimplemented and a fatal error will serve as a reminder of that. Of
    course nothing issues them yet.

    Release note: None

    storage: skip ApplyConfChange on rejected entry

    When in a joint configuration, passing an empty conf change to
    ApplyConfChange doesn't do the right thing any more: it tells Raft that
    we're leaving the joint config. It's not a good idea to try to tell Raft
    anything about a ConfChange that got rejected. Raft internally knows that
    we handled it because it knows the applied index.

    This also adds a case match for ConfChangeV2, which is necessary to route
    atomic replication changes.

    See etcd-io/etcd#11046

    Release note: None
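The shape of the fix, as a sketch (function and return values are illustrative, not the real replica apply code): only report a conf change to Raft when the command actually applied; a rejected command is simply not reported, since Raft tracks the applied index itself.

```go
package main

import "fmt"

// Previously a rejected command was reported as an empty conf change,
// which, while in a joint configuration, Raft interprets as "leave the
// joint config" - exactly the wrong thing.
func handleConfChangeEntry(rejected bool, applyToRaft func()) string {
	if rejected {
		// Raft already knows this entry was handled via the applied index.
		return "skipped ApplyConfChange"
	}
	applyToRaft()
	return "applied"
}

func main() {
	fmt.Println(handleConfChangeEntry(true, func() {}))
	fmt.Println(handleConfChangeEntry(false, func() {}))
}
```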

    storage: un-embed decodedConfChange

    I ate a number of NPEs during development because nullable embedded fields
    are tricky; they hide the pointer derefs that often need a nil check. We'll
    embed the fields of decodedConfChange instead which works out better. This
    commit also adds the unmarshaling code necessary for ConfChangeV2 needed
    once we issue atomic replication changes.

    Release note: None
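The hazard being avoided is a general Go one, shown here with toy types (not the real decodedConfChange): embedding a *pointer* promotes its fields, so a read compiles fine but panics when the embedded pointer is nil, whereas plain (un-embedded) fields make the zero value safe.

```go
package main

import "fmt"

type decoded struct{ changes int }

// Embedding a pointer hides the deref: e.changes is really (*e.decoded).changes.
type entryEmbedded struct{ *decoded }

// Un-embedded fields: the zero value reads as zero, no nil check needed.
type entryPlain struct{ changes int }

func main() {
	var e entryEmbedded
	func() {
		defer func() {
			if recover() != nil {
				fmt.Println("NPE: nil embedded pointer")
			}
		}()
		_ = e.changes // panics: e.decoded is nil
	}()
	var p entryPlain
	fmt.Println(p.changes) // zero value, no panic
}
```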

    storage: add learners one by one

    Doing more than one change at once is going to force us into an atomic
    replication change. This isn't crazy, but seems unnecessary at this point,
    so just add the learners one by one.

    Release note: None

    storage: add fatals where atomic conf changes are unsupported

    These will be upgraded with proper handling when atomic replication changes
    are actually introduced, but for now it's convenient to stub out some code
    that will need to handle them and to make sure we won't forget to do so
    later.

    Release note: None

    storage: add atomic replication changes cluster setting

    This defaults to false, and won't have an effect unless the newly
    introduced cluster version is also active.

    Release note: None
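The gating logic amounts to a conjunction, sketched here with illustrative names (the real code consults the `kv.atomic_replication_changes.enabled` setting and the cluster version machinery):

```go
package main

import "fmt"

// Atomic replication changes run only when the cluster setting is on
// AND the newly introduced cluster version is active.
func atomicChangesAllowed(settingOn, versionActive bool) bool {
	return settingOn && versionActive
}

func main() {
	fmt.Println(atomicChangesAllowed(true, false)) // setting on, version inactive: off
	fmt.Println(atomicChangesAllowed(true, true))  // both satisfied: on
}
```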

    roachpb: support zero-change ChangeReplicasTrigger

    We will use a ChangeReplicasTrigger without additions and removals when
    transitioning out of a joint configuration, so make sure it supports this
    properly.

    Release note: None
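A sketch of the convention (fields simplified from the real ChangeReplicasTrigger): a trigger carrying no additions and no removals is the signal for transitioning out of a joint configuration.

```go
package main

import "fmt"

type changeTrigger struct {
	added, removed []int // store IDs, illustrative
}

// A zero-change trigger means "leave the joint config".
func (t changeTrigger) leavesJoint() bool {
	return len(t.added) == 0 && len(t.removed) == 0
}

func main() {
	fmt.Println(changeTrigger{}.leavesJoint())                // true
	fmt.Println(changeTrigger{added: []int{4}}.leavesJoint()) // false
}
```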

    roachpb: return "desired" voters from ReplicaDescriptors.Voters

    Previous commits introduced (yet unused) voter types to encode joint
    consensus configurations which occur during atomic replication changes.

    Access to the slice of replicas is unfortunately common, though at least
    it's compartmentalized via the getters Voters() and Learners().

    The main problem solved in this commit is figuring out what should be
    returned from Voters(): is it all VoterX types, or only the voters in one
    of the two majority configurations that make up a joint quorum?

    The useful answer is returning the set of voters corresponding to what the
    config will be once the joint state is exited; this happens to be what most
    callers care about. Incoming and full voters are really the same thing in
    our code; we just need to distinguish them from outgoing voters to
    correctly maintain the quorum sizes.

    Of course there are some callers that do care about quorum sizes, and a
    number of cleanups were made for them.

    This commit also adds a ReplicaDescriptors.ConfState helper which is then
    used in all of the places that were previously cobbling together a
    ConfState manually.

    Release note: None
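The "desired voters" rule can be sketched like this (simplified enum, not the real roachpb types): Voters() returns the configuration as it will be once the joint state is exited, i.e. full and incoming voters, but not outgoing ones.

```go
package main

import "fmt"

type replicaType int

const (
	voterFull replicaType = iota
	voterIncoming
	voterOutgoing
	learner
)

// voters mirrors the "desired set" convention: incoming and full voters
// are the same thing for most callers; outgoing voters are excluded.
func voters(types []replicaType) []replicaType {
	var out []replicaType
	for _, t := range types {
		if t == voterFull || t == voterIncoming {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Joint config while replacing a voter: one full, one incoming,
	// one outgoing, plus a learner.
	got := voters([]replicaType{voterFull, voterIncoming, voterOutgoing, learner})
	fmt.Println(len(got)) // 2
}
```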

    roachpb: add ReplicaType_Voter{Incoming,Outgoing}

    These are required for atomic replication changes to describe joint
    configurations, i.e. configurations consisting of two sets of replicas
    which both need to reach quorum to make replication decisions.

    An audit of existing consumers of this enum will follow.

    Release note: None
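The quorum rule these voter types encode can be sketched as follows (names are illustrative): while in a joint configuration, a replication decision needs a majority in BOTH the incoming and the outgoing voter sets.

```go
package main

import "fmt"

func majority(n int) int { return n/2 + 1 }

// jointQuorum: a joint decision requires majorities in both configs.
func jointQuorum(ackedIncoming, incoming, ackedOutgoing, outgoing int) bool {
	return ackedIncoming >= majority(incoming) &&
		ackedOutgoing >= majority(outgoing)
}

func main() {
	// Replacing one voter of three: incoming {1,2,4}, outgoing {1,2,3}.
	fmt.Println(jointQuorum(2, 3, 2, 3)) // true: majority in both sets
	fmt.Println(jointQuorum(3, 3, 1, 3)) // false: outgoing majority missing
}
```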

    roachpb: rename ReplicaType variants

    The current naming is idiomatic for proto enums, but atypical for its usage
    in Go code. There is no `(gogoproto.customname)` that can fix this, and
    we're about to add more replica types that would require awkward names such
    as `roachpb.ReplicaType_VOTER_OUTGOING`.

    Switch to a Go-friendly naming scheme instead.

    Release note: None

    batcheval: generalize checkNotLearnerReplica

    This now errors out whenever the replica is not a voter, which is more
    robust as new replica types are introduced (which generally should not
    automatically become eligible to receive leases).

    Release note: None

    roachpb: improve RangeDescriptor.Validate

    Make sure there isn't more than one replica per store.

    Release note: None
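The check amounts to a uniqueness constraint over store IDs; a minimal sketch with simplified types (not the real roachpb.RangeDescriptor.Validate):

```go
package main

import "fmt"

type replica struct{ nodeID, storeID int }

// validate rejects descriptors with more than one replica per store.
func validate(replicas []replica) error {
	seen := map[int]bool{}
	for _, r := range replicas {
		if seen[r.storeID] {
			return fmt.Errorf("duplicate replica on store s%d", r.storeID)
		}
		seen[r.storeID] = true
	}
	return nil
}

func main() {
	fmt.Println(validate([]replica{{1, 1}, {2, 2}}))        // <nil>
	fmt.Println(validate([]replica{{1, 1}, {1, 1}}) != nil) // true
}
```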

    roachpb: generalize ReplicaDescriptor.String()

    The new code will generalize to new replica types.

    Release note: None

    [dnm] vendor: bump raft

    This picks up upstream fixes related to atomic membership changes.

    I had to smuggle in a small hack because we're picking up
    etcd-io/etcd#11037, which makes a race between the
    snapshot queue and the proactive learner snapshot much more likely. This
    in turn makes tests quite flaky because it turns out that if the learner
    snap loses the race, it can actually error out.

    Release note: None

    storage: avoid fatal error from splitPostApply

    This is the next band-aid on top of #39658 and #39571. The descriptor
    lookup I added sometimes fails because replicas can process a split trigger
    in which they're not a member of the range:

    > F190821 15:14:28.241623 312191 storage/store.go:2172
    > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
    > found in right hand side of split

    I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

    Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for
multi-node cockroach demo clusters.

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
3 people committed Aug 26, 2019
3 parents addd789 + 371e289 + 9cff5fd commit 7da3121
Showing 43 changed files with 1,656 additions and 568 deletions.
7 changes: 4 additions & 3 deletions Gopkg.lock

Some generated files are not rendered by default.

8 changes: 7 additions & 1 deletion Gopkg.toml
@@ -38,7 +38,13 @@ ignored = [

[[constraint]]
name = "go.etcd.io/etcd"
branch = "master"
# We're stopping just shy of 4a2b4c8f7e0a3754fdd5a3341a27c2431c2a5385
# which picks up a fix to an inefficiency that at the time of writing
# triggers a bug:
# https://github.com/cockroachdb/cockroach/issues/40207
#
# branch = "master"
revision = "9b29151d3072511f574e7272a5348504086013fa"

# Used for the API client; we want the latest.
[[constraint]]
2 changes: 2 additions & 0 deletions c-deps/libroach/protos/roachpb/metadata.pb.cc


8 changes: 5 additions & 3 deletions c-deps/libroach/protos/roachpb/metadata.pb.h


3 changes: 2 additions & 1 deletion docs/generated/settings/settings.html
@@ -27,6 +27,7 @@
<tr><td><code>kv.allocator.load_based_rebalancing</code></td><td>enumeration</td><td><code>leases and replicas</code></td><td>whether to rebalance based on the distribution of QPS across stores [off = 0, leases = 1, leases and replicas = 2]</td></tr>
<tr><td><code>kv.allocator.qps_rebalance_threshold</code></td><td>float</td><td><code>0.25</code></td><td>minimum fraction away from the mean a store's QPS (such as queries per second) can be before it is considered overfull or underfull</td></tr>
<tr><td><code>kv.allocator.range_rebalance_threshold</code></td><td>float</td><td><code>0.05</code></td><td>minimum fraction away from the mean a store's range count can be before it is considered overfull or underfull</td></tr>
<tr><td><code>kv.atomic_replication_changes.enabled</code></td><td>boolean</td><td><code>false</code></td><td>use atomic replication changes</td></tr>
<tr><td><code>kv.bulk_ingest.batch_size</code></td><td>byte size</td><td><code>16 MiB</code></td><td>the maximum size of the payload in an AddSSTable request</td></tr>
<tr><td><code>kv.bulk_ingest.buffer_increment</code></td><td>byte size</td><td><code>32 MiB</code></td><td>the size by which the BulkAdder attempts to grow its buffer before flushing</td></tr>
<tr><td><code>kv.bulk_ingest.index_buffer_size</code></td><td>byte size</td><td><code>32 MiB</code></td><td>the initial size of the BulkAdder buffer handling secondary index imports</td></tr>
@@ -129,6 +130,6 @@
<tr><td><code>trace.debug.enable</code></td><td>boolean</td><td><code>false</code></td><td>if set, traces for recent requests can be seen in the /debug page</td></tr>
<tr><td><code>trace.lightstep.token</code></td><td>string</td><td><code></code></td><td>if set, traces go to Lightstep using this token</td></tr>
<tr><td><code>trace.zipkin.collector</code></td><td>string</td><td><code></code></td><td>if set, traces go to the given Zipkin instance (example: '127.0.0.1:9411'); ignored if trace.lightstep.token is set</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-8</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-9</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
</tbody>
</table>
21 changes: 21 additions & 0 deletions pkg/cli/demo.go
@@ -18,6 +18,7 @@ import (

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/cli/cliflags"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/security"
"github.com/cockroachdb/cockroach/pkg/server"
"github.com/cockroachdb/cockroach/pkg/sql/sqlbase"
@@ -51,6 +52,20 @@ to avoid pre-loading a dataset.`,
const defaultGeneratorName = "movr"

var defaultGenerator workload.Generator
var defaultLocalities = []roachpb.Locality{
// Default localities for a 3 node cluster
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-east1"}, {Key: "az", Value: "b"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-east1"}, {Key: "az", Value: "c"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-east1"}, {Key: "az", Value: "d"}}},
// Default localities for a 6 node cluster
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-west1"}, {Key: "az", Value: "a"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-west1"}, {Key: "az", Value: "b"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "us-west1"}, {Key: "az", Value: "c"}}},
// Default localities for a 9 node cluster
{Tiers: []roachpb.Tier{{Key: "region", Value: "europe-west1"}, {Key: "az", Value: "b"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "europe-west1"}, {Key: "az", Value: "c"}}},
{Tiers: []roachpb.Tier{{Key: "region", Value: "europe-west1"}, {Key: "az", Value: "d"}}},
}

func init() {
for _, meta := range workload.Registered() {
@@ -90,12 +105,18 @@ func setupTransientServers(
return "", "", cleanup, errors.Errorf("must have a positive number of nodes")
}

// The user specified some localities for their nodes.
if len(demoCtx.localities) != 0 {
// Error out if localities don't line up with requested node
// count before doing any sort of setup.
if len(demoCtx.localities) != demoCtx.nodes {
return "", "", cleanup, errors.Errorf("number of localities specified must equal number of nodes")
}
} else {
demoCtx.localities = make([]roachpb.Locality, demoCtx.nodes)
for i := 0; i < demoCtx.nodes; i++ {
demoCtx.localities[i] = defaultLocalities[i%len(defaultLocalities)]
}
}

// Set up logging. For demo/transient server we use non-standard
25 changes: 10 additions & 15 deletions pkg/kv/dist_sender_test.go
@@ -310,9 +310,10 @@ func TestSendRPCOrder(t *testing.T) {
}

descriptor := roachpb.RangeDescriptor{
StartKey: roachpb.RKeyMin,
EndKey: roachpb.RKeyMax,
RangeID: rangeID,
StartKey: roachpb.RKeyMin,
EndKey: roachpb.RKeyMax,
RangeID: rangeID,
NextReplicaID: 1,
}
for i := int32(1); i <= 5; i++ {
addr := util.MakeUnresolvedAddr("tcp", fmt.Sprintf("node%d:1", i))
@@ -326,10 +327,7 @@
if err := g.AddInfoProto(gossip.MakeNodeIDKey(roachpb.NodeID(i)), nd, time.Hour); err != nil {
t.Fatal(err)
}
descriptor.AddReplica(roachpb.ReplicaDescriptor{
NodeID: roachpb.NodeID(i),
StoreID: roachpb.StoreID(i),
})
descriptor.AddReplica(roachpb.NodeID(i), roachpb.StoreID(i), roachpb.VOTER_FULL)
}

// Stub to be changed in each test case.
@@ -1416,9 +1414,10 @@ func TestSendRPCRangeNotFoundError(t *testing.T) {

// Fill RangeDescriptor with three replicas.
var descriptor = roachpb.RangeDescriptor{
RangeID: 1,
StartKey: roachpb.RKey("a"),
EndKey: roachpb.RKey("z"),
RangeID: 1,
StartKey: roachpb.RKey("a"),
EndKey: roachpb.RKey("z"),
NextReplicaID: 1,
}
for i := 1; i <= 3; i++ {
addr := util.MakeUnresolvedAddr("tcp", fmt.Sprintf("node%d", i))
@@ -1430,11 +1429,7 @@
t.Fatal(err)
}

descriptor.AddReplica(roachpb.ReplicaDescriptor{
NodeID: roachpb.NodeID(i),
StoreID: roachpb.StoreID(i),
ReplicaID: roachpb.ReplicaID(i),
})
descriptor.AddReplica(roachpb.NodeID(i), roachpb.StoreID(i), roachpb.VOTER_FULL)
}
descDB := mockRangeDescriptorDBForDescs(
testMetaRangeDescriptor,
