raft: proactively probe newly added followers #11037

Merged 8 commits on Aug 16, 2019

@tbg tbg commented Aug 15, 2019

When the leader applied a new configuration that added voters, it would not
immediately probe these voters, delaying when they would be caught up.
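The effect can be illustrated with a toy model of how a leader decides whether it has anything to send to a follower. This is only a sketch: `entriesToSend` and `wouldProbe` are hypothetical helpers, not functions from the raft package, and it assumes that an append carrying no entries is skipped. Under that assumption, a follower tracked with a next index of `lastIndex+1` receives nothing until a new entry is proposed or a heartbeat fires, while any earlier starting index (the last index is used here purely for illustration) gives the leader something to send right away.

```go
package main

import "fmt"

// entriesToSend returns how many log entries fall in the range
// [next, lastIndex], i.e. how many entries an append to this follower would
// carry. Hypothetical helper for illustration only.
func entriesToSend(next, lastIndex uint64) uint64 {
	if next > lastIndex {
		return 0
	}
	return lastIndex - next + 1
}

// wouldProbe models a leader that skips appends with no entries (an
// assumption of this sketch): the follower is contacted only if there is at
// least one entry to send.
func wouldProbe(next, lastIndex uint64) bool {
	return entriesToSend(next, lastIndex) > 0
}

func main() {
	const lastIndex = 4

	// Follower tracked at lastIndex+1: nothing to send, so no immediate probe.
	fmt.Println(wouldProbe(lastIndex+1, lastIndex)) // false

	// Follower tracked at lastIndex: one entry to send, so it is probed.
	fmt.Println(wouldProbe(lastIndex, lastIndex)) // true
}
```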

I noticed this while writing the interaction-driven tests included in this PR.
Various fixes to the infrastructure were also made.

I also added a _breakpoint directive. This is a helper case to attach a
debugger to when a problem needs to be investigated in a longer test file. In
such a case, add the following stanza immediately before the interesting
behavior starts:

_breakpoint:
----
ok

and set a breakpoint on the _breakpoint case.
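A minimal sketch of what such a helper case might look like in a directive dispatcher follows. The function name `handle` and its signature are assumptions made for this example and are not taken from the PR; only the `_breakpoint` directive name and its `ok` output come from the description above.

```go
package main

import "fmt"

// handle dispatches one test directive by name and returns its output.
// Hypothetical dispatcher for illustration; the real test harness differs.
func handle(cmd string) (string, error) {
	switch cmd {
	case "_breakpoint":
		// Intentionally does nothing. It exists so that a debugger
		// breakpoint set on this case fires exactly when the test file
		// reaches the stanza.
		return "ok", nil
	default:
		return "", fmt.Errorf("unknown command %q", cmd)
	}
}

func main() {
	out, err := handle("_breakpoint")
	fmt.Println(out, err)
}
```

Because the case has no side effects, the stanza can be added to and removed from a test file without changing any other expected output.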

tbg added 4 commits August 14, 2019 17:24
It was bailing out too early.
Initializing at LastIndex+1 meant that new peers would not be probed
immediately when they appeared in the leader's config, which delays
their getting caught up.
It is a helper case to attach a debugger to when a problem needs
to be investigated in a longer test file. In such a case, add the
following stanza immediately before the interesting behavior starts:

_breakpoint:
----
ok

and set a breakpoint on the _breakpoint case.
When the leader applied a new configuration that added voters, it would
not immediately probe these voters, delaying when they would be caught
up.

I noticed this while writing an interaction-driven test, which has now
been cleaned up and completed.
@tbg tbg requested a review from bdarnell August 15, 2019 10:40

tbg commented Aug 15, 2019

@nvanbenschoten I hear @bdarnell is on vacation, would you mind taking a look at this?

tbg commented Aug 15, 2019

I filed #11038 about the leadership removal bug.

INFO 2 [logterm: 1, index: 4] sent MsgVote request to 3 at term 2

# n2 is now campaigning while n1 is down (does not respond). The latest config
# has n1 as a voter, but n1 doesn't even have the corresponding conf change in
Contributor

has n1 as a voter

Should this be n3?

Contributor Author

Yep, done

----
ok (quiet)

propose-conf-change 1
Contributor

Add a comment here. I spent longer than I'd like to admit wondering where n3 was becoming a voter.

I don't know whether there's necessarily anything to do, but it is very easy to get the terms v1 and v2 confused between their meanings as voters 1 and 2 and conf change versions 1 and 2.

Contributor Author

Done throughout.

Contributor Author

I should revisit the conf change notation, I agree that this is confusing. Will leave it alone in this PR though.

INFO 2 [logterm: 1, index: 4] sent MsgVote request to 3 at term 2

# n2 is now campaigning while n1 is down (does not respond). The latest config
# has n1 as a voter, but n1 doesn't even have the corresponding conf change in
Contributor

but n1 doesn't even have the corresponding conf change in its log

Doesn't it? Don't we see that with:

1->3 MsgApp Term:1 Log:1/3 Commit:3 Entries:[1/4 EntryConfChangeV2 v3]
1->3 MsgApp Term:1 Log:1/4 Commit:4
2->3 MsgVote Term:2 Log:1/4

Contributor Author

🙈 I needed to explicitly drop the messages before. Fixed now along with some more cleanup, and I restructured the test so that we see the vote response clearly in its own block.


xiang90 commented Aug 15, 2019

The idea sounds reasonable. Leaving the actual review to @nvanbenschoten :P.

tbg added 4 commits August 16, 2019 09:38
Verify the behavior in various v1 and v2 conf change operations.
This also includes various fixups, notably it adds protection
against transitioning in and out of new configs when this is not
permissible.

There are more threads to pull, but those are left for future commits.
When a leader removes itself, it will retain its leadership but not
accept new proposals, making the range effectively stuck until manual
intervention triggers a campaign event.

This commit documents the behavior. It does not correct it yet.
It was confusing to see the effects of the Ready (i.e. log messages)
printed before the Ready itself.
@tbg tbg left a comment

Thanks @nvanbenschoten.

@tbg tbg merged commit 4a2b4c8 into etcd-io:master Aug 16, 2019
@tbg tbg deleted the interactive branch August 16, 2019 08:24
tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2019
This picks up upstream fixes related to atomic membership changes.

I had to smuggle in a small hack because we're picking up
etcd-io/etcd#11037 which makes a race between
the snapshot queue and the proactive learner snapshot much more likely,
and this in turn makes tests quite flaky because it turns out that if
the learner snap loses, it can actually error out.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Aug 26, 2019
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for and ultimately
allow carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations
during lateral data movement. See #12768 for details; this is a problem we want
to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which
has made a bug in our code related to learner snapshots much more likely; the
offending upstream commit is patched out of the vendored etcd bump in this PR
at the time of writing.

An antichronological listing of the individual commits follows. They should be
reviewed individually, though it may be helpful to look at the overall diff for
overall context. A modest amount of churn may exist between the commits, though
a good deal of effort went into avoiding this.

    storage: allow atomic replication changes in ChangeReplicas

    They default to OFF.

    This needs a lot more tests which will be added separately in the course of
    switching the default to ON and will focus on the interactions of joint
    states with everything else in the system.

    We'll also need another audit of consumers of the replica descriptors to
    make sure nothing was missed in the first pass.

    Release note: None

    storage: fix replicaGCQueue addition on removal trigger

    Once we enter joint changes, the replica to be removed will show up in
    `crt.Removed()` when the joint state is entered, but it only becomes
    eligible for actual removal when we leave the joint state later. The new
    code triggers at the right time, namely when the replica is no longer in
    the descriptor.

    Release note: None

    storage: let execChangeReplicasTxn construct the descriptor

    Prior to this commit, the method took both an old and a new desc *plus*
    slices of added and removed replicas. This had grown organically, wasn't an
    easily understood interface, led to repetitive and tricky code at the
    callers, and most importantly isn't adequate any more in a world with
    atomic replication changes, where execChangeReplicasTxn in constructing the
    ChangeReplicasTrigger is essentially deciding whether a joint configuration
    needs to be entered (which in turn determines what the descriptor needs to
    look like in the first place). To start solving this, let
    execChangeReplicasTxn create (and on success return) the new descriptor.
    Callers instead pass in what they want to be done, which is accomplished
    via an []internalReplicationChange slice.

    Release note: None

    roachpb: auto-assign ReplicaID during AddReplica

    This is a cleanup leading up to a larger refactor of the contract around
    `execChangeReplicasTxn`.

    Release note: None

    storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

    This prepares the trigger -> raft translation code to properly handle
    atomic replication changes.

    This carries out a lot of validation to give us confidence that any unusual
    transitions would be caught quickly.

    This change also establishes more clearly which added and removed replicas
    are to be passed into the trigger when transitioning into a joint
    configuration. For example, when adding a voter, one technically replaces a
    Learner with a VoterIncoming and so the question is which type the replica
    in the `added` slice should have.  Picking the Learner would give the
    trigger the most power to validate the input, but it's annoying to have
    divergent descriptors floating around, so by convention we say that it is
    always the updated version of the descriptor (i.e. for fully removed
    replicas, just whatever it was before it disappeared). I spent more time on
    this than I'm willing to admit, in particular looking into removing the
    redundancy here, but it made things more awkward than was worth it.

    Release note: None

    storage: push replication change unrolling into ChangeReplicas

    There are various callers to ChangeReplicas, so it makes more sense to
    unroll at that level. The code was updated to - in principle - do the right
    thing when atomic replication changes are requested, except that they are
    still unimplemented and a fatal error will serve as a reminder of that. Of
    course nothing issues them yet.

    Release note: None

    storage: skip ApplyConfChange on rejected entry

    When in a joint configuration, passing an empty conf change to
    ApplyConfChange doesn't do the right thing any more: it tells Raft that
    we're leaving the joint config. It's not a good idea to try to tell Raft
    anything about a ConfChange that got rejected. Raft internally knows that
    we handled it because it knows the applied index.

    This also adds a case match for ConfChangeV2 which is necessary to route
    atomic replication changes (ConfChangeV2).

    See etcd-io/etcd#11046

    Release note: None

    storage: un-embed decodedConfChange

    I ate a number of NPEs during development because nullable embedded fields
    are tricky; they hide the pointer derefs that often need a nil check. We'll
    embed the fields of decodedConfChange instead which works out better. This
    commit also adds the unmarshaling code necessary for ConfChangeV2 needed
    once we issue atomic replication changes.

    Release note: None

    storage: add learners one by one

    Doing more than one change at once is going to force us into an atomic
    replication change. This isn't crazy, but seems unnecessary at this point,
    so just add the learners one by one.

    Release note: None

    storage: add fatals where atomic conf changes are unsupported

    These will be upgraded with proper handling when atomic replication changes
    are actually introduced, but for now it's convenient to stub out some code
    that will need to handle them and to make sure we won't forget to do so
    later.

    Release note: None

    storage: add atomic replication changes cluster setting

    This defaults to false, and won't have an effect unless the newly
    introduced cluster version is also active.

    Release note: None

    roachpb: support zero-change ChangeReplicasTrigger

    We will use a ChangeReplicasTrigger without additions and removals when
    transitioning out of a joint configuration, so make sure it supports this
    properly.

    Release note: None

    roachpb: return "desired" voters from ReplicaDescriptors.Voters

    Previous commits introduced (yet unused) voter types to encode joint
    consensus configurations which occur during atomic replication changes.

    Access to the slice of replicas is unfortunately common, though at least
    it's compartmentalized via the getters Voters() and Learners().

    The main problem solved in this commit is figuring out what should be
    returned from Voters(): is it all VoterX types, or only voters in one of
    the two majority configs part of a joint quorum?

    The useful answer is returning the set of voters corresponding to what the
    config will be once the joint state is exited; this happens to be what most
    callers care about. Incoming and full voters are really the same thing in
    our code; we just need to distinguish them from outgoing voters to
    correctly maintain the quorum sizes.

    Of course there are some callers that do care about quorum sizes, and a
    number of cleanups were made for them.

    This commit also adds a ReplicaDescriptors.ConfState helper which is then
    used in all of the places that were previously cobbling together a
    ConfState manually.

    Release note: None
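The choice described in this commit message can be sketched as a filter over replica types. The type and function names below (`ReplicaType`, `Replica`, `desiredVoters`) are hypothetical stand-ins chosen for this example, not the roachpb definitions. The sketch assumes four kinds of replica (full voter, incoming voter, outgoing voter, learner) and returns only those that will be voters once the joint state is exited.

```go
package main

import "fmt"

// ReplicaType is a hypothetical stand-in for the replica type enum.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	Learner
)

// Replica is a hypothetical, pared-down replica descriptor.
type Replica struct {
	ID   int
	Type ReplicaType
}

// desiredVoters returns the replicas that will be voters after the joint
// configuration is left: full and incoming voters. Outgoing voters and
// learners are excluded.
func desiredVoters(replicas []Replica) []Replica {
	var out []Replica
	for _, r := range replicas {
		if r.Type == VoterFull || r.Type == VoterIncoming {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	replicas := []Replica{
		{ID: 1, Type: VoterFull},
		{ID: 2, Type: VoterOutgoing},
		{ID: 3, Type: VoterIncoming},
		{ID: 4, Type: Learner},
	}
	for _, r := range desiredVoters(replicas) {
		fmt.Println("voter after leaving joint state:", r.ID)
	}
}
```

Callers that care about quorum sizes would still need to look at the outgoing voters separately, which this filter deliberately hides.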

    roachpb: add ReplicaType_Voter{Incoming,Outgoing}

    These are required for atomic replication changes to describe joint
    configurations, i.e. configurations consisting of two sets of replicas which
    both need to reach quorum to make replication decisions.

    An audit of existing consumers of this enum will follow.

    Release note: None
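As a rough illustration of what "both need to reach quorum" means, the sketch below checks a set of acknowledgements against two voter sets. The helper names `hasMajority` and `jointQuorum` are invented for this example, and both voter sets are assumed to be non-empty. A decision passes only if a strict majority of each set has acknowledged it.

```go
package main

import "fmt"

// hasMajority reports whether acks covers a strict majority of voters.
// Assumes voters is non-empty.
func hasMajority(voters []uint64, acks map[uint64]bool) bool {
	n := 0
	for _, id := range voters {
		if acks[id] {
			n++
		}
	}
	return n > len(voters)/2
}

// jointQuorum reports whether acks reaches a majority in both the incoming
// and the outgoing voter set, as required while in a joint configuration.
func jointQuorum(incoming, outgoing []uint64, acks map[uint64]bool) bool {
	return hasMajority(incoming, acks) && hasMajority(outgoing, acks)
}

func main() {
	outgoing := []uint64{1, 2, 3}
	incoming := []uint64{2, 3, 4}

	// 1 and 2 form a majority of the outgoing set only.
	fmt.Println(jointQuorum(incoming, outgoing, map[uint64]bool{1: true, 2: true})) // false

	// 2 and 3 form a majority of both sets.
	fmt.Println(jointQuorum(incoming, outgoing, map[uint64]bool{2: true, 3: true})) // true
}
```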

    roachpb: rename ReplicaType variants

    The current naming is idiomatic for proto enums, but atypical for its usage
    in Go code. There is no `(gogoproto.customname)` that can fix this, and
    we're about to add more replica types that would require awkward names such
    as `roachpb.ReplicaType_VOTER_OUTGOING`.

    Switch to a Go-friendly naming scheme instead.

    Release note: None

    batcheval: generalize checkNotLearnerReplica

    This now errors out whenever the replica is not a voter, which is more
    robust as new replica types are introduced (which generally should not
    automatically become eligible to receive leases).

    Release note: None

    roachpb: improve RangeDescriptor.Validate

    Make sure there isn't more than one replica per store.

    Release note: None

    roachpb: generalize ReplicaDescriptor.String()

    The new code will generalize to new replica types.

    Release note: None

    [dnm] vendor: bump raft

    This picks up upstream fixes related to atomic membership changes.

    I had to smuggle in a small hack because we're picking up
    etcd-io/etcd#11037 which makes a race between the
    snapshot queue and the proactive learner snapshot much more likely, and
    this in turn makes tests quite flaky because it turns out that if the
    learner snap loses, it can actually error out.

    Release note: None

    storage: avoid fatal error from splitPostApply

    This is the next band-aid on top of #39658 and #39571. The descriptor
    lookup I added sometimes fails because replicas can process a split trigger
    in which they're not a member of the range:

    > F190821 15:14:28.241623 312191 storage/store.go:2172
    > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
    > found in right hand side of split

    I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

    Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for
multi-node cockroach demo clusters.

Co-authored-by: Tobias Schottdorf
Co-authored-by: Rohan Yadav