
[poc,dnm]: storage: use learner replicas #35787

Closed · wants to merge 1 commit

Conversation

tbg
Member

@tbg tbg commented Mar 15, 2019


This is a PR just to show some code to the interested parties. The real thing
to look at here is the explanation and suggested strategy below. Don't
review the code.

----

Learner replicas are full Replicas minus the right to vote (i.e.,
they don't count for quorum). They are interesting to us because they
allow us to phase out preemptive snapshots (via [delayed preemptive
snaps] as a migration strategy; see there for all the things we don't
like about preemptive snapshots), and because they can help us avoid
spending more time in vulnerable configurations than we need to.

To see an example of the latter, assume we're trying to upreplicate from
three replicas to five. As things stand, we need to add a fourth replica,
wait for a preemptive snapshot, and then add the fifth, so we spend
approximately the duration of one preemptive snapshot in an even replica
configuration. In theory, we could send two preemptive snapshots first
and then carry out two replica additions, but with learners this becomes
much more straightforward and less error-prone. This doesn't solve the
problems in cockroachdb#12768, but it helps avoid them.

This PR shows the bare minimum of code changes to upreplicate using
learner replicas and suggests further steps to make them a reality.

In principle, using learner replicas barely requires any changes.
Our current up-replication code is this:

1. send a preemptive snapshot
2. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor and, on commit, induces a Raft configuration
change (`raftpb.ConfChangeAddNode`).

The new up-replication code (note that the old one has to stick around
for compatibility reasons):

1. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor with the `Learner` flag set to true and, on
commit, induces a Raft configuration change
(`raftpb.ConfChangeAddLearnerNode`).
2. wait for the learner to have caught up, or proactively send it a Raft
snapshot (either works; we just have to make sure not to duplicate work).
3. run a ChangeReplicas txn which removes the `Learner` flag from the
`ReplicaDescriptor` and induces a Raft conf change of
`raftpb.ConfChangeAddNode` (upgrading the learner).
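The descriptor transitions in the three steps above can be sketched as follows. This is a minimal illustration, not the actual cockroach code: `ReplicaDescriptor`, `RangeDescriptor`, and the conf-change constants are simplified stand-ins for the real definitions in `roachpb` and `raftpb`.

```go
package main

import "fmt"

// Simplified stand-ins for the real roachpb/raftpb types.
type ConfChangeType int

const (
	ConfChangeAddNode ConfChangeType = iota
	ConfChangeAddLearnerNode
)

type ReplicaDescriptor struct {
	NodeID  int
	Learner bool
}

type RangeDescriptor struct {
	Replicas []ReplicaDescriptor
}

// addLearner models step 1: the ChangeReplicas txn adds the descriptor
// with Learner=true, inducing ConfChangeAddLearnerNode on commit.
func addLearner(d *RangeDescriptor, nodeID int) ConfChangeType {
	d.Replicas = append(d.Replicas, ReplicaDescriptor{NodeID: nodeID, Learner: true})
	return ConfChangeAddLearnerNode
}

// promoteLearner models step 3: a second ChangeReplicas txn clears the
// Learner flag, inducing ConfChangeAddNode (upgrading the learner).
func promoteLearner(d *RangeDescriptor, nodeID int) ConfChangeType {
	for i := range d.Replicas {
		if d.Replicas[i].NodeID == nodeID {
			d.Replicas[i].Learner = false
		}
	}
	return ConfChangeAddNode
}

func main() {
	desc := &RangeDescriptor{Replicas: []ReplicaDescriptor{{NodeID: 1}, {NodeID: 2}, {NodeID: 3}}}
	addLearner(desc, 4) // step 1
	// step 2 (not modeled): wait for the learner to catch up via Raft snapshot.
	promoteLearner(desc, 4) // step 3
	fmt.Println(desc.Replicas[3].Learner, len(desc.Replicas)) // prints "false 4"
}
```

Note that throughout the intermediate state the learner is present in the replicated range descriptor, which is what lets the rest of the system see it, unlike a preemptive snapshot.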

The existence of learners will require updates throughout the allocator
so that it realizes they don't count for quorum and ensures they are
either upgraded or removed in a timely manner. None of that is in this
POC.

[delayed preemptive snaps]: cockroachdb#35786

Release note: None
@tbg tbg requested a review from a team March 15, 2019 13:36
@cockroach-teamcity
Member

This change is Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Apr 5, 2019
Preemptive snapshots are sent to a Store (by another Store) as part of
the process of adding a new Replica to a Range. The sequence of events
is:

- send a preemptive snapshot (replicaID=0) to the target
- target creates a Replica from the preemptive snapshot (replicaID=0)
- allocate new replicaID and add the target officially under that
replicaID
- success (replicaID=nonzero)

They are problematic for a variety of reasons:

1. they introduce a Replica state, namely that of Replicas that have
   data but don't have a replicaID. Such replicas can't serve traffic
   and can't even have an initialized Raft group, so they're barely
   Replicas at all. Every bit of code in Replica needs to know about
   that.
2. the above state is implemented in an ad-hoc fashion and adds
   significantly to the complexity of the Store/Replica codebase.
3. Preemptive snapshots are subject to accidental garbage collection.
   There's currently no mechanism to decide whether a preemptive
   snapshot is simply waiting to be upgraded or whether it's abandoned.
   Accidental deletion causes another snapshot (this time Raft) to be
   sent.
4. Adding to 1., there are transitions between regular Replicas and
   preemptive snapshots that add additional complexity. For example,
   a regular snapshot can apply on top of a preemptive snapshot and
   vice versa. We try to prevent some of them but there are technical
   problems.
5. Preemptive snapshots have a range descriptor that doesn't include
   the Replica created from them. This is another gotcha that code
   needs to be aware of. (we cannot fix this in the first iteration,
   but it will be fixed when [learner replicas] are standard)

Our answer to all but the last of these problems is that we want to
remove the concept of preemptive snapshots altogether and instead rely
on [learner replicas]. This is a Raft concept denoting essentially a
member of a replication group without a vote. By replacing the
preemptive snapshot with the addition of a learner replica (before
upgrading to a full voting member), preemptive snapshots are replaced by
full replicas with a flag set.

However, as is often the case, the interesting question becomes that
of migration, that is, the possibility of running a mixed-version
cluster in which one node knows about these changes and another
doesn't. The basic requirement that falls out of this is that we
have to be able to send preemptive snapshots to followers even
using the new code, and we have to be able to receive preemptive
snapshots using the new code (though that code will go cold once
the cluster setting upgrade has happened).

Fortunately, sending and receiving preemptive snapshots is not what
makes them terrible. In fact, the code that creates and receives
preemptive snapshots is 100% shared with that for Raft snapshots. The
complexity surrounding preemptive snapshots comes from what happens when
they are used to create a Replica object too early, but this is an
implementation detail not visible across RPC boundaries.

This suggests investigating how we can receive preemptive snapshots
without actually using any of the internal code that handles them, so
that this code can be removed in 19.2. The basic idea is that we will
write the preemptive snapshot to a temporary location (instead of
creating a Replica from it) and apply it as a Raft snapshot the moment
we observe a local Replica for the matching RangeID created as a full
member of the Raft group (i.e., with nonzero replicaID).

This is carried out in this PR. Preemptive snapshots are put into a
temporary in-memory map the size of which we aggressively keep under
control (and which is cleared out periodically). Replica objects with
replicaID zero are no longer instantiated.
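The staging scheme described above can be sketched like this. All names here are invented for illustration; the real implementation lives in the Store code and would also clear the map out periodically, as the text notes.

```go
package main

import (
	"fmt"
	"sync"
)

type RangeID int

// stagedSnapshots parks incoming preemptive snapshots in a bounded
// in-memory map keyed by RangeID instead of instantiating a
// replicaID-zero Replica from them.
type stagedSnapshots struct {
	mu      sync.Mutex
	maxSize int
	m       map[RangeID][]byte
}

func newStagedSnapshots(maxSize int) *stagedSnapshots {
	return &stagedSnapshots{maxSize: maxSize, m: make(map[RangeID][]byte)}
}

// stage stores a preemptive snapshot, refusing to grow the map past
// maxSize entries (aggressively keeping its size under control).
func (s *stagedSnapshots) stage(id RangeID, data []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.m[id]; !ok && len(s.m) >= s.maxSize {
		return false
	}
	s.m[id] = data
	return true
}

// onReplicaCreated is called when a Replica comes into existence with a
// nonzero replicaID; it hands back any staged snapshot (to be applied
// as a Raft snapshot) and removes it from the map.
func (s *stagedSnapshots) onReplicaCreated(id RangeID, replicaID int) ([]byte, bool) {
	if replicaID == 0 {
		return nil, false // replicaID-zero Replicas are no longer instantiated
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.m[id]
	delete(s.m, id)
	return data, ok
}

func main() {
	staged := newStagedSnapshots(8)
	staged.stage(17, []byte("snapshot-bytes"))
	data, ok := staged.onReplicaCreated(17, 3)
	fmt.Println(ok, string(data)) // prints "true snapshot-bytes"
}
```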

See the companion POC [learner replicas] which doesn't bother about the
migration but explores actually using learner replicas. When learner
replicas are standard, 5. above is also mostly addressed: the replica
will always be contained in its range descriptor, even though it may be
as a learner.

TODO(tbg): preemptive snapshots stored on disk before this PR need to
be deleted before we instantiate a Replica from them (because after
this PR that will fail).

[learner replicas]: cockroachdb#35787

Release note: None
@tbg
Member Author

tbg commented May 7, 2019

@danhhz another thing to think about here is the quota pool -- how should it interact with learners? The argument against blocking progress on a slow learner is that the learner shouldn't be able to impair foreground traffic, especially not while it's coming online. The argument for blocking is that once the learner gets upgraded to a full member of the group it will be able to, so it's silly not to have that be the case earlier.

I think the learner needs to be in sort of a catch-up phase -- if it's waiting for its initial snapshot it isn't taken into account for the quota pool, but once it has received a snapshot it will be. I think that's straightforward to implement (the quota pool tracks the max acked index from each follower, so when that's not zero, the learner is active).
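The catch-up rule proposed here can be sketched as below. The names are invented for illustration (the actual quota pool lives in `pkg/storage`); the point is only the rule: a learner whose acked index is still zero is excluded from the minimum, and once it has acked anything it participates like a voter.

```go
package main

import (
	"fmt"
	"math"
)

type follower struct {
	learner  bool
	ackedIdx uint64 // highest log index acked by this follower; 0 = nothing yet
}

// quotaReleaseIndex returns the index up to which quota may be
// released: the minimum acked index over all voters and over learners
// that have begun catching up (ackedIdx > 0). A learner still waiting
// for its initial snapshot is ignored, so it cannot stall foreground
// traffic.
func quotaReleaseIndex(followers []follower) uint64 {
	min := uint64(math.MaxUint64)
	for _, f := range followers {
		if f.learner && f.ackedIdx == 0 {
			continue // learner in catch-up phase: not counted
		}
		if f.ackedIdx < min {
			min = f.ackedIdx
		}
	}
	return min
}

func main() {
	fs := []follower{{ackedIdx: 100}, {ackedIdx: 95}, {learner: true, ackedIdx: 0}}
	fmt.Println(quotaReleaseIndex(fs)) // learner ignored, prints "95"
}
```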

@tbg
Member Author

tbg commented May 7, 2019

Also, log truncation. Here I think the learner should be treated like an ordinary follower, which means that it may delay log truncation. This is fine; our past reason for being very aggressive about log truncation (avoiding large snapshots) no longer holds, since we stopped sending the raft log as part of snapshots.
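A minimal sketch of that rule, with invented names: the truncation index is the minimum match index over all replicas, learners included, so the log is retained until the learner has caught up.

```go
package main

import "fmt"

// truncationIndex returns the highest index that is safe to truncate:
// the minimum match index over all replicas. Learners' match indexes
// are included just like voters', so a lagging learner delays
// truncation rather than forcing another snapshot.
func truncationIndex(matchIdx []uint64) uint64 {
	if len(matchIdx) == 0 {
		return 0
	}
	min := matchIdx[0]
	for _, m := range matchIdx[1:] {
		if m < min {
			min = m
		}
	}
	return min
}

func main() {
	// voters at 100 and 98, learner at 40: truncation waits at 40.
	fmt.Println(truncationIndex([]uint64{100, 98, 40})) // prints "40"
}
```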

@danhhz danhhz self-assigned this May 7, 2019
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
@tbg
Member Author

tbg commented Aug 13, 2019

Obsolete.

@tbg tbg closed this Aug 13, 2019