storage: use raft learners in replica addition, defaulted off
A learner is a participant in a raft group that accepts messages but doesn't
vote. This means it doesn't affect raft quorum and thus doesn't affect
the fragility of the range, even if it's very far behind or many
learners are down.

At the time of writing, learners are used in CockroachDB as an interim
state while adding a replica. A learner replica is added to the range
via raft ConfChange, a raft snapshot (of type LEARNER) is sent to catch
it up, and then a second ConfChange promotes it to a full replica.
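
To make the shape of that cycle concrete, here is a minimal sketch in terms of
the etcd/raft ConfChange types this change builds on. The raftpb types are
real; proposeConfChange and sendLearnerSnapshot are hypothetical stand-ins for
the coordinator's plumbing, not functions from this commit.

package replication

import "go.etcd.io/etcd/raft/raftpb"

// addReplicaViaLearner sketches the two-ConfChange replica addition. The gaps
// between the steps are exactly where a coordinator crash can leave an
// orphaned learner behind.
func addReplicaViaLearner(
    target uint64,
    proposeConfChange func(raftpb.ConfChange) error,
    sendLearnerSnapshot func(nodeID uint64) error,
) error {
    // 1. Add the target as a non-voting learner; quorum size is unchanged.
    if err := proposeConfChange(raftpb.ConfChange{
        Type:   raftpb.ConfChangeAddLearnerNode,
        NodeID: target,
    }); err != nil {
        return err
    }
    // 2. Catch the learner up with a raft snapshot (of type LEARNER).
    if err := sendLearnerSnapshot(target); err != nil {
        return err
    }
    // 3. Promote: proposing the same node again as a voter turns the learner
    // into a full member of the raft group.
    return proposeConfChange(raftpb.ConfChange{
        Type:   raftpb.ConfChangeAddNode,
        NodeID: target,
    })
}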

This means that learners are currently always expected to have a short
lifetime, approximately the time it takes to send a snapshot. Ideas have
been kicked around to use learners with follower reads, which could be a
cheap way to allow many geographies to have local reads without
affecting write latencies. If implemented, these learners would have
long lifetimes.

For simplicity, CockroachDB treats learner replicas the same as voter
replicas as much as possible, but there are a few exceptions:

- Learner replicas are not considered when calculating quorum size, and
  thus do not affect the computation of which ranges are
  under-replicated for upreplication/alerting/debug/etc purposes. Ditto
  for over-replicated.
- Learner replicas cannot become raft leaders, so we also don't allow
  them to become leaseholders. As a result, DistSender and the various
  oracles don't try to send them traffic.
- The raft snapshot queue does not send snapshots to learners for
  reasons described below.
- Merges won't run while a learner replica is present.

Replicas are now added in two ConfChange transactions. The first creates
the learner and the second promotes it to a voter. If the node that is
coordinating this dies in the middle, we're left with an orphaned
learner. For this reason, the replicate queue always first removes any
learners it sees before doing anything else. We could instead try to
finish off the learner snapshot and promotion, but this is more
complicated and it's not yet clear the efficiency win is worth it.

This introduces some rare races, either between the replicate queue and
AdminChangeReplicas or when a range's lease is moved to a new owner while
the old leaseholder is still processing the range in the replicate queue.
These races are handled by retrying if a learner disappears during the
snapshot/promotion.
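
As a sketch of that retry, under the assumption of a sentinel error
(errLearnerRemoved and the add closure are illustrative, not identifiers from
this commit):

package replication

import "errors"

// errLearnerRemoved stands in for "the learner we were working on is no longer
// in the range descriptor", e.g. because another node's replicate queue
// removed it out from under us.
var errLearnerRemoved = errors.New("learner replica removed mid-addition")

// addWithRetry restarts the whole learner+snapshot+voter cycle if the learner
// vanishes, and surfaces any other error (or success) to the caller.
func addWithRetry(add func() error) error {
    for {
        if err := add(); err != errLearnerRemoved {
            return err
        }
        // Raced with the replicate queue (or a new leaseholder); start over.
    }
}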

If the coordinator otherwise encounters an error while sending the
learner snapshot or promoting it (which can happen for a number of
reasons, including the node receiving the learner going away), it tries
to clean up after itself by rolling back the addition of the learner.
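
A sketch of that cleanup path, with hypothetical closures standing in for the
real steps (none of these names come from the commit):

package replication

// runLearnerCycle shows the rollback: if sending the snapshot or promoting the
// learner fails, the coordinator removes the learner it added rather than
// leaving it for the replicate queue to garbage-collect later.
func runLearnerCycle(addLearner, snapshotAndPromote, rollbackLearner func() error) error {
    if err := addLearner(); err != nil {
        return err // nothing to clean up yet.
    }
    if err := snapshotAndPromote(); err != nil {
        // Best-effort rollback; if this also fails (or the node crashes), the
        // replicate queue eventually removes the orphaned learner.
        _ = rollbackLearner()
        return err
    }
    return nil
}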

There is another race between the learner snapshot being sent and the
raft snapshot queue happening to check the replica at the same time,
also sending it a snapshot. This is safe but wasteful, so the raft
snapshot queue won't try to send snapshots to learners.
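
The guard can be as simple as this sketch (shouldQueueRaftSnapshot is a
hypothetical wrapper; GetType and ReplicaType_LEARNER are the ReplicaDescriptor
accessors used elsewhere in this diff):

package replication

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// shouldQueueRaftSnapshot sketches the learner check: the coordinator that
// added the learner is already sending it a snapshot, so queueing another one
// here would only duplicate that work.
func shouldQueueRaftSnapshot(rd roachpb.ReplicaDescriptor) bool {
    return rd.GetType() != roachpb.ReplicaType_LEARNER
}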

Merges are blocked if either side has a learner (to avoid working out
the edge cases) but it's historically turned out to be a bad idea to get
in the way of splits, so we allow them even when some of the replicas
are learners. This orphans a learner on each side of the split (the
original coordinator will not be able to finish either of them), but the
replication queue will eventually clean them up.
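
A sketch of the merge-side guard (checkCanMerge is hypothetical; Learners() is
the ReplicaDescriptors accessor documented in the metadata_replicas.go diff
below):

package replication

import (
    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/pkg/errors"
)

// checkCanMerge sketches the policy: refuse to merge while either side of the
// merge has a learner, but place no such restriction on splits.
func checkCanMerge(lhs, rhs *roachpb.RangeDescriptor) error {
    if len(lhs.Replicas().Learners()) > 0 || len(rhs.Replicas().Learners()) > 0 {
        return errors.Errorf("cannot merge ranges with learner replicas: %s, %s", lhs, rhs)
    }
    return nil
}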

Learner replicas don't affect quorum but they do affect the system in
other ways. The most obvious way is that the leader sends them the raft
traffic it would send to any follower, consuming resources. More
surprising is that once the learner has received a snapshot, it's
considered by the quota pool that prevents the raft leader from getting
too far ahead of the followers. This is because a learner (especially
one that already has a snapshot) is expected to very soon be a voter, so
we treat it like one. However, it means a slow learner can slow down
regular traffic, which is possibly counterintuitive.
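
A toy sketch of the quota pool's view (the function and its inputs are purely
illustrative, not the actual quota pool code):

package replication

import "math"

// slowestTrackedIndex illustrates why a slow learner can throttle writes: the
// leader may only run ahead of the minimum acknowledged log index across the
// voters plus any learner that has already received its snapshot.
func slowestTrackedIndex(voterMatch, learnerMatch []uint64) uint64 {
    min := uint64(math.MaxUint64)
    for _, idx := range voterMatch {
        if idx < min {
            min = idx
        }
    }
    for _, idx := range learnerMatch {
        if idx == 0 {
            continue // learner hasn't received its snapshot yet; ignored.
        }
        if idx < min {
            min = idx
        }
    }
    return min
}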

Release note (general change): Replicas are now added using a raft
learner and going through the normal raft snapshot process to catch them
up, eliminating technical debt. No user-facing changes are expected.
danhhz committed Jul 15, 2019
1 parent 4857fa5 commit 2f483d3
Showing 27 changed files with 1,189 additions and 105 deletions.
3 changes: 2 additions & 1 deletion docs/generated/settings/settings.html
@@ -38,6 +38,7 @@
<tr><td><code>kv.closed_timestamp.target_duration</code></td><td>duration</td><td><code>30s</code></td><td>if nonzero, attempt to provide closed timestamp notifications for timestamps trailing cluster time by approximately this duration</td></tr>
<tr><td><code>kv.follower_read.target_multiple</code></td><td>float</td><td><code>3</code></td><td>if above 1, encourages the distsender to perform a read against the closest replica if a request is older than kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * this) less a clock uncertainty interval. This value also is used to create follower_timestamp(). (WARNING: may compromise cluster stability or correctness; do not edit without supervision)</td></tr>
<tr><td><code>kv.import.batch_size</code></td><td>byte size</td><td><code>32 MiB</code></td><td>the maximum size of the payload in an AddSSTable request (WARNING: may compromise cluster stability or correctness; do not edit without supervision)</td></tr>
<tr><td><code>kv.learner_replicas.enabled</code></td><td>boolean</td><td><code>false</code></td><td>use learner replicas for replica addition</td></tr>
<tr><td><code>kv.raft.command.max_size</code></td><td>byte size</td><td><code>64 MiB</code></td><td>maximum size of a raft command</td></tr>
<tr><td><code>kv.raft_log.disable_synchronization_unsafe</code></td><td>boolean</td><td><code>false</code></td><td>set to true to disable synchronization on Raft log writes to persistent storage. Setting to true risks data loss or data corruption on server crashes. The setting is meant for internal testing only and SHOULD NOT be used in production.</td></tr>
<tr><td><code>kv.range.backpressure_range_size_multiplier</code></td><td>float</td><td><code>2</code></td><td>multiple of range_max_bytes that a range is allowed to grow to without splitting before writes to that range are blocked, or 0 to disable</td></tr>
@@ -120,6 +121,6 @@
<tr><td><code>trace.debug.enable</code></td><td>boolean</td><td><code>false</code></td><td>if set, traces for recent requests can be seen in the /debug page</td></tr>
<tr><td><code>trace.lightstep.token</code></td><td>string</td><td><code></code></td><td>if set, traces go to Lightstep using this token</td></tr>
<tr><td><code>trace.zipkin.collector</code></td><td>string</td><td><code></code></td><td>if set, traces go to the given Zipkin instance (example: '127.0.0.1:9411'); ignored if trace.lightstep.token is set</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-5</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-6</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
</tbody>
</table>
14 changes: 14 additions & 0 deletions pkg/roachpb/metadata.go
@@ -270,6 +270,20 @@ func (r ReplicationTarget) String() string {
return fmt.Sprintf("n%d,s%d", r.NodeID, r.StoreID)
}

// ReplicaTypeLearner returns a ReplicaType_LEARNER pointer suitable for use in
// a nullable proto field.
func ReplicaTypeLearner() *ReplicaType {
t := ReplicaType_LEARNER
return &t
}

// ReplicaTypeVoter returns a ReplicaType_VOTER pointer suitable for use in a
// nullable proto field.
func ReplicaTypeVoter() *ReplicaType {
t := ReplicaType_VOTER
return &t
}

func (r ReplicaDescriptor) String() string {
var buf bytes.Buffer
fmt.Fprintf(&buf, "(n%d,s%d):", r.NodeID, r.StoreID)
80 changes: 78 additions & 2 deletions pkg/roachpb/metadata_replicas.go
@@ -10,7 +10,10 @@

package roachpb

import "sort"
import (
"fmt"
"sort"
)

// ReplicaDescriptors is a set of replicas, usually the nodes/stores on which
// replicas of a range are stored.
@@ -54,6 +57,77 @@ func (d ReplicaDescriptors) Voters() []ReplicaDescriptor {
}

// Learners returns the learner replicas in the set.
//
// A learner is a participant in a raft group that accepts messages but doesn't
// vote. This means it doesn't affect raft quorum and thus doesn't affect the
// fragility of the range, even if it's very far behind or many learners are
// down.
//
// At the time of writing, learners are used in CockroachDB as an interim state
// while adding a replica. A learner replica is added to the range via raft
// ConfChange, a raft snapshot (of type LEARNER) is sent to catch it up, and
// then a second ConfChange promotes it to a full replica.
//
// This means that learners are currently always expected to have a short
// lifetime, approximately the time it takes to send a snapshot. Ideas have been
// kicked around to use learners with follower reads, which could be a cheap way
// to allow many geographies to have local reads without affecting write
// latencies. If implemented, these learners would have long lifetimes.
//
// For simplicity, CockroachDB treats learner replicas the same as voter
// replicas as much as possible, but there are a few exceptions:
//
// - Learner replicas are not considered when calculating quorum size, and thus
// do not affect the computation of which ranges are under-replicated for
// upreplication/alerting/debug/etc purposes. Ditto for over-replicated.
// - Learner replicas cannot become raft leaders, so we also don't allow them to
// become leaseholders. As a result, DistSender and the various oracles don't
// try to send them traffic.
// - The raft snapshot queue does not send snapshots to learners for reasons
// described below.
// - Merges won't run while a learner replica is present.
//
// Replicas are now added in two ConfChange transactions. The first creates the
// learner and the second promotes it to a voter. If the node that is
// coordinating this dies in the middle, we're left with an orphaned learner.
// For this reason, the replicate queue always first removes any learners it
// sees before doing anything else. We could instead try to finish off the
// learner snapshot and promotion, but this is more complicated and it's not yet
// clear the efficiency win is worth it.
//
// This introduces some rare races, either between the replicate queue and
// AdminChangeReplicas or when a range's lease is moved to a new owner while the
// old leaseholder is still processing the range in the replicate queue. These
// races are handled by retrying if a learner disappears during the
// snapshot/promotion.
//
// If the coordinator otherwise encounters an error while sending the learner
// snapshot or promoting it (which can happen for a number of reasons, including
// the node receiving the learner going away), it tries to clean up after itself
// by rolling back the addition of the learner.
//
// There is another race between the learner snapshot being sent and the raft
// snapshot queue happening to check the replica at the same time, also sending
// it a snapshot. This is safe but wasteful, so the raft snapshot queue won't
// try to send snapshots to learners.
//
// Merges are blocked if either side has a learner (to avoid working out the
// edge cases) but it's historically turned out to be a bad idea to get in the
// way of splits, so we allow them even when some of the replicas are learners.
// This orphans a learner on each side of the split (the original coordinator
// will not be able to finish either of them), but the replication queue will
// eventually clean them up.
//
// Learner replicas don't affect quorum but they do affect the system in other
// ways. The most obvious way is that the leader sends them the raft traffic it
// would send to any follower, consuming resources. More surprising is that once
// the learner has received a snapshot, it's considered by the quota pool that
// prevents the raft leader from getting too far ahead of the followers. This is
// because a learner (especially one that already has a snapshot) is expected to
// very soon be a voter, so we treat it like one. However, it means a slow
// learner can slow down regular traffic, which is possibly counterintuitive.
//
// For some related mega-comments, see Replica.sendSnapshot.
func (d ReplicaDescriptors) Learners() []ReplicaDescriptor {
// Note that the wrapped replicas are sorted first by type.
for i := range d.wrapped {
@@ -64,7 +138,9 @@ func (d ReplicaDescriptors) Learners() []ReplicaDescriptor {
return nil
}

var _, _ = ReplicaDescriptors.All, ReplicaDescriptors.Learners
func (d ReplicaDescriptors) String() string {
return fmt.Sprintf("%s", d.AsProto())
}

// AsProto returns the protobuf representation of these replicas, suitable for
// setting the InternalReplicas field of a RangeDescriptor. When possible the
6 changes: 6 additions & 0 deletions pkg/settings/cluster/cockroach_versions.go
@@ -39,6 +39,7 @@ const (
VersionQueryTxnTimestamp
VersionStickyBit
VersionParallelCommits
VersionLearnerReplicas
VersionGenerationComparable

// Add new versions here (step one of two).
@@ -483,6 +484,11 @@ var versionsSingleton = keyedVersions([]keyedVersion{
Key: VersionGenerationComparable,
Version: roachpb.Version{Major: 19, Minor: 1, Unstable: 5},
},
{
// VersionLearnerReplicas is https://github.com/cockroachdb/cockroach/pull/38149.
Key: VersionLearnerReplicas,
Version: roachpb.Version{Major: 19, Minor: 1, Unstable: 6},
},

// Add new versions here (step two of two).

7 changes: 4 additions & 3 deletions pkg/settings/cluster/versionkey_string.go

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pkg/sql/ambiguous_commit_test.go
@@ -158,7 +158,7 @@ func TestAmbiguousCommit(t *testing.T) {
tableStartKey.Store(keys.MakeTablePrefix(tableID))

// Wait for new table to split & replication.
if err := tc.WaitForSplitAndReplication(tableStartKey.Load().([]byte)); err != nil {
if err := tc.WaitForSplitAndInitialization(tableStartKey.Load().([]byte)); err != nil {
t.Fatal(err)
}

30 changes: 20 additions & 10 deletions pkg/sql/crdb_internal.go
@@ -1715,6 +1715,7 @@ CREATE TABLE crdb_internal.ranges_no_leases (
table_name STRING NOT NULL,
index_name STRING NOT NULL,
replicas INT[] NOT NULL,
non_quorum_replicas INT[] NOT NULL,
split_enforced_until TIMESTAMP
)
`,
@@ -1768,17 +1769,25 @@
return nil, err
}

// TODO(dan): We're trying to treat learners as a far-behind replica as
// much as possible, so just include them in the list of replicas. We can
// add a separate column for them if we get feedback about it.
var replicas []int
var voterReplicas, nonVoterReplicas []int
for _, rd := range desc.Replicas().All() {
replicas = append(replicas, int(rd.StoreID))
if rd.GetType() == roachpb.ReplicaType_VOTER {
voterReplicas = append(voterReplicas, int(rd.StoreID))
} else {
nonVoterReplicas = append(nonVoterReplicas, int(rd.StoreID))
}
}
sort.Ints(voterReplicas)
sort.Ints(nonVoterReplicas)
votersArr := tree.NewDArray(types.Int)
for _, replica := range voterReplicas {
if err := votersArr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
return nil, err
}
}
sort.Ints(replicas)
arr := tree.NewDArray(types.Int)
for _, replica := range replicas {
if err := arr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
nonVotersArr := tree.NewDArray(types.Int)
for _, replica := range nonVoterReplicas {
if err := nonVotersArr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
return nil, err
}
}
@@ -1811,7 +1820,8 @@ CREATE TABLE crdb_internal.ranges_no_leases (
tree.NewDString(dbName),
tree.NewDString(tableName),
tree.NewDString(indexName),
arr,
votersArr,
nonVotersArr,
splitEnforcedUntil,
}, nil
}, nil
46 changes: 45 additions & 1 deletion pkg/storage/allocator.go
@@ -47,6 +47,7 @@ const (
minReplicaWeight = 0.001

// priorities for various repair operations.
removeLearnerReplicaPriority float64 = 12001
addDeadReplacementPriority float64 = 12000
addMissingReplicaPriority float64 = 10000
addDecommissioningReplacementPriority float64 = 5000
@@ -98,6 +99,7 @@ const (
AllocatorAdd
AllocatorRemoveDead
AllocatorRemoveDecommissioning
AllocatorRemoveLearner
AllocatorConsiderRebalance
)

@@ -107,6 +109,7 @@ var allocatorActionNames = map[AllocatorAction]string{
AllocatorAdd: "add",
AllocatorRemoveDead: "remove dead",
AllocatorRemoveDecommissioning: "remove decommissioning",
AllocatorRemoveLearner: "remove learner",
AllocatorConsiderRebalance: "consider rebalance",
}

@@ -294,8 +297,49 @@ func (a *Allocator) ComputeAction(
// Do nothing if storePool is nil for some unittests.
return AllocatorNoop, 0
}
// TODO(mrtracy): Handle non-homogeneous and mismatched attribute sets.

// Seeing a learner replica at this point is unexpected because learners are a
// short-lived (ish) transient state in a learner+snapshot+voter cycle, which
// is always done atomically. Only two places could have added a learner: the
// replicate queue or AdminChangeReplicas request.
//
// The replicate queue only operates on leaseholders, which means that only
// one node at a time is operating on a given range except in rare cases (old
// leaseholder could start the operation, and a new leaseholder steps up and
// also starts an overlapping operation). Combined with the above atomicity,
// this means that if the replicate queue sees a learner, either the node that
// was adding it crashed somewhere in the learner+snapshot+voter cycle and
// we're the new leaseholder or we caught a race.
//
// In the first case, we could assume the node that was adding it knew what it
// was doing and finish the addition. Or we could leave it and do higher
// priority operations first if there are any. However, this comes with code
// complexity and concept complexity (computing old vs new quorum sizes
// becomes ambiguous, the learner isn't in the quorum but it likely will be
// soon, so do you count it?). Instead, we do the simplest thing and remove it
// before doing any other operations to the range. We'll revisit this decision
// if and when the complexity becomes necessary.
//
// If we get the race where AdminChangeReplicas is adding a replica and the
// queue happens to run during the snapshot, this will remove the learner and
// AdminChangeReplicas will notice either during the snapshot transfer or when
// it tries to promote the learner to a voter. AdminChangeReplicas should
// retry.
//
// On the other hand if we get the race where a leaseholder starts adding a
// replica in the replicate queue and during this loses its lease, it should
// probably not retry.
if learners := rangeInfo.Desc.Replicas().Learners(); len(learners) > 0 {
// TODO(dan): Since this goes before anything else, the priority here should
// be influenced by whatever operations would happen right after the learner
// is removed. In the meantime, we don't want to block something important
// from happening (like addDeadReplacementPriority) by queueing this at a
// low priority so until this TODO is done, keep
// removeLearnerReplicaPriority as the highest priority.
return AllocatorRemoveLearner, removeLearnerReplicaPriority
}

// TODO(mrtracy): Handle non-homogeneous and mismatched attribute sets.
have := len(rangeInfo.Desc.Replicas().Unwrap())
decommissioningReplicas := a.storePool.decommissioningReplicas(
rangeInfo.Desc.RangeID, rangeInfo.Desc.Replicas().Unwrap())
35 changes: 35 additions & 0 deletions pkg/storage/allocator_test.go
@@ -43,6 +43,7 @@ import (
"github.com/gogo/protobuf/proto"
"github.com/olekukonko/tablewriter"
"github.com/pkg/errors"
"github.com/stretchr/testify/require"
"go.etcd.io/etcd/raft"
)

@@ -4769,6 +4770,40 @@ func TestAllocatorComputeActionDecommission(t *testing.T) {
}
}

func TestAllocatorRemoveLearner(t *testing.T) {
defer leaktest.AfterTest(t)()

zone := config.ZoneConfig{
NumReplicas: proto.Int32(3),
}
learnerType := roachpb.ReplicaType_LEARNER
rangeWithLearnerDesc := roachpb.RangeDescriptor{
InternalReplicas: []roachpb.ReplicaDescriptor{
{
StoreID: 1,
NodeID: 1,
ReplicaID: 1,
},
{
StoreID: 2,
NodeID: 2,
ReplicaID: 2,
Type: &learnerType,
},
},
}

// Removing a learner is prioritized over adding a new replica to an under
// replicated range.
stopper, _, sp, a, _ := createTestAllocator(10, false /* deterministic */)
ctx := context.Background()
defer stopper.Stop(ctx)
live, dead := []roachpb.StoreID{1, 2}, []roachpb.StoreID{3}
mockStorePool(sp, live, nil, dead, nil, nil)
action, _ := a.ComputeAction(ctx, &zone, RangeInfo{Desc: &rangeWithLearnerDesc})
require.Equal(t, AllocatorRemoveLearner, action)
}

func TestAllocatorComputeActionDynamicNumReplicas(t *testing.T) {
defer leaktest.AfterTest(t)()

12 changes: 12 additions & 0 deletions pkg/storage/batcheval/cmd_lease.go
@@ -21,6 +21,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
"github.com/cockroachdb/cockroach/pkg/storage/spanset"
"github.com/cockroachdb/cockroach/pkg/storage/storagepb"
"github.com/cockroachdb/errors"
)

func declareKeysRequestLease(
@@ -41,6 +42,17 @@ func newFailedLeaseTrigger(isTransfer bool) result.Result {
return trigger
}

func checkNotLearnerReplica(rec EvalContext) error {
repDesc, ok := rec.Desc().GetReplicaDescriptor(rec.StoreID())
if !ok {
return errors.AssertionFailedf(
`could not find replica for store %s in %s`, rec.StoreID(), rec.Desc())
} else if t := repDesc.GetType(); t == roachpb.ReplicaType_LEARNER {
return errors.Errorf(`cannot transfer lease to replica of type %s`, t)
}
return nil
}

// evalNewLease checks that the lease contains a valid interval and that
// the new lease holder is still a member of the replica set, and then proceeds
// to write the new lease to the batch, emitting an appropriate trigger.