storage: use raft learners in replica addition, defaulted off
A learner is a participant in a raft group that accepts messages but doesn't
vote. This means it doesn't affect raft quorum and thus doesn't affect
the fragility of the range, even if it's very far behind or many
learners are down.

At the time of writing, learners are used in CockroachDB as an interim
state while adding a replica. A learner replica is added to the range
via raft ConfChange, a raft snapshot (of type LEARNER) is sent to catch
it up, and then a second ConfChange promotes it to a full replica.
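
To make the shape of that cycle concrete, here is a minimal sketch in terms of
the etcd/raft ConfChange types this change builds on. The raftpb types are
real; proposeConfChange and sendLearnerSnapshot are hypothetical stand-ins for
the coordinator's plumbing, not functions from this commit.

package replication

import "go.etcd.io/etcd/raft/raftpb"

// addReplicaViaLearner sketches the two-ConfChange replica addition. The gaps
// between the steps are exactly where a coordinator crash can leave an
// orphaned learner behind.
func addReplicaViaLearner(
    target uint64,
    proposeConfChange func(raftpb.ConfChange) error,
    sendLearnerSnapshot func(nodeID uint64) error,
) error {
    // 1. Add the target as a non-voting learner; quorum size is unchanged.
    if err := proposeConfChange(raftpb.ConfChange{
        Type:   raftpb.ConfChangeAddLearnerNode,
        NodeID: target,
    }); err != nil {
        return err
    }
    // 2. Catch the learner up with a raft snapshot (of type LEARNER).
    if err := sendLearnerSnapshot(target); err != nil {
        return err
    }
    // 3. Promote: proposing the same node again as a voter turns the learner
    // into a full member of the raft group.
    return proposeConfChange(raftpb.ConfChange{
        Type:   raftpb.ConfChangeAddNode,
        NodeID: target,
    })
}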

This means that learners are currently always expected to have a short
lifetime, approximately the time it takes to send a snapshot. Ideas have
been kicked around to use learners with follower reads, which could be a
cheap way to allow many geographies to have local reads without
affecting write latencies. If implemented, these learners would have
long lifetimes.

For simplicity, CockroachDB treats learner replicas the same as voter
replicas as much as possible, but there are a few exceptions:

- Learner replicas are not considered when calculating quorum size, and
  thus do not affect the computation of which ranges are
  under-replicated for upreplication/alerting/debug/etc purposes. Ditto
  for over-replicated.
- Learner replicas cannot become raft leaders, so we also don't allow
  them to become leaseholders. As a result, DistSender and the various
  oracles don't try to send them traffic.
- The raft snapshot queue does not send snapshots to learners for
  reasons described below.
- Merges won't run while a learner replica is present.

Replicas are now added in two ConfChange transactions. The first creates
the learner and the second promotes it to a voter. If the node that is
coordinating this dies in the middle, we're left with an orphaned
learner. For this reason, the replicate queue always first removes any
learners it sees before doing anything else. We could instead try to
finish off the learner snapshot and promotion, but this is more
complicated and it's not yet clear the efficiency win is worth it.

This introduces some rare races, either between the replicate queue and
AdminChangeReplicas or when a range's lease is moved to a new owner while
the old leaseholder is still processing the range in the replicate queue.
These races are handled by retrying if a learner disappears during the
snapshot/promotion.
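
As a sketch of that retry, under the assumption of a sentinel error
(errLearnerRemoved and the add closure are illustrative, not identifiers from
this commit):

package replication

import "errors"

// errLearnerRemoved stands in for "the learner we were working on is no longer
// in the range descriptor", e.g. because another node's replicate queue
// removed it out from under us.
var errLearnerRemoved = errors.New("learner replica removed mid-addition")

// addWithRetry restarts the whole learner+snapshot+voter cycle if the learner
// vanishes, and surfaces any other error (or success) to the caller.
func addWithRetry(add func() error) error {
    for {
        if err := add(); err != errLearnerRemoved {
            return err
        }
        // Raced with the replicate queue (or a new leaseholder); start over.
    }
}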

If the coordinator otherwise encounters an error while sending the
learner snapshot or promoting it (which can happen for a number of
reasons, including the node receiving the learner going away), it tries
to clean up after itself by rolling back the addition of the learner.
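
A sketch of that cleanup path, with hypothetical closures standing in for the
real steps (none of these names come from the commit):

package replication

// runLearnerCycle shows the rollback: if sending the snapshot or promoting the
// learner fails, the coordinator removes the learner it added rather than
// leaving it for the replicate queue to garbage-collect later.
func runLearnerCycle(addLearner, snapshotAndPromote, rollbackLearner func() error) error {
    if err := addLearner(); err != nil {
        return err // nothing to clean up yet.
    }
    if err := snapshotAndPromote(); err != nil {
        // Best-effort rollback; if this also fails (or the node crashes), the
        // replicate queue eventually removes the orphaned learner.
        _ = rollbackLearner()
        return err
    }
    return nil
}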

There is another race between the learner snapshot being sent and the
raft snapshot queue happening to check the replica at the same time,
also sending it a snapshot. This is safe but wasteful, so the raft
snapshot queue won't try to send snapshots to learners.
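
The guard can be as simple as this sketch (shouldQueueRaftSnapshot is a
hypothetical wrapper; GetType and ReplicaType_LEARNER are the ReplicaDescriptor
accessors used elsewhere in this diff):

package replication

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// shouldQueueRaftSnapshot sketches the learner check: the coordinator that
// added the learner is already sending it a snapshot, so queueing another one
// here would only duplicate that work.
func shouldQueueRaftSnapshot(rd roachpb.ReplicaDescriptor) bool {
    return rd.GetType() != roachpb.ReplicaType_LEARNER
}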

Merges are blocked if either side has a learner (to avoid working out
the edge cases) but it's historically turned out to be a bad idea to get
in the way of splits, so we allow them even when some of the replicas
are learners. This orphans a learner on each side of the split (the
original coordinator will not be able to finish either of them), but the
replication queue will eventually clean them up.
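
A sketch of the merge-side guard (checkCanMerge is hypothetical; Learners() is
the ReplicaDescriptors accessor documented in the metadata_replicas.go diff
below):

package replication

import (
    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/pkg/errors"
)

// checkCanMerge sketches the policy: refuse to merge while either side of the
// merge has a learner, but place no such restriction on splits.
func checkCanMerge(lhs, rhs *roachpb.RangeDescriptor) error {
    if len(lhs.Replicas().Learners()) > 0 || len(rhs.Replicas().Learners()) > 0 {
        return errors.Errorf("cannot merge ranges with learner replicas: %s, %s", lhs, rhs)
    }
    return nil
}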

Learner replicas don't affect quorum but they do affect the system in
other ways. The most obvious way is that the leader sends them the raft
traffic it would send to any follower, consuming resources. More
surprising is that once the learner has received a snapshot, it's
considered by the quota pool that prevents the raft leader from getting
too far ahead of the followers. This is because a learner (especially
one that already has a snapshot) is expected to very soon be a voter, so
we treat it like one. However, it means a slow learner can slow down
regular traffic, which is possibly counterintuitive.
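
A toy sketch of the quota pool's view (the function and its inputs are purely
illustrative, not the actual quota pool code):

package replication

import "math"

// slowestTrackedIndex illustrates why a slow learner can throttle writes: the
// leader may only run ahead of the minimum acknowledged log index across the
// voters plus any learner that has already received its snapshot.
func slowestTrackedIndex(voterMatch, learnerMatch []uint64) uint64 {
    min := uint64(math.MaxUint64)
    for _, idx := range voterMatch {
        if idx < min {
            min = idx
        }
    }
    for _, idx := range learnerMatch {
        if idx == 0 {
            continue // learner hasn't received its snapshot yet; ignored.
        }
        if idx < min {
            min = idx
        }
    }
    return min
}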

Release note (general change): Replicas are now added using a raft
learner and going through the normal raft snapshot process to catch them
up, eliminating technical debt. No user-facing changes are expected.
danhhz committed Jul 15, 2019
1 parent 4857fa5 commit 2f483d3
Showing 27 changed files with 1,189 additions and 105 deletions.
3 changes: 2 additions & 1 deletion docs/generated/settings/settings.html
@@ -38,6 +38,7 @@
<tr><td><code>kv.closed_timestamp.target_duration</code></td><td>duration</td><td><code>30s</code></td><td>if nonzero, attempt to provide closed timestamp notifications for timestamps trailing cluster time by approximately this duration</td></tr>
<tr><td><code>kv.follower_read.target_multiple</code></td><td>float</td><td><code>3</code></td><td>if above 1, encourages the distsender to perform a read against the closest replica if a request is older than kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * this) less a clock uncertainty interval. This value also is used to create follower_timestamp(). (WARNING: may compromise cluster stability or correctness; do not edit without supervision)</td></tr>
<tr><td><code>kv.import.batch_size</code></td><td>byte size</td><td><code>32 MiB</code></td><td>the maximum size of the payload in an AddSSTable request (WARNING: may compromise cluster stability or correctness; do not edit without supervision)</td></tr>
<tr><td><code>kv.learner_replicas.enabled</code></td><td>boolean</td><td><code>false</code></td><td>use learner replicas for replica addition</td></tr>
<tr><td><code>kv.raft.command.max_size</code></td><td>byte size</td><td><code>64 MiB</code></td><td>maximum size of a raft command</td></tr>
<tr><td><code>kv.raft_log.disable_synchronization_unsafe</code></td><td>boolean</td><td><code>false</code></td><td>set to true to disable synchronization on Raft log writes to persistent storage. Setting to true risks data loss or data corruption on server crashes. The setting is meant for internal testing only and SHOULD NOT be used in production.</td></tr>
<tr><td><code>kv.range.backpressure_range_size_multiplier</code></td><td>float</td><td><code>2</code></td><td>multiple of range_max_bytes that a range is allowed to grow to without splitting before writes to that range are blocked, or 0 to disable</td></tr>
@@ -120,6 +121,6 @@
<tr><td><code>trace.debug.enable</code></td><td>boolean</td><td><code>false</code></td><td>if set, traces for recent requests can be seen in the /debug page</td></tr>
<tr><td><code>trace.lightstep.token</code></td><td>string</td><td><code></code></td><td>if set, traces go to Lightstep using this token</td></tr>
<tr><td><code>trace.zipkin.collector</code></td><td>string</td><td><code></code></td><td>if set, traces go to the given Zipkin instance (example: '127.0.0.1:9411'); ignored if trace.lightstep.token is set</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-5</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
<tr><td><code>version</code></td><td>custom validation</td><td><code>19.1-6</code></td><td>set the active cluster version in the format '<major>.<minor>'</td></tr>
</tbody>
</table>
14 changes: 14 additions & 0 deletions pkg/roachpb/metadata.go
@@ -270,6 +270,20 @@ func (r ReplicationTarget) String() string {
return fmt.Sprintf("n%d,s%d", r.NodeID, r.StoreID)
}

// ReplicaTypeLearner returns a ReplicaType_LEARNER pointer suitable for use in
// a nullable proto field.
func ReplicaTypeLearner() *ReplicaType {
t := ReplicaType_LEARNER
return &t
}

// ReplicaTypeVoter returns a ReplicaType_VOTER pointer suitable for use in a
// nullable proto field.
func ReplicaTypeVoter() *ReplicaType {
t := ReplicaType_VOTER
return &t
}

func (r ReplicaDescriptor) String() string {
var buf bytes.Buffer
fmt.Fprintf(&buf, "(n%d,s%d):", r.NodeID, r.StoreID)
80 changes: 78 additions & 2 deletions pkg/roachpb/metadata_replicas.go
@@ -10,7 +10,10 @@

package roachpb

import "sort"
import (
"fmt"
"sort"
)

// ReplicaDescriptors is a set of replicas, usually the nodes/stores on which
// replicas of a range are stored.
@@ -54,6 +57,77 @@ func (d ReplicaDescriptors) Voters() []ReplicaDescriptor {
}

// Learners returns the learner replicas in the set.
//
// A learner is a participant in a raft group that accepts messages but doesn't
// vote. This means it doesn't affect raft quorum and thus doesn't affect the
// fragility of the range, even if it's very far behind or many learners are
// down.
//
// At the time of writing, learners are used in CockroachDB as an interim state
// while adding a replica. A learner replica is added to the range via raft
// ConfChange, a raft snapshot (of type LEARNER) is sent to catch it up, and
// then a second ConfChange promotes it to a full replica.
//
// This means that learners are currently always expected to have a short
// lifetime, approximately the time it takes to send a snapshot. Ideas have been
// kicked around to use learners with follower reads, which could be a cheap way
// to allow many geographies to have local reads without affecting write
// latencies. If implemented, these learners would have long lifetimes.
//
// For simplicity, CockroachDB treats learner replicas the same as voter
// replicas as much as possible, but there are a few exceptions:
//
// - Learner replicas are not considered when calculating quorum size, and thus
// do not affect the computation of which ranges are under-replicated for
// upreplication/alerting/debug/etc purposes. Ditto for over-replicated.
// - Learner replicas cannot become raft leaders, so we also don't allow them to
// become leaseholders. As a result, DistSender and the various oracles don't
// try to send them traffic.
// - The raft snapshot queue does not send snapshots to learners for reasons
// described below.
// - Merges won't run while a learner replica is present.
//
// Replicas are now added in two ConfChange transactions. The first creates the
// learner and the second promotes it to a voter. If the node that is
// coordinating this dies in the middle, we're left with an orphaned learner.
// For this reason, the replicate queue always first removes any learners it
// sees before doing anything else. We could instead try to finish off the
// learner snapshot and promotion, but this is more complicated and it's not yet
// clear the efficiency win is worth it.
//
// This introduces some rare races, either between the replicate queue and
// AdminChangeReplicas or when a range's lease is moved to a new owner while the
// old leaseholder is still processing the range in the replicate queue. These
// races are handled by retrying if a learner disappears during the
// snapshot/promotion.
//
// If the coordinator otherwise encounters an error while sending the learner
// snapshot or promoting it (which can happen for a number of reasons, including
// the node receiving the learner going away), it tries to clean up after itself
// by rolling back the addition of the learner.
//
// There is another race between the learner snapshot being sent and the raft
// snapshot queue happening to check the replica at the same time, also sending
// it a snapshot. This is safe but wasteful, so the raft snapshot queue won't
// try to send snapshots to learners.
//
// Merges are blocked if either side has a learner (to avoid working out the
// edge cases) but it's historically turned out to be a bad idea to get in the
// way of splits, so we allow them even when some of the replicas are learners.
// This orphans a learner on each side of the split (the original coordinator
// will not be able to finish either of them), but the replication queue will
// eventually clean them up.
//
// Learner replicas don't affect quorum but they do affect the system in other
// ways. The most obvious way is that the leader sends them the raft traffic it
// would send to any follower, consuming resources. More surprising is that once
// the learner has received a snapshot, it's considered by the quota pool that
// prevents the raft leader from getting too far ahead of the followers. This is
// because a learner (especially one that already has a snapshot) is expected to
// very soon be a voter, so we treat it like one. However, it means a slow
// learner can slow down regular traffic, which is possibly counterintuitive.
//
// For some related mega-comments, see Replica.sendSnapshot.
func (d ReplicaDescriptors) Learners() []ReplicaDescriptor {
// Note that the wrapped replicas are sorted first by type.
for i := range d.wrapped {
@@ -64,7 +138,9 @@ func (d ReplicaDescriptors) Learners() []ReplicaDescriptor {
return nil
}

var _, _ = ReplicaDescriptors.All, ReplicaDescriptors.Learners
func (d ReplicaDescriptors) String() string {
return fmt.Sprintf("%s", d.AsProto())
}

// AsProto returns the protobuf representation of these replicas, suitable for
// setting the InternalReplicas field of a RangeDescriptor. When possible the
6 changes: 6 additions & 0 deletions pkg/settings/cluster/cockroach_versions.go
@@ -39,6 +39,7 @@ const (
VersionQueryTxnTimestamp
VersionStickyBit
VersionParallelCommits
VersionLearnerReplicas
VersionGenerationComparable

// Add new versions here (step one of two).
@@ -483,6 +484,11 @@ var versionsSingleton = keyedVersions([]keyedVersion{
Key: VersionGenerationComparable,
Version: roachpb.Version{Major: 19, Minor: 1, Unstable: 5},
},
{
// VersionLearnerReplicas is https://github.com/cockroachdb/cockroach/pull/38149.
Key: VersionLearnerReplicas,
Version: roachpb.Version{Major: 19, Minor: 1, Unstable: 6},
},

// Add new versions here (step two of two).

7 changes: 4 additions & 3 deletions pkg/settings/cluster/versionkey_string.go

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pkg/sql/ambiguous_commit_test.go
@@ -158,7 +158,7 @@ func TestAmbiguousCommit(t *testing.T) {
tableStartKey.Store(keys.MakeTablePrefix(tableID))

// Wait for new table to split & replication.
if err := tc.WaitForSplitAndReplication(tableStartKey.Load().([]byte)); err != nil {
if err := tc.WaitForSplitAndInitialization(tableStartKey.Load().([]byte)); err != nil {
t.Fatal(err)
}

30 changes: 20 additions & 10 deletions pkg/sql/crdb_internal.go
@@ -1715,6 +1715,7 @@ CREATE TABLE crdb_internal.ranges_no_leases (
table_name STRING NOT NULL,
index_name STRING NOT NULL,
replicas INT[] NOT NULL,
non_quorum_replicas INT[] NOT NULL,
split_enforced_until TIMESTAMP
)
`,
@@ -1768,17 +1769,25 @@
return nil, err
}

// TODO(dan): We're trying to treat learners as a far-behind replica as
// much as possible, so just include them in the list of replicas. We can
// add a separate column for them if we get feedback about it.
var replicas []int
var voterReplicas, nonVoterReplicas []int
for _, rd := range desc.Replicas().All() {
replicas = append(replicas, int(rd.StoreID))
if rd.GetType() == roachpb.ReplicaType_VOTER {
voterReplicas = append(voterReplicas, int(rd.StoreID))
} else {
nonVoterReplicas = append(nonVoterReplicas, int(rd.StoreID))
}
}
sort.Ints(voterReplicas)
sort.Ints(nonVoterReplicas)
votersArr := tree.NewDArray(types.Int)
for _, replica := range voterReplicas {
if err := votersArr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
return nil, err
}
}
sort.Ints(replicas)
arr := tree.NewDArray(types.Int)
for _, replica := range replicas {
if err := arr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
nonVotersArr := tree.NewDArray(types.Int)
for _, replica := range nonVoterReplicas {
if err := nonVotersArr.Append(tree.NewDInt(tree.DInt(replica))); err != nil {
return nil, err
}
}
@@ -1811,7 +1820,8 @@ CREATE TABLE crdb_internal.ranges_no_leases (
tree.NewDString(dbName),
tree.NewDString(tableName),
tree.NewDString(indexName),
arr,
votersArr,
nonVotersArr,
splitEnforcedUntil,
}, nil
}, nil
46 changes: 45 additions & 1 deletion pkg/storage/allocator.go
@@ -47,6 +47,7 @@ const (
minReplicaWeight = 0.001

// priorities for various repair operations.
removeLearnerReplicaPriority float64 = 12001
addDeadReplacementPriority float64 = 12000
addMissingReplicaPriority float64 = 10000
addDecommissioningReplacementPriority float64 = 5000
@@ -98,6 +99,7 @@ const (
AllocatorAdd
AllocatorRemoveDead
AllocatorRemoveDecommissioning
AllocatorRemoveLearner
AllocatorConsiderRebalance
)

@@ -107,6 +109,7 @@ var allocatorActionNames = map[AllocatorAction]string{
AllocatorAdd: "add",
AllocatorRemoveDead: "remove dead",
AllocatorRemoveDecommissioning: "remove decommissioning",
AllocatorRemoveLearner: "remove learner",
AllocatorConsiderRebalance: "consider rebalance",
}

@@ -294,8 +297,49 @@ func (a *Allocator) ComputeAction(
// Do nothing if storePool is nil for some unittests.
return AllocatorNoop, 0
}
// TODO(mrtracy): Handle non-homogeneous and mismatched attribute sets.

// Seeing a learner replica at this point is unexpected because learners are a
// short-lived (ish) transient state in a learner+snapshot+voter cycle, which
// is always done atomically. Only two places could have added a learner: the
// replicate queue or AdminChangeReplicas request.
//
// The replicate queue only operates on leaseholders, which means that only
// one node at a time is operating on a given range except in rare cases (old
// leaseholder could start the operation, and a new leaseholder steps up and
// also starts an overlapping operation). Combined with the above atomicity,
// this means that if the replicate queue sees a learner, either the node that
// was adding it crashed somewhere in the learner+snapshot+voter cycle and
// we're the new leaseholder or we caught a race.
//
// In the first case, we could assume the node that was adding it knew what it
// was doing and finish the addition. Or we could leave it and do higher
// priority operations first if there are any. However, this comes with code
// complexity and concept complexity (computing old vs new quorum sizes
// becomes ambiguous, the learner isn't in the quorum but it likely will be
// soon, so do you count it?). Instead, we do the simplest thing and remove it
// before doing any other operations to the range. We'll revisit this decision
// if and when the complexity becomes necessary.
//
// If we get the race where AdminChangeReplicas is adding a replica and the
// queue happens to run during the snapshot, this will remove the learner and
// AdminChangeReplicas will notice either during the snapshot transfer or when
// it tries to promote the learner to a voter. AdminChangeReplicas should
// retry.
//
// On the other hand if we get the race where a leaseholder starts adding a
// replica in the replicate queue and during this loses its lease, it should
// probably not retry.
if learners := rangeInfo.Desc.Replicas().Learners(); len(learners) > 0 {
// TODO(dan): Since this goes before anything else, the priority here should
// be influenced by whatever operations would happen right after the learner
// is removed. In the meantime, we don't want to block something important
// from happening (like addDeadReplacementPriority) by queueing this at a
// low priority so until this TODO is done, keep
// removeLearnerReplicaPriority as the highest priority.
return AllocatorRemoveLearner, removeLearnerReplicaPriority
}

// TODO(mrtracy): Handle non-homogeneous and mismatched attribute sets.
have := len(rangeInfo.Desc.Replicas().Unwrap())
decommissioningReplicas := a.storePool.decommissioningReplicas(
rangeInfo.Desc.RangeID, rangeInfo.Desc.Replicas().Unwrap())
35 changes: 35 additions & 0 deletions pkg/storage/allocator_test.go
@@ -43,6 +43,7 @@ import (
"github.com/gogo/protobuf/proto"
"github.com/olekukonko/tablewriter"
"github.com/pkg/errors"
"github.com/stretchr/testify/require"
"go.etcd.io/etcd/raft"
)

@@ -4769,6 +4770,40 @@ func TestAllocatorComputeActionDecommission(t *testing.T) {
}
}

func TestAllocatorRemoveLearner(t *testing.T) {
defer leaktest.AfterTest(t)()

zone := config.ZoneConfig{
NumReplicas: proto.Int32(3),
}
learnerType := roachpb.ReplicaType_LEARNER
rangeWithLearnerDesc := roachpb.RangeDescriptor{
InternalReplicas: []roachpb.ReplicaDescriptor{
{
StoreID: 1,
NodeID: 1,
ReplicaID: 1,
},
{
StoreID: 2,
NodeID: 2,
ReplicaID: 2,
Type: &learnerType,
},
},
}

// Removing a learner is prioritized over adding a new replica to an under
// replicated range.
stopper, _, sp, a, _ := createTestAllocator(10, false /* deterministic */)
ctx := context.Background()
defer stopper.Stop(ctx)
live, dead := []roachpb.StoreID{1, 2}, []roachpb.StoreID{3}
mockStorePool(sp, live, nil, dead, nil, nil)
action, _ := a.ComputeAction(ctx, &zone, RangeInfo{Desc: &rangeWithLearnerDesc})
require.Equal(t, AllocatorRemoveLearner, action)
}

func TestAllocatorComputeActionDynamicNumReplicas(t *testing.T) {
defer leaktest.AfterTest(t)()

12 changes: 12 additions & 0 deletions pkg/storage/batcheval/cmd_lease.go
@@ -21,6 +21,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
"github.com/cockroachdb/cockroach/pkg/storage/spanset"
"github.com/cockroachdb/cockroach/pkg/storage/storagepb"
"github.com/cockroachdb/errors"
)

func declareKeysRequestLease(
@@ -41,6 +42,17 @@ func newFailedLeaseTrigger(isTransfer bool) result.Result {
return trigger
}

func checkNotLearnerReplica(rec EvalContext) error {
repDesc, ok := rec.Desc().GetReplicaDescriptor(rec.StoreID())
if !ok {
return errors.AssertionFailedf(
`could not find replica for store %s in %s`, rec.StoreID(), rec.Desc())
} else if t := repDesc.GetType(); t == roachpb.ReplicaType_LEARNER {
return errors.Errorf(`cannot transfer lease to replica of type %s`, t)
}
return nil
}

// evalNewLease checks that the lease contains a valid interval and that
// the new lease holder is still a member of the replica set, and then proceeds
// to write the new lease to the batch, emitting an appropriate trigger.