storage: use raft learners in replica addition, defaulted off #38149

Merged
merged 2 commits on Jul 20, 2019

Conversation

danhhz
Contributor

@danhhz danhhz commented Jun 13, 2019

A learner is a participant in a raft group that accepts messages but doesn't
vote. This means it doesn't affect raft quorum and thus doesn't affect
the fragility of the range, even if it's very far behind or many
learners are down.

At the time of writing, learners are used in CockroachDB as an interim
state while adding a replica. A learner replica is added to the range
via raft ConfChange, a raft snapshot (of type LEARNER) is sent to catch
it up, and then a second ConfChange promotes it to a full replica.

This means that learners are currently always expected to have a short
lifetime, approximately the time it takes to send a snapshot. Ideas have
been kicked around to use learners with follower reads, which could be a
cheap way to allow many geographies to have local reads without
affecting write latencies. If implemented, these learners would have
long lifetimes.

For simplicity, CockroachDB treats learner replicas the same as voter
replicas as much as possible, but there are a few exceptions:

  • Learner replicas are not considered when calculating quorum size, and
    thus do not affect the computation of which ranges are
    under-replicated for upreplication/alerting/debug/etc purposes. Ditto
    for over-replicated.
  • Learner replicas cannot become raft leaders, so we also don't allow
    them to become leaseholders. As a result, DistSender and the various
    oracles don't try to send them traffic.
  • The raft snapshot queue does not send snapshots to learners for
    reasons described below.
  • Merges won't run while a learner replica is present.

Replicas are now added in two ConfChange transactions. The first creates
the learner and the second promotes it to a voter. If the node that is
coordinating this dies in the middle, we're left with an orphaned
learner. For this reason, the replicate queue always first removes any
learners it sees before doing anything else. We could instead try to
finish off the learner snapshot and promotion, but this is more
complicated and it's not yet clear the efficiency win is worth it.

This introduces some rare races between the replicate queue and
AdminChangeReplicas or if a range's lease is moved to a new owner while
the old leaseholder is still processing it in the replicate queue. These
races are handled by retrying if a learner disappears during the
snapshot/promotion.

If the coordinator otherwise encounters an error while sending the
learner snapshot or promoting it (which can happen for a number of
reasons, including the node receiving the learner going away), it tries to
clean up after itself by rolling back the addition of the learner.
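
To make the flow in the preceding two paragraphs concrete, here is a minimal, self-contained Go sketch of the coordinator's control flow only. The types and the `sendSnapshot` hook are toy stand-ins for the real descriptor, ConfChange transactions, and snapshot machinery; none of this is the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for the range descriptor and replica types; the real ones
// live in pkg/roachpb. Only the control flow is modeled here.
type replica struct {
	storeID int
	learner bool
}

type rangeDesc struct {
	replicas []replica
}

// addLearner models the first ConfChange txn: the target joins as a learner.
func addLearner(d *rangeDesc, storeID int) {
	d.replicas = append(d.replicas, replica{storeID: storeID, learner: true})
}

// promoteLearner models the LEARNER snapshot followed by the second
// ConfChange txn that turns the learner into a voter. sendSnapshot is a
// hypothetical hook standing in for the real snapshot send.
func promoteLearner(d *rangeDesc, storeID int, sendSnapshot func() error) error {
	if err := sendSnapshot(); err != nil {
		return err
	}
	for i := range d.replicas {
		if d.replicas[i].storeID == storeID {
			d.replicas[i].learner = false
		}
	}
	return nil
}

// rollbackLearner models the cleanup: if promotion fails, the coordinator
// removes the learner it added; if even that fails, the orphaned learner is
// left for the replicate queue.
func rollbackLearner(d *rangeDesc, storeID int) {
	kept := d.replicas[:0]
	for _, r := range d.replicas {
		if r.storeID != storeID {
			kept = append(kept, r)
		}
	}
	d.replicas = kept
}

func main() {
	d := &rangeDesc{replicas: []replica{{storeID: 1}}}
	addLearner(d, 2)
	err := promoteLearner(d, 2, func() error {
		return errors.New("target node went away mid-snapshot")
	})
	if err != nil {
		rollbackLearner(d, 2)
	}
	fmt.Printf("%+v\n", d) // back to just the original voter
}
```

The real coordinator additionally retries when the learner disappears out from under it, as the next paragraphs describe.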

There is another race between the learner snapshot being sent and the
raft snapshot queue happening to check the replica at the same time,
also sending it a snapshot. This is safe but wasteful, so the raft
snapshot queue won't try to send snapshots to learners.

Merges are blocked if either side has a learner (to avoid working out
the edge cases) but it's historically turned out to be a bad idea to get
in the way of splits, so we allow them even when some of the replicas
are learners. This orphans a learner on each side of the split (the
original coordinator will not be able to finish either of them), but the
replication queue will eventually clean them up.

Learner replicas don't affect quorum but they do affect the system in
other ways. The most obvious way is that the leader sends them the raft
traffic it would send to any follower, consuming resources. More
surprising is that once the learner has received a snapshot, it's
considered by the quota pool that prevents the raft leader from getting
too far ahead of the followers. This is because a learner (especially
one that already has a snapshot) is expected to very soon be a voter, so
we treat it like one. However, it means a slow learner can slow down
regular traffic, which is possibly counterintuitive.

Release note (general change): Replicas are now added using a raft
learner and going through the normal raft snapshot process to catch them
up, eliminating technical debt. No user facing changes are expected.

@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment

Looks good so far! I left some comments but the biggest question seems to be whether we should revisit the decision to "roll back" learners. I still suspect that it's ultimately simpler to just remove them, though if you've come around on the other side of this we should talk.

Reviewed 10 of 10 files at r1, 10 of 10 files at r2, 11 of 11 files at r3, 6 of 6 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/roachpb/metadata.go, line 263 at r2 (raw file):

	}
	if r.Type == ReplicaType_LEARNER {
		buf.WriteString("LEARNER")

Won't this print something like (n1,s1):12LEARNER? Is that what you intended?


pkg/roachpb/metadata.go, line 150 at r3 (raw file):

func (r *RangeDescriptor) RemoveReplica(nodeID NodeID, storeID StoreID) (ReplicaDescriptor, bool) {
	rs := r.Replicas()
	removed, found := rs.RemoveReplica(nodeID, storeID)

nit: removed and found both sound like they could be bools. Consider s/removed/repl/.


pkg/roachpb/metadata.proto, line 43 at r2 (raw file):

enum ReplicaType {
  // ReplicaType_VOTER indicates a replica that participates in all raft
  // activities, including voting for leadership and applying committed entries.

Learners also apply committed entries. s/applying// makes this a better statement: voters help commit entries, learners don't.


pkg/roachpb/metadata.proto, line 47 at r2 (raw file):
This reads a little odd in that the first sentence says only that they don't vote for leaders (but leaves open whether they count for committing entries), and then the second sentence somehow seems to imply that from the first it follows that it doesn't count for general quorum. Perhaps restructure this as

ReplicaType_LEARNER indicates a replica that applies committed entries, but does not count towards the quorum. Learners do not vote for leadership nor do their acknowledged log entries get taken into account for determining the committed index. At the time of writing, ...

A question I have is whether learners can campaign. And it looks like they can. And it looks like they automatically give themselves a vote when it happens? Uhm, that looks like it might be a bug.. I'm going to take a closer look at this.


pkg/roachpb/metadata_replicas.go, line 105 at r3 (raw file):

		return ReplicaDescriptor{}, false
	}
	// Swap with the last element so we can simply truncate the slice.

Don't we have to re-sort if we swap?


pkg/roachpb/metadata_replicas_test.go, line 37 at r2 (raw file):

			}
		}
	}

Also check that you see all the replicas (or at least the right number of them).


pkg/settings/cluster/cockroach_versions.go, line 484 at r3 (raw file):

// VersionLearnerReplicas is #38149.


pkg/storage/allocator.go, line 308 at r3 (raw file):

	// replicate queue or AdminChangeReplicas request.
	//
	// The replicate queue only operates on leaseholders, which means that only

add "except in rare cases" (old leaseholder could start the operation, and a new leaseholder steps up and also starts an overlapping operation)


pkg/storage/allocator.go, line 324 at r3 (raw file):

	// decision if and when the complexity becomes necessary.
	//
	// TODO(dan): Address the race with the AdminChangeReplicas request.

We also need to make sure that if there is a race, we fail and don't add the voter (i.e. the voter being added should require there to be a learner present - this is probably checked during the ChangeReplicas transaction, but please verify).


pkg/storage/replica_command.go, line 824 at r3 (raw file):

				// TODO(dan): It's just as easy to call promoteLearnerReplicaToVoter
				// here and finish the learner up. Is there any downside? We're already
				// trying to add it.

Just doesn't seem worth it unless we go full swing and also tell the allocator to pick up on learners. Otherwise this will just be this one special case and the work put into testing it is likely not worth the upside.


pkg/storage/replica_command.go, line 838 at r3 (raw file):

	useLearners = useLearners && settings.Version.IsActive(cluster.VersionLearnerReplicas)
	if !useLearners {
		// WIP addReplicaLegacyPreemptiveSnapshot should be able to assert that there is

I know this is WIP, but I'm not sure what you're getting at here.


pkg/storage/replica_command.go, line 861 at r3 (raw file):

	// Now move it to be a full voter (which waits on it to get a raft snapshot
	// first, so it's not immediately way behind).
	voterDesc, err := r.promoteLearnerReplicaToVoter(ctx, &learnerDesc, target, priority, reason, details)

Ah, this addresses my comment from above. If the descriptor has since changed from &learnerDesc, this will fail, so this is 100% an upgrade from learner to voter and not just a cold-add of a voter.


pkg/storage/replica_command.go, line 862 at r3 (raw file):

	// first, so it's not immediately way behind).
	voterDesc, err := r.promoteLearnerReplicaToVoter(ctx, &learnerDesc, target, priority, reason, details)
	if err != nil {

May want to handle context.DeadlineExceeded here. If the context is busted, this next invocation will certainly just fail, too. Maybe that's ok, though.


pkg/storage/replica_raftstorage.go, line 667 at r1 (raw file):

}

const (

Nice to see this wart removed.


pkg/storage/store_snapshot.go, line 151 at r1 (raw file):

			// 19.1 nodes don't set this field, so it may default to RAFT. If it is
			// RAFT and the replica is a placeholder (ID = 0), then this came from a
			// 19.1 node and it's actually PREEMPTIVE. LEARNER starting being set

s/starting/started/
s/populated/introduced

I think? Anyway I can't really parse this as is. You want to say that learners were never used before the snapType field turned into an enum.

Contributor Author

@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/roachpb/metadata.go, line 263 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Won't this print something like (n1,s1):12LEARNER? Is that what you intended?

It's what I intended, but I don't love it, definitely taking suggestions. I wanted it to be more obvious than just an "L"


pkg/roachpb/metadata.go, line 150 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: removed and found both sound like they could be bools. Consider s/removed/repl/.

Done


pkg/roachpb/metadata.proto, line 43 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Learners also apply committed entries. s/applying// makes this a better statement: voters help commit entries, learners don't.

Done


pkg/roachpb/metadata.proto, line 47 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This reads a little odd in that the first sentence says only that they don't vote for leaders (but leaves open whether they count for committing entries), and then the second sentence somehow seems to imply that from the first it follows that it doesn't count for general quorum. Perhaps restructure this as

ReplicaType_LEARNER indicates a replica that applies committed entries, but does not count towards the quorum. Learners do not vote for leadership nor do their acknowledged log entries get taken into account for determining the committed index. At the time of writing, ...

A question I have is whether learners can campaign. And it looks like they can. And it looks like they automatically give themselves a vote when it happens? Uhm, that looks like it might be a bug.. I'm going to take a closer look at this.

Done and lol


pkg/roachpb/metadata_replicas.go, line 105 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't we have to re-sort if we swap?

Ooooo good catch


pkg/roachpb/metadata_replicas_test.go, line 37 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Also check that you see all the replicas (or at least the right number of them).

Done


pkg/settings/cluster/cockroach_versions.go, line 484 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// VersionLearnerReplicas is #38149.

Done


pkg/storage/allocator.go, line 308 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

add "except in rare cases" (old leaseholder could start the operation, and a new leaseholder steps up and also starts an overlapping operation)

Done


pkg/storage/allocator.go, line 324 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We also need to make sure that if there is a race, we fail and don't add the voter (i.e. the voter being added should require there to be a learner present - this is probably checked during the ChangeReplicas transaction, but please verify).

Yup, you saw this below. All of this needs tests of course.

I was trying yesterday to get the etcd/raft ConfChange plus this removal (which I was thinking was enough to introduce the cluster version) into the other PR to be merged now, but I was having trouble producing an abandoned learner in a test to verify the removal, so I decided to pull that out.


pkg/storage/replica_command.go, line 824 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just doesn't seem worth it unless we go full swing and also tell the allocator to pick up on learners. Otherwise this will just be this one special case and the work put into testing it is likely not worth the upside.

I see these as separable concerns that each deserve their own decision. As I mentioned in our 1:1, I feel strongly that for now the allocator should remove any learners it sees, and we can add the complexity to do something smarter only when it becomes necessary.

I haven't made up my mind either way on this TODO. If someone has requested to add literally this exact replica and it's already a learner, it just seems silly to remove it and tell them to request the addition again. This also potentially disrupts whatever other addReplica added the learner, which may still be ongoing. Plus, given the races between old/new leaseholder replication queue or replication queue/AdminChangeReplicas, it seems like this is worth a test even if we don't finish up the learner, so I'm not sure we save any work on testing.

On the other hand, I share your exact concerns.


pkg/storage/replica_command.go, line 838 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I know this is WIP, but I'm not sure what you're getting at here.

detritus from an earlier revision


pkg/storage/replica_command.go, line 861 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ah, this addresses my comment from above. If the descriptor has since changed from &learnerDesc, this will fail, so this is 100% an upgrade from learner to voter and not just a cold-add of a voter.

Yay for CPut!
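
For context on why the CPut matters here: the descriptor update inside the ChangeReplicas transaction is a conditional put keyed on the descriptor the coordinator read, so a stale promotion fails instead of cold-adding a voter. Below is a rough sketch of that pattern, assuming the `*client.Txn.CPut` and `protoutil.Marshal` signatures of this era; it is not the PR's exact descriptor-update helper.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/protoutil"
)

// conditionalPutRangeDescriptor illustrates the pattern referenced above:
// write the new descriptor only if the stored one still matches what the
// caller read. If another actor (e.g. the replicate queue removing an
// orphaned learner) changed it in the meantime, the CPut fails and the
// promotion aborts rather than adding a voter from scratch.
func conditionalPutRangeDescriptor(
	ctx context.Context,
	txn *client.Txn,
	descKey roachpb.Key,
	oldDesc, newDesc *roachpb.RangeDescriptor,
) error {
	oldValue, err := protoutil.Marshal(oldDesc) // expected existing bytes
	if err != nil {
		return err
	}
	newValue, err := protoutil.Marshal(newDesc)
	if err != nil {
		return err
	}
	return txn.CPut(ctx, descKey, newValue, oldValue)
}
```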


pkg/storage/replica_command.go, line 862 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

May want to handle context.DeadlineExceeded here. If the context is busted, this next invocation will certainly just fail, too. Maybe that's ok, though.

Oooh hmm almost seems like I always want a new context for the rollback, added a WIP to think this through some more


pkg/storage/store_snapshot.go, line 151 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/starting/started/
s/populated/introduced

I think? Anyway I can't really parse this as is. You want to say that learners were never used before the snapType field turned into an enum.

Rewrote this entirely. I think it's better now

Member

@tbg tbg left a comment

Reviewed 15 of 15 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/roachpb/metadata.go, line 263 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

It's what I intended, but I don't love it, definitely taking suggestions. I wanted it to be more obvious than just an "L"

I also don't have a great suggestion here and I like how obvious what you have is.


pkg/storage/replica_command.go, line 824 at r3 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I see these as separable concerns that each deserve their own decision. As I mentioned in our 1:1, I feel strongly that for now the allocator should remove any learners it sees, and we can add the complexity to do something smarter only when it becomes necessary.

I haven't made up my mind either way on this TODO. If someone has requested to add literally this exact replica and it's already a learner, it just seems silly to remove it and tell them to request the addition again. This also potentially disrupts whatever other addReplica added the learner, which may still be ongoing. Plus, given the races between old/new leaseholder replication queue or replication queue/AdminChangeReplicas, it seems like this is worth a test even if we don't finish up the learner, so I'm not sure we save any work on testing.

On the other hand, I share your exact concerns.

👍


pkg/storage/replica_command.go, line 866 at r5 (raw file):

		rollbackDesc.SetReplicas(learnerDesc.Replicas().DeepCopy())
		rollbackDesc.RemoveReplica(target.NodeID, target.StoreID)
		// WIP seems like we may want a new context here; if (for example) the

We're also sure to fail if the learner wasn't added, which is ... maybe likely?

@danhhz danhhz force-pushed the learners branch 3 times, most recently from da478f7 to 8cbb381 Compare July 11, 2019 17:22
@danhhz danhhz marked this pull request as ready for review July 11, 2019 18:00
@danhhz danhhz requested a review from a team July 11, 2019 18:00
@danhhz danhhz requested a review from a team as a code owner July 11, 2019 18:00
@danhhz danhhz requested review from a team July 11, 2019 18:00
@danhhz
Contributor Author

danhhz commented Jul 11, 2019

I thought of some new tests to write and there are a couple loose ends to tie up, but this is starting to round into shape.

@danhhz
Contributor Author

danhhz commented Jul 12, 2019

Hey @tbg this is getting close, but I got stuck in a few places and could use some pointers.

The big one is forcing materialization of the *Replica after the ConfChange txn commits. This comes up when I'm using my new TestingKnob to bail before the snapshot. The snapshot normally will force this, but I'm hoping I could poke it from test code somehow... (I swear this worked as-is somehow before I rebased, but that's crazy right?)

Also wondering if there's some way to verify at the raft level that the replica really is a learner. I tried digging into how we get the ConfState for a snapshot and it looks like we synthesize it from the descriptor (and it's wrong for learners lol, glad I caught this).

Finally, I'm curious if you have thoughts on the WIP I have in TestGCQueueSeesLearner. For the other ones, I found some metrics that get incremented as a side effect, but no luck for this one. At the moment, I can't tell if it ran and correctly did nothing or if it didn't run (e.g. if the test rots).

Member

@tbg tbg left a comment

Also wondering if there's some way to verify at the raft level that the replica really is a learner. I tried digging into how we get the ConfState for a snapshot and it looks like we synthesize it from the descriptor (and it's wrong for learners lol, glad I caught this).

You can get the RaftStatus from any of the peers (assuming it's applied the latest config change already) and check the Progress (keyed on replicaID), which has an IsLearner field.
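
A sketch of that check as it might appear in a test, assuming a `*storage.Replica` handle obtained from one of the peers and the new replica's ID (both are assumptions; the helper name is made up for illustration):

```go
import (
	"errors"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
)

// raftSaysLearner reports whether the given peer currently tracks replicaID
// as a learner in its raft progress. The Progress map is only populated on
// the raft leader, and RaftStatus returns nil until the replica's raft group
// has been woken up, so callers typically retry (e.g. via SucceedsSoon).
func raftSaysLearner(repl *storage.Replica, replicaID roachpb.ReplicaID) error {
	status := repl.RaftStatus()
	if status == nil {
		return errors.New("raft group not yet awake")
	}
	progress, ok := status.Progress[uint64(replicaID)]
	if !ok {
		return errors.New("peer has not applied the conf change yet")
	}
	if !progress.IsLearner {
		return errors.New("replica is not tracked as a learner")
	}
	return nil
}
```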

The big one is forcing materialization of the *Replica after the ConfChange txn commits. This comes up when I'm using my new TestingKnob to bail before the snapshot. The snapshot normally will force this, but I'm hoping I could poke it from test code somehow... (I swear this worked as-is somehow before I rebased, but that's crazy right?)

I'll leave a comment in the code when I'm more sure what you're trying to do, but sounds like you're trying to test the absence of the materialization? That's generally difficult.

Just finished the first full review -- looks pretty good! Before we merge I'm going to want to give this another "from scratch" review but let's first figure out all the roadblocks and then I'll let you do a cleanup pass.

Reviewed 34 of 34 files at r6, 22 of 22 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/roachpb/metadata.go, line 279 at r7 (raw file):

// a nullable proto field.
func ReplicaTypeLearner() *ReplicaType {
	return &replicaTypeLearner

Are these on anything resembling a hot path? Because sharing mutable memory like that makes me nervous, so I'd avoid it if it's not noticeable.


pkg/sql/crdb_internal.go, line 1773 at r6 (raw file):

			// TODO(dan): We're trying to treat learners as a far-behind replica as
			// much as possible, so just include them in the list of replicas. We can
			// add a separate column for them if we get feedback about it.

This is going to backfire, we use this column specifically to check the number of voters. We need to either exclude learners or add the new column, I'd say add the column now, otherwise we'll end up with this annoying papercut that will probably stay around for a while.


pkg/storage/raft_snapshot_queue.go, line 114 at r7 (raw file):

	//
	// WIP this raft snapshot was causing the *Replica to be instantiated, which
	// all the tests block on, so uncommenting this breaks all the learner tests

Hmm, that's surprising. How are you relying on the log queue to send the snapshots? Doesn't the learner snap do the exact same thing? And even in the absence of a snap, shouldn't the leader contact the follower, at least within a few hundred ms tops?


pkg/storage/replica_command.go, line 1016 at r7 (raw file):

happens


pkg/storage/replica_command.go, line 1018 at r7 (raw file):

		// This is a programming error if it happenes. Why are we rolling back
		// something that's not present?
		log.Warningf(ctx, "failed to rollback learner %s, missing from descriptor %s", target, desc)

Everyone's favorite bug: return


pkg/storage/replica_command.go, line 1026 at r7 (raw file):

	// some timeout that should fire unless something is going really wrong).
	tags := logtags.FromContext(ctx)
	const rollbackTimeout = 5 * time.Minute

This timeout seems too aggressive. The queue calls this and we don't want to block the queue for five minutes. Let's say 10s?
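
A minimal sketch of the shape being suggested, using only the standard library; the real code also carries the caller's log tags over to the new context, which is elided here, and the `rollback` closure is an assumption standing in for the actual cleanup.

```go
import (
	"context"
	"time"
)

// runRollbackWithTimeout runs the learner rollback under its own context so
// that a cancelled or expired caller context can't doom the cleanup, and so
// a stuck cleanup can't block the calling queue for long.
func runRollbackWithTimeout(rollback func(context.Context) error) error {
	const rollbackTimeout = 10 * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()
	return rollback(ctx)
}
```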


pkg/storage/replica_command.go, line 1035 at r7 (raw file):

		log.Infof(ctx,
			"failed to rollback learner %s, abandoning it for the replicate queue %v", target, err)
		// TODO(dan): Queue it up here to speed this process up?

Sure, doesn't hurt.


pkg/storage/replica_learners_test.go, line 59 at r7 (raw file):

		repl = store.LookupReplica(roachpb.RKey(key))
		if repl == nil {
			return errors.New(`could not find learner replica`)

s/learner//


pkg/storage/replica_learners_test.go, line 75 at r7 (raw file):

	var c int64
	var found bool
	store.Registry().Each(func(n string, v interface{}) {

Just access store.Metrics().YourField.Count() directly, any particular reason not to?


pkg/storage/replica_learners_test.go, line 97 at r7 (raw file):

		st := tc.Server(i).ClusterSettings()
		st.Manual.Store(true)
		storage.UseLearnerReplicas.Override(&st.SV, true)

I should set up a hot key for this comment!

Override doesn't reliably work in full servers and should never be used there because incoming Gossip updates will nix it and it's annoying to get it right. Use SET CLUSTER SETTING, which also waits for the setting to be active (on the node on which you execute it) before it returns.
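
A sketch of what that looks like in a test, assuming the setting registered for `storage.UseLearnerReplicas` is named `kv.learner_replicas.enabled` (the name isn't shown in this thread, so treat it as a placeholder):

```go
import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestEnableLearnersViaSQL(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationManual,
	})
	defer tc.Stopper().Stop(ctx)

	// SET CLUSTER SETTING waits for the setting to be active on the gateway
	// node before returning, unlike Override, which a later gossip update can
	// silently undo.
	db := tc.ServerConn(0)
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.learner_replicas.enabled = true`, // placeholder name
	); err != nil {
		t.Fatal(err)
	}
}
```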


pkg/storage/replica_learners_test.go, line 145 at r7 (raw file):

	desc = tc.LookupRangeOrFatal(t, scratchStartKey)
	require.Len(t, desc.Replicas().Voters(), 2)

Hooray 👌


pkg/storage/replica_learners_test.go, line 206 at r7 (raw file):

	_, err = tc.Server(0).MergeRanges(scratchStartKey)
	if !testutils.IsError(err, `ranges not collocated`) {

Can you make this error nicer, so that it mentions that a learner blocked things?


pkg/storage/replica_learners_test.go, line 209 at r7 (raw file):

		t.Fatalf(`expected "ranges not collocated" error got: %+v`, err)
	}
	// WIP what happens now though?

It might be better to block the learner addition instead, so that you can run the merge while it's blocked, verify the error, then continue the addition, and merge again in the end. This is close enough to what the other test is doing that you might consider merging them. Or you leave as is.

It does generally seem much nicer to untangle these dependencies between the queues that we're introducing by allowing a learner to be upgraded (then merges would just relocate to match the LHS as before, learners and all). I wonder how hard that is. Definitely not in this first cut, though.


pkg/storage/replica_learners_test.go, line 294 at r7 (raw file):

	require.NoError(t, err)
	require.Equal(t, ``, errMsg)
	// WIP seems like this is pretty brittle if this stops running the queue, the

Prior to orphaning the learner, you could add and remove a voter. After the removal, there's a gc'able replica, which you can verify gets removed. Then you orphan the learner and verify again, but now you have confidence that the queue in principle will remove replicas.


pkg/storage/replica_learners_test.go, line 304 at r7 (raw file):

func TestRaftSnapshotQueueSeesLearner(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip(`WIP`)

Anchor comment so that I come back to this when it's ready.


pkg/storage/replica_learners_test.go, line 341 at r7 (raw file):

func TestLearnerReplicateQueueRace(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip(`WIP`)

Anchor comment so that I come back to this when it's ready.


pkg/storage/replica_learners_test.go, line 408 at r7 (raw file):

	desc := tc.LookupRangeOrFatal(t, scratchStartKey)
	err := tc.TransferRangeLease(desc, tc.Target(1))

Also add a unit test in batcheval to check the check in cmd_lease_transfer.go, which is the one that really matters because it 100% prevents leases going to learners (only command evaluation sees the authoritative desc and linearizes everything properly)


pkg/storage/replica_raftstorage.go, line 543 at r7 (raw file):

	}
	// WIP pretty sure the learners need to go in cs.Learners, but no test seems
	// to be catching this right now.

Oops :-) I think this calls for two tests:

  1. after creating a learner, check all replicas' in-mem confstates (this catches this bug here)
  2. then restart the node (at least the learner, but in principle all of them) and verify that the learner is still there (in particular in the conf state), this catches
    for _, rep := range r.mu.state.Desc.Replicas().Unwrap() {
      cs.Nodes = append(cs.Nodes, uint64(rep.ReplicaID))
    }

You can access the configuration somewhat indirectly (but good enough) via repl.RaftStatus().Progress[replicaID].IsLearner. Note that the replica needs to be woken up (i.e. its raft group created) or the raft status will be nil.

Independently you may also want to refactor the conf state creation so that they share code. I'm going to have to muck with that too in a couple of weeks, so I can do it then if you prefer.


pkg/storage/replica_range_lease.go, line 690 at r7 (raw file):

transfer

here and in the test that checks this


pkg/storage/replicate_queue.go, line 605 at r6 (raw file):

	opts transferLeaseOptions,
) (bool, error) {
	// Learner replicas aren't allowed to become the leaseholder or raft leader,

TransferLease (cmd_lease_transfer) needs to have an authoritative check.


pkg/storage/testing_knobs.go, line 189 at r7 (raw file):

	// error is returned from the hook, it's sent as an ERROR SnapshotResponse.
	ReceiveSnapshot func(*SnapshotRequest_Header) error
	// ReplicaAddStopAfterLearner, if true, causes replica adding to return early:

Wonder what's up with the highlighting here.


pkg/testutils/testcluster/testcluster.go, line 512 at r6 (raw file):

A learner replica

also this seems insufficient. The range desc will go like {1} to {1,2LEARNER}, {1,2}, ... so in every odd state you'll pass this first check even though there will be learners after.

I'm noticing now that this method doesn't wait for upreplication. It waits for the replicas that are present in the desc to be initialized. Can you rename this method accordingly (WaitForSplitAndInitialization), it's confusing right now.

@danhhz danhhz force-pushed the learners branch 3 times, most recently from 655e021 to 2f483d3 Compare July 15, 2019 23:25
Contributor Author

@danhhz danhhz left a comment

Left a few questions inline for ya

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/roachpb/metadata.go, line 263 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I also don't have a great suggestion here and I like how obvious what you have is.

This seems to have worked out well in my debugging. We can always revisit if it turns out to be too noisy


pkg/roachpb/metadata.go, line 279 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are these on anything resembling a hot path? Because sharing mutable memory like that makes me nervous, so I'd avoid it if it's not noticeable.

Absolutely not a hot path. Done


pkg/sql/crdb_internal.go, line 1773 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is going to backfire, we use this column specifically to check the number of voters. We need to either exclude learners or add the new column, I'd say add the column now, otherwise we'll end up with this annoying papercut that will probably stay around for a while.

Done. When naming the column, I was thinking it'd be nice to avoid forcing the concept of a raft learner on a user of this table, wdyt?


pkg/storage/raft_snapshot_queue.go, line 114 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, that's surprising. How are you relying on the log queue to send the snapshots? Doesn't the learner snap do the exact same thing? And even in the absence of a snap, shouldn't the leader contact the follower, at least within a few hundred ms tops?

Looks like TestCluster.AddReplicas waits for the newly added Replica to have a descriptor, which doesn't happen until it gets some kind of snapshot right? I moved my TestingKnob to be after the learner snap, which seems to be working well.


pkg/storage/replica_command.go, line 824 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

👍

I thought about this more and convinced myself it won't happen often enough to be worth the complexity


pkg/storage/replica_command.go, line 866 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We're also sure to fail if the learner wasn't added, which is ... maybe likely?

Reworked all this


pkg/storage/replica_command.go, line 1016 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

happens

Heh, had this fixed on my local branch right after I sent this, but didn't want to waste the teamcity resources by triggering another run


pkg/storage/replica_command.go, line 1018 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Everyone's favorite bug: return

😱


pkg/storage/replica_command.go, line 1026 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This timeout seems too aggressive. The queue calls this and we don't want to block the queue for five minutes. Let's say 10s?

Done.


pkg/storage/replica_command.go, line 1035 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sure, doesn't hurt.

Done.


pkg/storage/replica_learners_test.go, line 59 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/learner//

Done.


pkg/storage/replica_learners_test.go, line 75 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just access store.Metrics().YourField.Count() directly, any particular reason not to?

Stuff like queue.replicate.removelearnerreplica isn't stored directly on storage.StoreMetrics, it's registered into storage.StoreMetrics.registry 🤷‍♂️


pkg/storage/replica_learners_test.go, line 97 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I should set up a hot key for this comment!

Override doesn't reliably work in full servers and should never be used there because incoming Gossip updates will nix it and it's annoying to get it right. Use SET CLUSTER SETTING, which also waits for the setting to be active (on the node on which you execute it) before it returns.

Argh! I always remember that one of them does this, but I seem to always mis-remember which one it is. I wonder if we could reasonably fix this papercut, it's unfortunate that cluster settings are so hard to use correctly in tests


pkg/storage/replica_learners_test.go, line 206 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you make this error nicer, so that it mentions that a learner blocked things?

At the moment, it's just comparing the replica descriptors, no learner special case. (Which means it would incorrectly allow the merge if there happened to be a learner on both sides.)


pkg/storage/replica_learners_test.go, line 209 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It might be better to block the learner addition instead, so that you can run the merge while it's blocked, verify the error, then continue the addition, and merge again in the end. This is close enough to what the other test is doing that you might consider merging them. Or you leave as is.

It does generally seem much nicer to untangle these dependencies between the queues that we're introducing by allowing a learner to be upgraded (then merges would just relocate to match the LHS as before, learners and all). I wonder how hard that is. Definitely not in this first cut, though.

Yeah, we've had enough trouble with things blocking splits that I'm also wary of this sort of inter-queue logical dependency. What should our first cut answer be?

Do I remember right that you suggested at some point that we just remove the learners? That's tempting for the first cut given that things already have to be resilient to the replicate queue deleting learners. How would that work? Something like just calling execChangeReplicasTxn?


pkg/storage/replica_learners_test.go, line 294 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Prior to orphaning the learner, you could add and remove a voter. After the removal, there's a gc'able replica, which you can verify gets removed. Then you orphan the learner and verify again, but now you have confidence that the queue in principle will remove replicas.

Used the trace


pkg/storage/replica_learners_test.go, line 304 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Anchor comment so that I come back to this when it's ready.

Done.


pkg/storage/replica_learners_test.go, line 341 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Anchor comment so that I come back to this when it's ready.

Done.


pkg/storage/replica_learners_test.go, line 408 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Also add a unit test in batcheval to check the check in cmd_lease_transfer.go, which is the one that really matters because it 100% prevents leases going to learners (only command evaluation sees the authoritative desc and linearizes everything properly)

I assume RequestLease needs a check too? Any pointers on a good test to cargo cult?


pkg/storage/replica_raftstorage.go, line 543 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Oops :-) I think this calls for two tests:

  1. after creating a learner, check all replicas' in-mem confstates (this catches this bug here)
  2. then restart the node (at least the learner, but in principle all of them) and verify that the learner is still there (in particular in the conf state), this catches
    for _, rep := range r.mu.state.Desc.Replicas().Unwrap() {
      cs.Nodes = append(cs.Nodes, uint64(rep.ReplicaID))
    }

You can access the configuration somewhat indirectly (but good enough) via repl.RaftStatus().Progress[replicaID].IsLearner. Note that the replica needs to be woken up (i.e. its raft group created) or the raft status will be nil.

Independently you may also want to refactor the conf state creation so that they share code. I'm going to have to muck with that too in a couple of weeks, so I can do it then if you prefer.

How should I restart a node in the test? That's still not possible with TestCluster, right?


pkg/storage/replica_range_lease.go, line 690 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

transfer

here and in the test that checks this

Done. Is there any reason to keep this check? Seems like having only the authoritative one is better


pkg/storage/replicate_queue.go, line 605 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

TransferLease (cmd_lease_transfer) needs to have an authoritative check.

Done.


pkg/storage/testing_knobs.go, line 189 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Wonder what's up with the highlighting here.

oh, the comma immediately after the name. let's see if this fixes it


pkg/testutils/testcluster/testcluster.go, line 512 at r6 (raw file):

also this seems insufficient. The range desc will go like {1} to {1,2LEARNER}, {1,2}, ... so in every odd state you'll pass this first check even though there will be learners after.

I don't follow this. Which first check? What do you mean by there will be learners after?

I'm noticing now that this method doesn't wait for upreplication. It waits for the replicas that are present in the desc to be initialized. Can you rename this method accordingly (WaitForSplitAndInitialization), it's confusing right now.

Done

Member

@tbg tbg left a comment

Reviewed 24 of 24 files at r8, 27 of 27 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/sql/crdb_internal.go, line 1773 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Done. When naming the column, I was thinking it'd be nice to avoid forcing the concept of a raft learner on a user of this table, wdyt?

Spanner calls them read-write, read-only, and witness replicas (https://cloud.google.com/spanner/docs/replication) so "learner" is probably not going to be our final choice. That said, non_quorum_replicas seems clunky, and this table is crdb_internal which means that anything we intentionally expose to the end user will be presented as a view which has the opportunity to name things whatever it wants. My preference would be to stick with "learners" just to lower the amount of vocabulary out there prematurely.


pkg/storage/raft_snapshot_queue.go, line 114 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Looks like TestCluster.AddReplicas waits for the newly added Replica to have a descriptor, which doesn't happen until it gets some kind of snapshot right? I moved my TestingKnob to be after the learner snap, which seems to be working well.

Yes, it needs a snapshot either from you or the queue. I think I understand what you're saying, you were injecting a problem before your snapshot, so without the queue things would get stuck and you'd never see a descriptor.


pkg/storage/raft_snapshot_queue.go, line 112 at r9 (raw file):

	// that's adding it or it's been orphaned and it's about to be cleaned up.
	// Either way, no point in also sending it a snapshot of type RAFT.
	if repDesc.GetType() == roachpb.ReplicaType_LEARNER {

Can we (also) run this kind of check in shouldQueue? We don't want to queue these replicas in the first place if we can avoid it.


pkg/storage/replica_learners_test.go, line 75 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Stuff like queue.replicate.removelearnerreplica isn't stored directly on storage.StoreMetrics, it's registered into storage.StoreMetrics.registry 🤷‍♂️

Ah... ew. Ok, carry on


pkg/storage/replica_learners_test.go, line 206 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

At the moment, it's just comparing the replica descriptors, no learner special case. (Which means it would incorrectly allow the merge if there happened to be a learner on both sides.)

Are you saying that MergeRanges won't attempt to colocate? Then I'm curious what you're testing here, I thought the point of this test was that the merges will not colocate when there are learners but instead throw their hands up (though we discuss alternatives below).


pkg/storage/replica_learners_test.go, line 209 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Yeah, we've had enough trouble with things blocking splits that I'm also wary of this sort of inter-queue logical dependency. What should our first cut answer be?

Do I remember right that you suggested at some point that we just remove the learners? That's tempting for the first cut given that things already have to be resilient to the replicate queue deleting learners. How would that work? Something like just calling execChangeReplicasTxn?

The relocation logic is unfortunately only present in the queue:

if err := mq.store.DB().AdminRelocateRange(ctx, rhsDesc.StartKey, targets); err != nil {
	return err
}

This all comes down to making AdminRelocateRange learner-aware. We should probably just bite that bullet now, and then let the merge queue remove the learners while relocating.


pkg/storage/replica_learners_test.go, line 408 at r7 (raw file):

I assume RequestLease needs a check too

👍

Any pointers on a good test to cargo cult?

I expect that reading

evalCtx := mockEvalCtx{
	clusterSettings: st,
	desc:            &roachpb.RangeDescriptor{RangeID: rangeID},
	term:            term,
	firstIndex:      firstIndex,
}
and the surrounding test code should be helpful. You basically just need an evalCtx that belongs to a learner and verify that an error comes out of Request/TransferLease.


pkg/storage/replica_learners_test.go, line 143 at r9 (raw file):

	_, repl := getFirstStoreReplica(t, tc.Server(0), scratchStartKey)
	testutils.SucceedsSoon(t, func() error {
		for _, p := range repl.RaftStatus().Progress {

The key is the replicaID, so you should be able to check exactly the right one.


pkg/storage/replica_learners_test.go, line 464 at r9 (raw file):

func TestLearnerFollowerRead(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip(`WIP`)

Anchor


pkg/storage/replica_raftstorage.go, line 543 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

How should I restart a node in the test? That's still not possible with TestCluster, right?

I was thinking restart the whole cluster, i.e. you init it with data dirs, then stop the whole thing, then start it again from the data dirs. It's possible that this doesn't work in which case yeah we need to do something new :-/


pkg/storage/replica_range_lease.go, line 690 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Done. Is there any reason to keep this check? Seems like having only the authoritative one is better

I'm worried about the "stop using the current lease" line below, which we shouldn't hit. But this also raises the question of what happens when the above check fails to fire... are we stalling the leader for a while? This same problem could also occur if the lease evaluation misfired for any other reason (like the target having been removed) so that problem isn't new, but now we should look into it a bit.


pkg/testutils/testcluster/testcluster.go, line 512 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

also this seems insufficient. The range desc will go like {1} to {1,2LEARNER}, {1,2}, ... so in every odd state you'll pass this first check even though there will be learners after.

I don't follow this. Which first check? What do you mean by there will be learners after?

I'm noticing now that this method doesn't wait for upreplication. It waits for the replicas that are present in the desc to be initialized. Can you rename this method accordingly (WaitForSplitAndInitialization), it's confusing right now.

Done

You check first that there aren't any learners, but the range could still be upreplicating, so you might be satisfied way too early. For example, you see {1,2} and call it good but then the desc becomes {1,2,3LEARNER}.
This whole method doesn't deal with upreplication properly (it just isn't concerned with it it seems), but I wonder what the learner check you added is even buying you.

@danhhz danhhz mentioned this pull request Jul 16, 2019
13 tasks
@danhhz danhhz force-pushed the learners branch 3 times, most recently from 8e80a7c to a65af02 Compare July 16, 2019 20:44
Contributor Author

@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/sql/crdb_internal.go, line 1773 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Spanner calls them read-write, read-only, and witness replicas (https://cloud.google.com/spanner/docs/replication) so "learner" is probably not going to be our final choice. That said, non_quorum_replicas seems clunky, and this table is crdb_internal which means that anything we intentionally expose to the end user will be presented as a view which has the opportunity to name things whatever it wants. My preference would be to stick with "learners" just to lower the amount of vocabulary out there prematurely.

Done


pkg/storage/raft_snapshot_queue.go, line 114 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yes, it needs a snapshot either from you or the queue. I think I understand what you're saying, you were injecting a problem before your snapshot, so without the queue things would get stuck and you'd never see a descriptor.

Exactly. Most of my confusion was from thinking that I'd seen this work at some point, but I went back through my reflog to find that point, added some logging, and figured out what was going on. :nothingtodohere:


pkg/storage/raft_snapshot_queue.go, line 112 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can we (also) run this kind of check in shouldQueue? We don't want to queue these replicas in the first place if we can avoid it.

Done.


pkg/storage/replica_learners_test.go, line 75 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ah... ew. Ok, carry on

I think this is something we can fix. Added a comment with a TODO


pkg/storage/replica_learners_test.go, line 206 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are you saying that MergeRanges won't attempt to colocate? Then I'm curious what you're testing here, I thought the point of this test was that the merges will not colocate when there are learners but instead throw their hands up (though we discuss alternatives below).

As discussed offline, punting merges for after this PR (tracked in #38902 so I don't forget)


pkg/storage/replica_learners_test.go, line 209 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The relocation logic is unfortunately only present in the queue:

if err := mq.store.DB().AdminRelocateRange(ctx, rhsDesc.StartKey, targets); err != nil {
	return err
}

This all comes down to making AdminRelocateRange learner-aware. We should probably just bite that bullet now, and then let the merge queue remove the learners while relocating.

As discussed offline, punting merges for after this PR (tracked in #38902 so I don't forget)


pkg/storage/replica_learners_test.go, line 408 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I assume RequestLease needs a check too

👍

Any pointers on a good test to cargo cult?

I expect that reading

evalCtx := mockEvalCtx{
	clusterSettings: st,
	desc:            &roachpb.RangeDescriptor{RangeID: rangeID},
	term:            term,
	firstIndex:      firstIndex,
}
and the surrounding test code should be helpful. You basically just need an evalCtx that belongs to a learner and verify that an error comes out of Request/TransferLease.

Added something to cmd_lease_test.go. Is that the sort of thing you're thinking?


pkg/storage/replica_learners_test.go, line 143 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The key is the replicaID, so you should be able to check exactly the right one.

Done.


pkg/storage/replica_learners_test.go, line 464 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Anchor

Still didn't figure out why I'm not getting updates, but while I was looking into it, found a way to unblock the test 🤷‍♂️


pkg/storage/replica_raftstorage.go, line 543 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I was thinking restart the whole cluster, i.e. you init it with data dirs, then stop the whole thing, then start it again from the data dirs. It's possible that this doesn't work in which case yeah we need to do something new :-/

I think I made the TestCluster restart work. Boy did this test catch something though. When I fix the ConfState computation, I get

E190716 20:00:39.837862 1132 vendor/go.etcd.io/etcd/raft/raft.go:1358  [n2,s2,r21/2:{-}] 2 can't become learner when restores snapshot [index: 14, term: 6]

Update, you mentioned offline that you fixed this in upstream raft. I'll separately PR an etcd/raft dep bump


pkg/storage/replica_range_lease.go, line 690 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm worried about the "stop using the current lease" line below, which we shouldn't hit. But this also raises the question of what happens when the above check fails to fire... are we stalling the leader for a while? This same problem could also occur if the lease evaluation misfired for any other reason (like the target having been removed) so that problem isn't new, but now we should look into it a bit.

As discussed offline, added a TODO


pkg/testutils/testcluster/testcluster.go, line 512 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You check first that there aren't any learners, but the range could still be upreplicating, so you might be satisfied way too early. For example, you see {1,2} and call it good but then the desc becomes {1,2,3LEARNER}.
This whole method doesn't deal with upreplication properly (it just isn't concerned with it it seems), but I wonder what the learner check you added is even buying you.

Gotcha. Reverted this

Member

@tbg tbg left a comment

Reviewed 12 of 12 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/storage/raft_snapshot_queue.go, line 71 at r10 (raw file):

			if p.State == raft.ProgressStateSnapshot {
				// We refuse to send a snapshot of type RAFT to a learner for reasons
				// described in processRaftSnapshot, so don't bother queueing.

BTW, I am slightly anxious about this, the reason being that if we fail to send a snapshot to a learner, until the replicate queue comes around and removes the learner, it will effectively be a "sticky trap": waiting for any Raft proposal will just take "forever".

We don't funnel any traffic into the learner because we bail on the leaseholder check, so hopefully this is not an issue. The one place it could matter is

// For lease commands, use the provided previous lease for verification.
if ba.IsSingleSkipLeaseCheckRequest() {
	lease = ba.GetPrevLeaseForLeaseRequest()
} else {

The requests triggering this path are RequestLease and TransferLease, which should never be sent to the range (because your code doesn't request or transfer the lease for learners), so this is more of a theoretical concern than a practical one, I hope. Worth making a note somewhere, though. For long-lived learners, this could become more of an issue.


pkg/storage/replica_learners_test.go, line 408 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Added something to cmd_lease_test.go. Is that the sort of thing you're thinking?

Yep


pkg/storage/replica_learners_test.go, line 143 at r9 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Done.

In TestLearnerRaftConfState, right?


pkg/storage/replica_learners_test.go, line 464 at r9 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Still didn't figure out why I'm not getting updates, but while I was looking into it, found a way to unblock the test 🤷‍♂️

I checked this out locally, and ... everything just works? I removed first the lease transfer only. Works. Removed even adding Target(2). Still works. Dropped back down to two nodes. Still works. Even under stress (afaict). Can you check again?


pkg/storage/replica_learners_test.go, line 184 at r10 (raw file):

			for _, repl := range repls {
				status := repl.RaftStatus()
				if !status.Progress[uint64(id)].IsLearner {

PS there's something awkward I have discovered which is that you don't get a status.Progress when you're not on the leaseholder. Just something to keep in mind. We should really expose the config properly in the raft status, I added it to etcd-io/etcd#10903. When the vendor bump has landed you can just look at the actual config it's using instead.

Contributor Author

@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/raft_snapshot_queue.go, line 71 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

BTW, I am slightly anxious about this, the reason being that if we fail to send a snapshot to a learner, until the replicate queue comes around and removes the learner, it will effectively be a "sticky trap": waiting for any Raft proposal will just take "forever".

We don't funnel any traffic into the learner because we bail on the leaseholder check, so hopefully this is not an issue. The one place it could matter is

// For lease commands, use the provided previous lease for verification.
if ba.IsSingleSkipLeaseCheckRequest() {
	lease = ba.GetPrevLeaseForLeaseRequest()
} else {

The requests triggering this path are RequestLease and TransferLease, which should never be sent to the range (because your code doesn't request or transfer the lease for learners), so this is more of a theoretical concern than a practical one, I hope. Worth making a note somewhere, though. For long-lived learners, this could become more of an issue.

Oof, good point. Added a note, but we should really fix this, so I put it on the followup issue #38902 to go back to the drawing board on this.


pkg/storage/replica_follower_read.go, line 44 at r10 (raw file):

	// to be short-lived, so it's not worth working out the edge-cases. Revisit if
	// we add long-lived learners.
	repDesc, err := r.GetReplicaDescriptor()

Are we worried about this lock/unlock?


pkg/storage/replica_learners_test.go, line 143 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In TestLearnerRaftConfState, right?

Yes. The TestCluster restart was invasive enough that I didn't want to complicate this test


pkg/storage/replica_learners_test.go, line 464 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I checked this out locally, and ... everything just works? I removed first the lease transfer only. Works. Removed even adding Target(2). Still works. Dropped back down to two nodes. Still works. Even under stress (afaict). Can you check again?

The test passes because the learner check happens first in canServeFollowerRead, but if you remove that check, then maxClosed is always 0. Removed because it's not necessary for making the test pass.


pkg/storage/replica_learners_test.go, line 184 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

PS there's something awkward I have discovered which is that you don't get a status.Progress when you're not on the leaseholder. Just something to keep in mind. We should really expose the config properly in the raft status, I added it to etcd-io/etcd#10903. When the vendor bump has landed you can just look at the actual config it's using instead.

Done. Thanks!

@tbg tbg requested review from tbg and nvanbenschoten July 18, 2019 22:33
@tbg tbg (Member) left a comment

:lgtm:

Reviewed 35 of 35 files at r11, 31 of 31 files at r12.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @nvanbenschoten)


pkg/storage/replica_command.go, line 947 at r12 (raw file):

	generationComparableEnabled := store.ClusterSettings().Version.IsActive(cluster.VersionGenerationComparable)
	if generationComparableEnabled {
		newDesc.IncrementGeneration()

This is parroting how @jeffrey-xiao set things up, right? Just want to make sure we're not supposed to unconditionally increment.


pkg/storage/replica_command.go, line 1002 at r12 (raw file):

		updatedDesc := *desc
		generationComparableEnabled := r.store.ClusterSettings().Version.IsActive(cluster.VersionGenerationComparable)
		if generationComparableEnabled {

Ditto


pkg/storage/replica_command.go, line 1022 at r12 (raw file):

	newDesc := *desc
	generationComparableEnabled := r.store.ClusterSettings().Version.IsActive(cluster.VersionGenerationComparable)
	if generationComparableEnabled {

Tritto


pkg/storage/replica_command.go, line 1149 at r12 (raw file):

	}
	updatedDesc := *desc
	generationComparableEnabled := r.store.ClusterSettings().Version.IsActive(cluster.VersionGenerationComparable)

Quattro


pkg/storage/replica_follower_read.go, line 44 at r10 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Are we worried about this lock/unlock?

I don't think we have strong feelings about shaving a mutex lock off follower reads, but correct me if I'm wrong @nvanbenschoten
It'd be nice to avoid it but not at the cost of complicating the flow here (i.e. don't just move the check to the end where an early return could cause it to be missed). I vote leave as-is.
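
To make the trade-off concrete, here is roughly the shape of the check in question; this is a sketch with assumed accessor and constant names (GetType, roachpb.LEARNER), not the literal replica_follower_read.go code:

```go
// Sketch of the follower-read gate being discussed.
func canServeFollowerReadSketch(r *storage.Replica) bool {
	// GetReplicaDescriptor acquires the replica mutex internally; this is the
	// lock/unlock being discussed above.
	repDesc, err := r.GetReplicaDescriptor()
	if err != nil {
		return false
	}
	// Learners can lag arbitrarily and are expected to be short-lived, so
	// never serve follower reads from them.
	if repDesc.GetType() == roachpb.LEARNER {
		return false
	}
	// ... closed-timestamp checks (maxClosed, etc.) would follow here.
	return true
}
```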


pkg/storage/replica_learner_test.go, line 42 at r12 (raw file):

)

// TODO(dan): Test learners and quota pool.

These are intentionally left as follow-ups, right? Just double checking.

@danhhz danhhz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz, @jeffrey-xiao, @nvanbenschoten, and @tbg)


pkg/storage/replica_command.go, line 947 at r12 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is parroting how @jeffrey-xiao set things up, right? Just want to make sure we're not supposed to unconditionally increment.

Yeah, it used to be

```go
generationComparableEnabled := r.store.ClusterSettings().Version.IsActive(cluster.VersionGenerationComparable)
rangeID := desc.RangeID
updatedDesc := *desc
if generationComparableEnabled {
	updatedDesc.IncrementGeneration()
	updatedDesc.GenerationComparable = proto.Bool(true)
}
updatedDesc.SetReplicas(desc.Replicas().DeepCopy())
```

Now that I'm looking at this again, though, I think I should just move it into `execChangeReplicasTxn` so I can have it once instead of 4x.
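
A minimal sketch of what that consolidation might look like; the helper name is made up, and it simply hoists the version-gated bump shown above into one place:

```go
// Hypothetical helper hoisting the repeated cluster-version-gated generation
// bump out of the individual call sites.
func maybeBumpGeneration(st *cluster.Settings, desc *roachpb.RangeDescriptor) {
	if st.Version.IsActive(cluster.VersionGenerationComparable) {
		desc.IncrementGeneration()
		desc.GenerationComparable = proto.Bool(true)
	}
}
```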

pkg/storage/replica_learner_test.go, line 42 at r12 (raw file):

Previously, tbg (Tobias Grieger) wrote…

These are intentionally left as follow-ups, right? Just double checking.

Yup. All are listed in the followup issue

@danhhz danhhz force-pushed the learners branch 2 times, most recently from aaaaf19 to 17595a1 on July 19, 2019 16:20
@danhhz danhhz requested a review from a team July 19, 2019 16:20
A learner is a participant in a raft group that accepts messages but doesn't
vote. This means it doesn't affect raft quorum and thus doesn't affect
the fragility of the range, even if it's very far behind or many
learners are down.

At the time of writing, learners are used in CockroachDB as an interim
state while adding a replica. A learner replica is added to the range
via raft ConfChange, a raft snapshot (of type LEARNER) is sent to catch
it up, and then a second ConfChange promotes it to a full replica.

This means that learners are currently always expected to have a short
lifetime, approximately the time it takes to send a snapshot. Ideas have
been kicked around to use learners with follower reads, which could be a
cheap way to allow many geographies to have local reads without
affecting write latencies. If implemented, these learners would have
long lifetimes.

For simplicity, CockroachDB treats learner replicas the same as voter
replicas as much as possible, but there are a few exceptions:

- Learner replicas are not considered when calculating quorum size, and
  thus do not affect the computation of which ranges are
  under-replicated for upreplication/alerting/debug/etc purposes. Ditto
  for over-replicated.
- Learner replicas cannot become raft leaders, so we also don't allow
  them to become leaseholders. As a result, DistSender and the various
  oracles don't try to send them traffic.
- The raft snapshot queue does not send snapshots to learners for
  reasons described below.
- Merges won't run while a learner replica is present.

Replicas are now added in two ConfChange transactions. The first creates
the learner and the second promotes it to a voter. If the node that is
coordinating this dies in the middle, we're left with an orphaned
learner. For this reason, the replicate queue always first removes any
learners it sees before doing anything else. We could instead try to
finish off the learner snapshot and promotion, but this is more
complicated and it's not yet clear the efficiency win is worth it.

This introduces some rare races between the replicate queue and
AdminChangeReplicas or if a range's lease is moved to a new owner while
the old leaseholder is still processing it in the replicate queue. These
races are handled by retrying if a learner disappears during the
snapshot/promotion.

If the coordinator otherwise encounters an error while sending the
learner snapshot or promoting it (which can happen for a number of
reasons, including the node getting the learner going away), it tries to
clean up after itself by rolling back the addition of the learner.

There is another race between the learner snapshot being sent and the
raft snapshot queue happening to check the replica at the same time,
also sending it a snapshot. This is safe but wasteful, so the raft
snapshot queue won't try to send snapshots to learners.

Merges are blocked if either side has a learner (to avoid working out
the edge cases) but it's historically turned out to be a bad idea to get
in the way of splits, so we allow them even when some of the replicas
are learners. This orphans a learner on each side of the split (the
original coordinator will not be able to finish either of them), but the
replication queue will eventually clean them up.

Learner replicas don't affect quorum but they do affect the system in
other ways. The most obvious way is that the leader sends them the raft
traffic it would send to any follower, consuming resources. More
surprising is that once the learner has received a snapshot, it's
considered by the quota pool that prevents the raft leader from getting
too far ahead of the followers. This is because a learner (especially
one that already has a snapshot) is expected to very soon be a voter, so
we treat it like one. However, it means a slow learner can slow down
regular traffic, which is possibly counterintuitive.

Release note (general change): Replicas are now added using a raft
learner and going through the normal raft snapshot process to catch them
up, eliminating technical debt. No user facing changes are expected.
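
To make the two-ConfChange flow in the commit message above concrete, here is a highly simplified sketch; the function names are illustrative, not the actual replica_command.go API:

```go
// addReplicaViaLearner sketches the add path: add a learner, send it a
// LEARNER snapshot, then promote it to a voter; roll the learner back if the
// snapshot or promotion fails so we don't leave an orphan behind (the
// replicate queue would eventually remove it anyway).
func addReplicaViaLearner(ctx context.Context, r *storage.Replica, target roachpb.ReplicationTarget) error {
	// ConfChange #1: add the target as a raft learner. It receives raft
	// traffic but doesn't count toward quorum.
	if err := r.addLearner(ctx, target); err != nil {
		return err
	}
	// Catch the learner up with a snapshot of type LEARNER.
	if err := r.sendLearnerSnapshot(ctx, target); err != nil {
		r.rollbackLearner(ctx, target)
		return err
	}
	// ConfChange #2: promote the learner to a full voter.
	if err := r.promoteLearner(ctx, target); err != nil {
		r.rollbackLearner(ctx, target)
		return err
	}
	return nil
}
```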

danhhz commented Jul 19, 2019

Thanks for the review!

bors r=tbg

craig bot pushed a commit that referenced this pull request Jul 19, 2019
38149: storage: use raft learners in replica addition, defaulted off r=tbg a=danhhz



Co-authored-by: Daniel Harrison <[email protected]>

craig bot commented Jul 20, 2019

Build succeeded

@craig craig bot merged commit 844eac4 into cockroachdb:master Jul 20, 2019
```diff
@@ -445,11 +445,18 @@ func (b *propBuf) FlushLockedWithRaftGroup(raftGroup *raft.RawNode) error {
 			continue
 		}
 
-		if err := raftGroup.ProposeConfChange(raftpb.ConfChange{
+		confChange := raftpb.ConfChange{
 			Type: changeTypeInternalToRaft[crt.ChangeType],
```
Member left a comment

Does it make sense to keep changeTypeInternalToRaft around anymore, now that there's not a 1:1 mapping between roachpb.ReplicaChangeType and raftpb.ConfChangeType?

Member left a comment

Didn't see this until just now, but it's actually getting removed in #39485!
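
For anyone curious, a sketch of what replacing the map with an explicit mapping could look like, given that an ADD_REPLICA change can now correspond to either a voter or a learner conf change; this is illustrative only, not the change made in #39485:

```go
// confChangeTypeFor maps a replica change plus the desired replica role to a
// raft conf change type, making the non-1:1 mapping explicit.
func confChangeTypeFor(changeType roachpb.ReplicaChangeType, asLearner bool) raftpb.ConfChangeType {
	switch changeType {
	case roachpb.ADD_REPLICA:
		if asLearner {
			return raftpb.ConfChangeAddLearnerNode
		}
		return raftpb.ConfChangeAddNode
	case roachpb.REMOVE_REPLICA:
		return raftpb.ConfChangeRemoveNode
	default:
		panic(fmt.Sprintf("unknown change type %v", changeType))
	}
}
```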

@danhhz danhhz deleted the learners branch April 3, 2020 17:02