
kvserver: make AdminRelocateRange work with non-voting replicas #56197

Merged

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Nov 2, 2020

This commit teaches RelocateRange to work with non-voters. This lets
the merge queue rebalance a range that has non-voters so the merge can
actually proceed, as range merges require that the replica sets of the
LHS and RHS ranges be aligned.

Makes progress on #51943

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch 2 times, most recently from 24db83c to 7125b86 on November 10, 2020 02:46
@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch 3 times, most recently from 5a9bfde to 0a4436c on November 18, 2020 15:51
@aayushshah15 aayushshah15 marked this pull request as ready for review November 18, 2020 16:37
@aayushshah15 aayushshah15 requested a review from a team as a code owner November 18, 2020 16:37
@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch from 0a4436c to 51f993a on November 19, 2020 02:01
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 21 of 21 files at r1, 17 of 17 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/batch.go, line 690 at r2 (raw file):

func (b *Batch) adminRelocateRange(
	key interface{},
	voterTargets []roachpb.ReplicationTarget,

nit: it might be cleaner to do voterTargets, nonVoterTargets []roachpb.ReplicationTarget. Same thing in a few places in this commit.
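For illustration, a minimal sketch of the parameter style this nit suggests; consecutive parameters of the same type can share a single type in Go. The body is a placeholder, not the actual implementation:

func (b *Batch) adminRelocateRange(
	key interface{},
	voterTargets, nonVoterTargets []roachpb.ReplicationTarget,
) {
	// Construct and enqueue the AdminRelocateRange request here.
}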


pkg/kv/kvclient/kvcoord/integration_test.go, line 27 at r1 (raw file):

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
	hlc2 "github.com/cockroachdb/cockroach/pkg/util/hlc"

Are these hlc2 changes intentional?


pkg/kv/kvserver/allocator_scorer.go, line 413 at r2 (raw file):

	candidateStores StoreList,
	constraints constraint.AnalyzedConstraints,
	existing []roachpb.ReplicaDescriptor,

existingReplicas to stay in sync with allocateTargetFromList?


pkg/kv/kvserver/client_merge_test.go, line 3652 at r2 (raw file):

}

func TestMergeQueueSeesNonVoter(t *testing.T) {

Nice test!


pkg/kv/kvserver/client_merge_test.go, line 3721 at r2 (raw file):

			require.Nil(t, err)
			// We're going to split the dummy range created above with an empty
			// expiration time. Disable the merge queue before splitting so that the

Was this the only reason we added the expiration time in the previous commit? If so, I wonder if we can use an AdminUnsplitRequest here instead.


pkg/kv/kvserver/client_replica_test.go, line 2211 at r2 (raw file):

	db := tc.Servers[0].DB()
	require.Nil(t, db.AdminRelocateRange(
		ctx, key, makeReplicationTargets(1, 2, 3), makeReplicationTargets(),

Same point as elsewhere. Should we just pass nil here? This function call isn't obviously variadic, so it's not clear what it's meant to return when reading this code.


pkg/kv/kvserver/client_replica_test.go, line 2233 at r2 (raw file):

rand.Intn(2) < 1

minor nit: rand.Intn(2) == 0 is the more canonical way to do a coin flip.


pkg/kv/kvserver/closed_timestamp_test.go, line 905 at r1 (raw file):

setupTestClusterWithDummyRange

Is this name still correct?


pkg/kv/kvserver/closed_timestamp_test.go, line 1185 at r1 (raw file):

	dbName string,
	numNodes int,
	callback postSetupCallback,

What's the purpose of this callback? Can't a caller use the returned TestClusterInterface to create a new conn if they need one?


pkg/kv/kvserver/merge_queue.go, line 299 at r2 (raw file):

			return false,
				errors.Errorf(
					`cannot merge because lhs contains replicas that are neither NON_VOTER nor VOTER_FULL: %v`,

Should we use the same new language about "when lhs is in a joint state" here?


pkg/kv/kvserver/merge_queue.go, line 338 at r2 (raw file):

			}
		}
		// We only care about merging ranges that are aligned with regards to their

So this will try to line up lhs voters with rhs voters and lhs non-voters with rhs non-voters? That seems reasonable, though maybe not strictly necessary, right? Does the replica type on the rhs of the merge need to match up with the replica type on the lhs of the merge? Is there even a meaningful difference between replica types once a range has been subsumed and all replicas have caught up on the log?

I'm mostly interested in this if this is causing superfluous data movement. Will the following config require any movement of data or will we be able to promote and demote voters/non-voters on the RHS in place?

lhs: voters = {1}, non-voters = {2}
rhs: voters = {2}, non-voters = {1}

If we can do everything in place, then this seems fine. But if not, then it seems worth digging into whether this is all necessary.


pkg/kv/kvserver/replica_command.go, line 2128 at r2 (raw file):

	rangeDesc = *newDesc

	if len(nonVoterTargets) > 0 {

Any reason to do the non-voters first?


pkg/kv/kvserver/replica_command.go, line 2257 at r2 (raw file):

	isVoter bool,
) ([]roachpb.ReplicationChange, *roachpb.ReplicationTarget, error) {
	if repls := desc.Replicas(); len(repls.Learners()) != 0 || repls.InAtomicReplicationChange() {

Can we use the new predicate you added here?


pkg/kv/kvserver/replica_command.go, line 2430 at r2 (raw file):

}

func targetsToRemove(

nit: you call targetsToAdd first, so might as well keep the same order here.


pkg/kv/kvserver/replica_command.go, line 2430 at r2 (raw file):

}

func targetsToRemove(

Are these the same function, just with flipped args? If so, can/should we avoid the duplication?


pkg/kv/kvserver/replica_rangefeed_test.go, line 117 at r1 (raw file):

	// Split the range so that the RHS uses epoch-based leases.
	startKey := []byte("a")
	tc.SplitRangeOrFatal(t, startKey, hlc.MaxTimestamp)

Would it be worth adding a SplitRangeOrFatalWithExpiration and having SplitRangeOrFatal call into this with hlc.MaxTimestamp as the expiration to avoid all of these changes? It doesn't seem like very many callers want to worry about the expiration time.
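For illustration, a rough sketch of the wrapper being proposed here; the method names and signatures are approximate, not the exact testcluster API:

func (tc *TestCluster) SplitRangeOrFatal(
	t *testing.T, splitKey roachpb.Key,
) (roachpb.RangeDescriptor, roachpb.RangeDescriptor) {
	// Delegate to the expiration-aware variant with "never expires".
	return tc.SplitRangeOrFatalWithExpiration(t, splitKey, hlc.MaxTimestamp)
}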


pkg/kv/kvserver/testing_knobs.go, line 238 at r2 (raw file):

	BeforeSnapshotSSTIngestion func(IncomingSnapshot, SnapshotRequest_Type, []string) error
	// BeforeRelocateOne intercepts the return values of s.relocateOne
	// before they're being put into effect.

Was this intentional? It doesn't look like the comment was more than 80 chars.


pkg/roachpb/metadata_replicas.go, line 259 at r2 (raw file):

// VotersAndNonVoters returns the list of VOTER_FULL/NON_VOTER replicas in the
// set.

Similar to the comment on ReplicaDescriptors.Voters, it seems worthwhile to explain the meaning of this subset of replicas here. Why is this combination special?


pkg/sql/relocate.go, line 135 at r2 (raw file):

	} else {
		if err := params.p.ExecCfg().DB.AdminRelocateRange(
			params.ctx, rowKey, relocationTargets, []roachpb.ReplicationTarget{},

Can we just pass nil here and avoid the allocation? Same thing in a few places.
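As background on why nil is preferable to an allocated empty slice here, a small self-contained example; the target type below is a stand-in, not roachpb.ReplicationTarget:

package main

import "fmt"

// target is a placeholder struct for this example only.
type target struct{ nodeID, storeID int }

func main() {
	var nonVoters []target // nil slice: no backing array is allocated
	fmt.Println(nonVoters == nil, len(nonVoters)) // prints: true 0
	for range nonVoters {
		// Never executes for a nil slice.
	}
	// append works on a nil slice exactly as it does on an empty one.
	nonVoters = append(nonVoters, target{nodeID: 1, storeID: 1})
	fmt.Println(len(nonVoters)) // prints: 1
}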

@aayushshah15
Contributor Author


pkg/kv/kvserver/merge_queue.go, line 338 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So this will try to line up lhs voters with rhs voters and lhs non-voters with rhs non-voters? That seems reasonable, though maybe not strictly necessary, right? Does the replica type on the rhs of the merge need to match up with the replica type on the lhs of the merge? Is there even a meaningful difference between replica types once a range has been subsumed and all replicas have caught up on the log?

I'm mostly interested in this if this is causing superfluous data movement. Will the following config require any movement of data or will we be able to promote and demote voters/non-voters on the RHS in place?

lhs: voters = {1}, non-voters = {2}
rhs: voters = {2}, non-voters = {1}

If we can do everything in place, then this seems fine. But if not, then it seems worth digging into whether this is all necessary.

The example you gave would not cause superfluous data movement. See the !replicaSetsEqual predicate above. This doesn't seem to require anything in the way of voter demotion or non-voter promotion since replica types don't need to match up between the left and right ranges.

However, when the replica sets are not equal, I thought it made sense to try to line up the voters with voters and non-voters with non-voters because it would preserve the status quo with regards to constraints conformance. The merge queue will only get to this point in the code if the zone configs for the ranges in question are the same, which means that, unless we're already in violation of a zone config constraint, it must be legal for us to try and line up non-voter and voter sets individually. We can't say the same about other configurations that align replica sets.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/client_relocate_range_test.go, line 157 at r2 (raw file):

		// Expect two single additions, and that's it.
		requireNumAtomic(0, 2, func() int {
			return relocateAndCheck(t, tc, k, targets, []roachpb.ReplicationTarget{})

you haven't used nils on purpose here?


pkg/kv/kvserver/closed_timestamp_test.go, line 908 at r1 (raw file):

// the {dbname}.kv table and performs splits on the table's range such that the
// 2 resulting ranges contain exactly one of the rows each.
func splitDummyRangeInTestCluster(

This function takes in a dbName but hardcodes the table name. Is this useful? I think it should go one way or the other.


pkg/kv/kvserver/closed_timestamp_test.go, line 933 at r1 (raw file):

	}
	tcImpl := tc.(*testcluster.TestCluster)
	// Split at `k` so that the `kv` table has exactly two ranges: [1,2) and [2,

Have you considered using alter table split at (2)?


pkg/kv/kvserver/closed_timestamp_test.go, line 943 at r1 (raw file):

	}
	leftDesc, rightDesc := tcImpl.SplitRangeOrFatal(t, k, splitExpirationTime)
	if tc.ReplicationMode() != base.ReplicationManual {

Why don't we do this on ReplicationManual? Do we depend on any queues here?


pkg/kv/kvserver/closed_timestamp_test.go, line 1185 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's the purpose of this callback? Can't a caller use the returned TestClusterInterface to create a new conn if they need one?

+1


pkg/kv/kvserver/closed_timestamp_test.go, line 1195 at r1 (raw file):

-- forever under high load (testrace under high concurrency).
SET statement_timeout='30s';
CREATE DATABASE %[1]s;

whoa I didn't know this was a thing


pkg/kv/kvserver/merge_queue.go, line 294 at r2 (raw file):

	// Defensive sanity check that the ranges involved only have either VOTER_FULL
	// or NON_VOTER replicas.

nit: VOTER_FULL and NON_VOTER replicas
:)


pkg/kv/kvserver/merge_queue.go, line 298 at r2 (raw file):

		if typ := lhsReplicas[i].GetType(); !(typ == roachpb.VOTER_FULL || typ == roachpb.NON_VOTER) {
			return false,
				errors.Errorf(

If I am to read anything into the "defensive sanity check" part, this should be an AssertionError? Same below.
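For illustration, a self-contained sketch of that suggestion, assuming the github.com/cockroachdb/errors package; the function and values below are placeholders, not the merge queue's actual code:

package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// checkReplicaType is a toy stand-in for the defensive check in the
// merge queue.
func checkReplicaType(replicaType string) error {
	if replicaType != "VOTER_FULL" && replicaType != "NON_VOTER" {
		// AssertionFailedf flags an internal invariant violation, as
		// opposed to an ordinary error the caller is expected to handle.
		return errors.AssertionFailedf("unexpected replica type: %s", replicaType)
	}
	return nil
}

func main() {
	fmt.Println(checkReplicaType("LEARNER"))
}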


pkg/kv/kvserver/merge_queue.go, line 314 at r2 (raw file):

	}

	if !replicaSetsEqual(lhsReplicas, rhsReplicas) {

let's rename replicaSetsEqual to something that makes it more clear that the replica types don't matter. For example replicasCollocated.


pkg/kv/kvserver/merge_queue.go, line 314 at r2 (raw file):

	}

	if !replicaSetsEqual(lhsReplicas, rhsReplicas) {

please give this whole block a comment about what it'll do if !replicaSetsEqual


pkg/kv/kvserver/merge_queue.go, line 338 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

The example you gave would not cause superfluous data movement. See the !replicaSetsEqual predicate above. This doesn't seem to require anything in the way of voter demotion or non-voter promotion since replica types don't need to match up between the left and right ranges.

However, when the replica sets are not equal, I thought it made sense to try to line up the voters with voters and non-voters with non-voters because it would preserve the status quo with regards to constraints conformance. The merge queue will only get to this point in the code if the zone configs for the ranges in question are the same, which means that, unless we're already in violation of a zone config constraint, it must be legal for us to try and line up non-voter and voter sets individually. We can't say the same about other configurations that align replica sets.

So then the example we should be discussing is something like

lhs: voters = {1}, non-voters = {2,3}
rhs: voters = {2}, non-voters = {1,4}

It seems to me that we should make it such that only one replica needs to change here, and the voters for the lhs and rhs can stay as they are.
Btw, it'd be nice if we put a comment around here about what the conditions for the merge are. Say that voters vs non-voters doesn't matter, and say that the leaseholders can be on different nodes (right?).


pkg/kv/kvserver/merge_queue.go, line 341 at r2 (raw file):

		// zone config constraints. In practice, this means that we likely only care
		// about aligning the individual replica sets of the Voters() and
		// NonVoters().

I don't really understand what this comment is trying to tell me.
Can we make the "We only care about merging ranges that are aligned ..." part more precise? Who's we and what happens if this code is asked to merge ranges with different constraints? What does it mean for a range (or for a pair of ranges?) to be "aligned with regards to their zone config constraints"? If what you want to say is that this isn't called for ranges with different zone configs, then I think it'd be much clearer to say exactly that.

And I don't understand the "in practice" and "likely" parts :). What more could we care about?


pkg/kv/kvserver/replica_rangefeed_test.go, line 117 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would it be worth adding a SplitRangeOrFatalWithExpiration and having SplitRangeOrFatal call into this with hlc.MaxTimestamp as the expiration to avoid all of these changes? It doesn't seem like very many callers want to worry about the expiration time.

+1
If for whatever reason you want to leave it like this, at least make a const NeverExpires = hlc.MaxTimestamp and use it in all the callers to make them self-documenting.
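A small sketch of that suggestion; note that hlc.Timestamp is a struct type, so Go does not allow a true const here, and a package-level var is the closest equivalent (the name NeverExpires is the reviewer's, not an existing symbol):

// NeverExpires makes split-expiration arguments self-documenting at
// call sites, e.g. tc.SplitRangeOrFatal(t, startKey, NeverExpires).
var NeverExpires = hlc.MaxTimestamp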

@nvanbenschoten nvanbenschoten added the A-multiregion Related to multi-region label Dec 2, 2020
@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch from 51f993a to 38b8aef on December 4, 2020 18:02
@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch 10 times, most recently from aefd3b8 to 1247a2f on December 15, 2020 08:18
Contributor Author

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/batch.go, line 690 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: it might be cleaner to do voterTargets, nonVoterTargets []roachpb.ReplicationTarget. Same thing in a few places in this commit.

done.


pkg/kv/kvclient/kvcoord/integration_test.go, line 27 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are these hlc2 changes intentional?

these imports are all gone now after I made the change to add a new SplitRangeWithExpiration method.


pkg/kv/kvserver/allocator_scorer.go, line 413 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

existingReplicas to stay in sync with allocateTargetFromList?

done


pkg/kv/kvserver/client_merge_test.go, line 3721 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this the only reason we added the expiration time in the previous commit? If so, I wonder if we can use an AdminUnsplitRequest here instead.

I find the empty expiration bit approach cleaner, but I can make the change if you insist.


pkg/kv/kvserver/client_relocate_range_test.go, line 157 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you haven't used nils on purpose here?

a little embarrassed to admit that I didn't know about the nuance between empty and nil slices, changed to nil slices.


pkg/kv/kvserver/client_replica_test.go, line 2211 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Same point as elsewhere. Should we just pass nil here? This function call isn't obviously variadic, so it's not clear what it's meant to return when reading this code.

Done.


pkg/kv/kvserver/client_replica_test.go, line 2233 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
rand.Intn(2) < 1

minor nit: rand.Intn(2) == 0 is the more canonical way to do a coin flip.

Changed.


pkg/kv/kvserver/closed_timestamp_test.go, line 905 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
setupTestClusterWithDummyRange

Is this name still correct?

I think so, unless I'm missing something here


pkg/kv/kvserver/closed_timestamp_test.go, line 908 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This function takes in a dbName but hardcodes the table name. Is this useful? I think it should go one way or the other.

added a tableName param.


pkg/kv/kvserver/closed_timestamp_test.go, line 933 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Have you considered using alter table split at (2)?

yes but that doesn't really make it any cleaner, right? we'd still have to hustle a bit to get the leftDesc and rightDesc rangeDescriptors unless I'm missing some obvious utility method.


pkg/kv/kvserver/closed_timestamp_test.go, line 943 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why don't we do this on ReplicationManual? Do we depend on any queues here?

WaitForFullReplication() forces a scan on the replicate queue, which doesn't work with the queue disabled (because it calls maybeAdd into the queue).


pkg/kv/kvserver/closed_timestamp_test.go, line 1185 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1

you're right, I don't remember why I did that. Fixed.


pkg/kv/kvserver/closed_timestamp_test.go, line 1195 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

whoa I didn't know this was a thing

glad to be of service


pkg/kv/kvserver/merge_queue.go, line 294 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: VOTER_FULL and NON_VOTER replicas
:)

Done.


pkg/kv/kvserver/merge_queue.go, line 298 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If I am to read anything into the "defensive sanity check" part, this should be an AssertionError? Same below.

Done.


pkg/kv/kvserver/merge_queue.go, line 299 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we use the same new language about "when lhs is in a joint state" here?

Changed.


pkg/kv/kvserver/merge_queue.go, line 314 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's rename replicaSetsEqual to something that makes it more clear that the replica types don't matter. For example replicasCollocated.

Done.


pkg/kv/kvserver/merge_queue.go, line 314 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please give this whole block a comment about what it'll do if !replicaSetsEqual

Added a method TargetsToCollocateRHSDuringMerge that should have the comment you want.


pkg/kv/kvserver/merge_queue.go, line 338 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

So then the example we should be discussing is something like

lhs: voters = {1}, non-voters = {2,3}
rhs: voters = {2}, non-voters = {1,4}

It seems to me that we should make it such that only one replica needs to change here, and the voters for the lhs and rhs can stay as they are.
Btw, it'd be nice if we put a comment around here about what the conditions for the merge are. Say that voters vs non-voters doesn't matter, and say that the leaseholders can be on different nodes (right?).

I've revamped the logic here. We now only relocate replicas that are not collocated between LHS and RHS. PTAL.


pkg/kv/kvserver/merge_queue.go, line 341 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't really understand what this comment is trying to tell me.
Can we make the "We only care about merging ranges that are aligned ..." part more precise? Who's we and what happens if this code is asked to merge ranges with different constraints? What does it mean for a range (or for a pair of ranges?) to be "aligned with regards to their zone config constraints"? If what you want to say is that this isn't called for ranges with different zone configs, then I think it'd be much clearer to say exactly that.

And I don't understand the "in practice" and "likely" parts :). What more could we care about?

Removed the comment, PTAL the comment over TargetsToCollocateRHSDuringMerge.


pkg/kv/kvserver/replica_command.go, line 2128 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Any reason to do the non-voters first?

I can't particularly rationalize doing either first. Can you think of anything that supports relocating one set over the other?


pkg/kv/kvserver/replica_command.go, line 2257 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we use the new predicate you added here?

Done.


pkg/kv/kvserver/replica_rangefeed_test.go, line 117 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1
If for whatever reason you want to leave it like this, at least make a const NeverExpires = hlc.MaxTimestamp and use it in all the callers to make them self-documenting.

Added a SplitRangeWithExpiration.


pkg/kv/kvserver/testing_knobs.go, line 238 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this intentional? It doesn't look like the comment was more than 80 chars.

I had renamed the knob to something else and then changed it back. Fixed the wrapping now.


pkg/roachpb/metadata_replicas.go, line 259 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Similar to the comment on ReplicaDescriptors.Voters, it seems worthwhile to explain the meaning of this subset of replicas here. Why is this combination special?

Done. Let me know if it seems lacking.


pkg/sql/relocate.go, line 135 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we just pass nil here and avoid the allocation? Same thing in a few places.

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 31 of 31 files at r3, 50 of 50 files at r4, 19 of 19 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/closed_timestamp_test.go, line 1158 at r3 (raw file):

// corresponding to this table.
func setupTestClusterWithDummyRange(
	t *testing.T, clusterArgs base.TestClusterArgs, dbName string, tableName string, numNodes int,

nit: dbName, tableName string


pkg/kv/kvserver/merge_queue.go, line 374 at r5 (raw file):

}

// TargetsToCollocateRHSDuringMerge computes the stores which the voters and

Does this need to be exported? Also, should it live next to replicasCollocated?


pkg/kv/kvserver/merge_queue.go, line 383 at r5 (raw file):

// why or why not.
func TargetsToCollocateRHSDuringMerge(
	leftRepls roachpb.ReplicaSet, rightRepls roachpb.ReplicaSet,

nit: leftRepls, rightRepls roachpb.ReplicaSet


pkg/kv/kvserver/merge_queue.go, line 384 at r5 (raw file):

func TargetsToCollocateRHSDuringMerge(
	leftRepls roachpb.ReplicaSet, rightRepls roachpb.ReplicaSet,
) ([]roachpb.ReplicationTarget, []roachpb.ReplicationTarget) {

nit: give these names. That will allow you to eliminate the duplicate type
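For illustration, a sketch of the named-result style being suggested; the function name mirrors the one under review, and the body is a placeholder:

func TargetsToCollocateRHSDuringMerge(
	leftRepls, rightRepls roachpb.ReplicaSet,
) (voterTargets, nonVoterTargets []roachpb.ReplicationTarget) {
	// Naming the results documents which slice is which and lets the
	// identical return types be declared once.
	return voterTargets, nonVoterTargets
}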


pkg/kv/kvserver/merge_queue.go, line 391 at r5 (raw file):

	// Sets of replicas that exist on the LHS but not on the RHS
	nonCollocated := leftRepls.Filter(notInRight)

I found it difficult to remember which of these 5 vars were referring to the LHS and which were referring to the RHS. Consider renaming to make that more clear


pkg/kv/kvserver/merge_queue.go, line 396 at r5 (raw file):

	// We bootstrap our result set by first including the replicas (voting and
	// non-voting) that _are_ collocated, as these will stay unchanged.

I always find it difficult to remember what a ReplicationTarget means if the range already has a replica there. Consider answering this question here by saying something like "will stay unchanged and will result in a no-op when passed through AdminRelocateRange".


pkg/kv/kvserver/merge_queue.go, line 410 at r5 (raw file):

			finalVoters.AddReplica(nonCollocatedNonVoters[0])
			nonCollocatedNonVoters = nonCollocatedNonVoters[1:]
		}

What happens if we fall through here? Should this be possible? If so, would we loop indefinitely? If that's not possible, then we should add an assertion.


pkg/kv/kvserver/replica_command.go, line 2128 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I can't particularly rationalize doing either first. Can you think of anything that supports relocating one set over the other?

No strong reason, other than that this looks deliberate (because voters comes before non-voters everywhere else), which might trip up readers.


pkg/kv/kvserver/replica_command.go, line 2161 at r4 (raw file):

func checkDescsEqual(desc *roachpb.RangeDescriptor) func(*roachpb.RangeDescriptor) bool {
	// TODO(jeffreyxiao): This hacky fix ensures that we don't fail the

What changed to make this needed again?


pkg/kv/kvserver/replica_command.go, line 2574 at r5 (raw file):

}

func diff(first, second []roachpb.ReplicationTarget) (diff []roachpb.ReplicaDescriptor) {

Let's give this a better name. Having a diff function in the pkg/kv/kvserver namespace is going to be pretty confusing.

Also, give it a quick explanation. What's being diffed? How do first and second relate? What's the result?


pkg/roachpb/metadata_replicas.go, line 259 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Done. Let me know if it seems lacking.

Isn't this also the set that we allow follower reads to evaluate on? If so, maybe talk a bit about that.

Or does that set also include VOTER_INCOMING?


pkg/roachpb/metadata_replicas.go, line 55 at r4 (raw file):

}

// ReplicaSet is a set of replicas, usually the nodes/stores on which

This all seems fine, but it's also a huge diff. Do you mind talking about the motivation a bit more? You said "This is because the current naming makes it hard/confusing to add new methods that return another ReplicaDescriptors instead of the underlying slice.". Could you give an example of where this became awkward?


pkg/roachpb/metadata_replicas.go, line 64 at r4 (raw file):

// descriptors.
//
// All construction of ReplicaSet is required to go through this method

nit: rewrap this line


pkg/roachpb/metadata_replicas.go, line 338 at r5 (raw file):

// Contains returns true if the set contains rDesc.
func (d ReplicaSet) Contains(rDesc ReplicaDescriptor) bool {
	for _, repl := range d.Descriptors() {

We should avoid the memory copy of the ReplicaDescriptor on each iteration. Instead, use

descs := d.Descriptors()
for i := range descs {
    repl := &descs[i]
    ...
}

pkg/roachpb/metadata_replicas.go, line 469 at r5 (raw file):

func (d ReplicaSet) ReplicationTargets() (out []ReplicationTarget) {
	for _, r := range d.Descriptors() {
		out = append(out, ReplicationTarget{NodeID: r.NodeID, StoreID: r.StoreID})

Let's pre-allocate the slice's length so we're not wasting work.

And let's avoid the descriptor copy, as discussed above.
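A sketch combining both suggestions (pre-sizing the result and indexing into the slice instead of copying each descriptor); not necessarily the final code:

func (d ReplicaSet) ReplicationTargets() []ReplicationTarget {
	descs := d.Descriptors()
	out := make([]ReplicationTarget, 0, len(descs)) // pre-size to avoid regrowth
	for i := range descs {
		out = append(out, ReplicationTarget{
			NodeID:  descs[i].NodeID,
			StoreID: descs[i].StoreID,
		})
	}
	return out
}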


pkg/sql/relocate.go, line 135 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Done.

For this one, give it the /* ... */ treatment.

@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch from 1247a2f to 984795f on December 16, 2020 08:27
This commit extracts some existing testing methods to be more general
so they can be reused for testing merge queue behavior with non-voting
replicas.

Release note: None
This commit renames `ReplicaDescriptors`, which provides some utility
methods over a `[]ReplicaDescriptor`, to `ReplicaSet`. This is because the
current naming makes it hard/confusing to add new methods that return
another `ReplicaDescriptors` instead of the underlying slice.

Release note: None
This commit teaches `RelocateRange` to work with non-voters. This lets
the merge queue rebalance a range that has non-voters so the merge can
actually proceed, as range merges require that the replica sets of the
LHS and RHS ranges be aligned.

Release note: None
@aayushshah15 aayushshah15 force-pushed the relocate-non-voting-replicas2 branch from 9a5ef6f to bfe698a on January 7, 2021 18:23
Contributor Author

@aayushshah15 aayushshah15 left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/replica_command.go, line 2182 at r20 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: consider _ error at the end. You're not using the err variable, and introducing it in scope can only hurt. You're also not using the other named returns, but for those having the name in the signature is beneficial for documentation (whereas for the error it's not)

Done.


pkg/kv/kvserver/replica_command.go, line 2203 at r20 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: loop through the set

Done.

@craig
Contributor

craig bot commented Jan 7, 2021

Build succeeded:

@craig craig bot merged commit c59186a into cockroachdb:master Jan 7, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jan 13, 2021
Before this patch, cockroachdb#56197 broke this test because it changed a range
merge error string that let the KV nemesis validator ignore the error.
This patch updates the regexp the validator uses to match the error and
unskips the test.

Fixes cockroachdb#58624.

Release note: None
nvanbenschoten pushed a commit to nvanbenschoten/cockroach that referenced this pull request Jan 14, 2021
craig bot pushed a commit that referenced this pull request Jan 14, 2021
58888: sql, jobs: stop queuing schema change jobs for in-txn schema changes r=postamar a=postamar

Creating a table and changing its schema within a transaction would
cause a schema change job to be queued. Such jobs are not necessary and
don't do anything. This patch prevents them from being queued in the
first place.

Fixes #45985.

Release note (sql change): Creating a table and changing its schema
within a transaction no longer schedules a schema change job.

58919: kvnemesis: fix and unskip TestKVNemesisMultiNode r=aayushshah15 a=aayushshah15

Before this patch, #56197 broke this test because it changed a range
merge error string that let the KV nemesis validator ignore the error.
This patch updates the regexp the validator uses to match the error and
unskips the test.

Fixes #58624.

Release note: None

58982: delegate: remove is_active_region from SHOW REGIONS FROM DATABASE r=ajstorm a=otan

Due to unpopular demand.

Release note: None

Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 26, 2021
This commit removes the relocation logic used by the `mergeQueue` thus
far to align replica sets (added in cockroachdb#56197). This logic previously
existed in order to allow us to align the replica sets of a pair of
ranges (which is required for the range merge to proceed), while
avoiding redundant data movement.

Before cockroachdb#58627 and the previous commit in this PR, `AdminRelocateRange`
couldn't be directly used by the mergeQueue under various configurations
of the LHS and RHS ranges. Furthermore, even when it could be used, it
would involve redundant data movement. This all required us to compute
relocation targets for the RHS of a merge separately, above the call to
`AdminRelocateRange`, for the range merge to proceed.

All these limitations have been resolved by the previous commit which
teaches `AdminRelocateRange` to promote non-voters and demote voters
when needed, and the aforementioned bespoke relocation logic is no
longer needed.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 28, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 29, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 29, 2021
craig bot pushed a commit that referenced this pull request Mar 29, 2021
62631: kvserver: get rid of bespoke relocation logic used by the mergeQueue r=aayushshah15 a=aayushshah15

This PR contains two main commits:

**kvserver: update `AdminRelocateRange` to leverage explicit swaps of
voters to non-voters**

This commit updates `AdminRelocateRange` to use explicit atomic swaps of
voting replicas with non-voting replicas, that #58627 initially added
support for. The patch does so by generalizing behavior that's already
exercised by the `replicateQueue` when it decides to rebalance replicas.
See #61239.

This allows us, in the next commit, to remove bespoke relocation logic
that's used by the `mergeQueue` to align replica sets for the sake of a
range merge.

Release note: None

**kvserver: get rid of bespoke relocation logic used by the mergeQueue**

This commit removes the relocation logic used by the `mergeQueue` thus
far to align replica sets (added in #56197). This logic previously
existed in order to allow us to align the replica sets of a pair of
ranges (which is required for the range merge to proceed), while
avoiding redundant data movement.

Before #58627 and the previous commit in this PR, `AdminRelocateRange`
couldn't be directly used by the mergeQueue under various configurations
of the LHS and RHS ranges. Furthermore, even when it could be used, it
would involve redundant data movement. This all required us to compute
relocation targets for the RHS of a merge separately, above the call to
`AdminRelocateRange`, for the range merge to proceed.

All these limitations have been resolved by the previous commit which
teaches `AdminRelocateRange` to promote non-voters and demote voters
when needed, and the aforementioned bespoke relocation logic is no
longer needed.

Resolves #62370

Release note: None


Co-authored-by: aayush <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 29, 2021
Labels
A-multiregion Related to multi-region