storage: issue swaps on AllocatorConsiderRebalance #40284
Conversation
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/allocator.go, line 402 at r2 (raw file):
// The dead replica(s) will be down-replicated later.
priority := addDeadReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d dead replicas priority=%.2f",
AllocatorReplaceDead
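In other words, presumably the log tag should change, along the lines of (a sketch; arguments elided as in the quoted line above):
log.VEventf(ctx, 3, "AllocatorReplaceDead - replacement for %d dead replicas priority=%.2f",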
pkg/storage/allocator.go, line 410 at r2 (raw file):
// Range has decommissioning replica(s), which should be replaced.
priority := addDecommissioningReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d decommissioning replicas priority=%.2f",
AllocatorReplaceDecommissioning
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
action, _ := a.ComputeAction(ctx, zone, &desc)
require.Equal(t, c.expectedAction.String(), action.String())
Why do we need the .String()?
pkg/storage/replicate_queue.go, line 344 at r2 (raw file):
existingReplicas := liveVoterReplicas
// WIP(tbg): pass a slice of replicas that can be replaced in.
// In ReplaceDead, the it's the dead replicas, in ReplaceDecommissioning
s/the it's/it's/
pkg/storage/replicate_queue.go, line 423 at r5 (raw file):
ctx, repl, []roachpb.ReplicationChange{{Target: newReplica, ChangeType: roachpb.ADD_REPLICA}},
Can we not use roachpb.MakeReplicationChanges here and below?
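Assuming the helper builds the change slice from a change type plus targets, the call above would collapse to something like:
ctx, repl, roachpb.MakeReplicationChanges(roachpb.ADD_REPLICA, newReplica),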
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
pkg/storage/replicate_queue_test.go, line 140 at r6 (raw file):
// Query the range log to see if anything unexpected happened. Concretely,
// we'll make sure that our tracked ranges never had >=3 replicas.
> 3 replicas
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we need the .String()?
It was printed as %d for some reason.
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
We're removing one from existingReplicas and are adding one that is not in existingReplicas, hence they're not the same, so I think the comment is true -- but looking into this I saw that I'm reinventing the wheel:
cockroach/pkg/storage/allocator.go
Lines 708 to 715 in 8d8e8ff
removeReplica, removeDetails, err := a.simulateRemoveTarget(
	ctx,
	target.store.StoreID,
	zone,
	replicaCandidates,
	replicaCandidates,
	rangeUsageInfo,
)
I should return both the removal and addition targets from `RebalanceTarget`.
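That is, roughly the following shape (a sketch; names, ordering, and the argument list are assumptions, not the final signature):
addTarget, removeTarget, details, ok := a.RebalanceTarget( /* ... */ )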
Force-pushed from b6d5c9c to 6315bd3
Reviewed 5 of 5 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 4 of 4 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
We're removing one from existingReplicas and are adding one that is not in existingReplicas
Yeah, but where are we doing that?
Oh, unless the point is that existingReplicas still contains removeReplica when passed to RebalanceTarget, in which case the "not taking into account the replica we're going to remove" is easy to misinterpret as the opposite of what it's intending to say.
I should return both the removal and addition targets from `RebalanceTarget`.
You're doing that now?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
You're doing that now?
Yep, done
Reviewed 1 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/allocator_test.go, line 609 at r12 (raw file):
{
	var rangeUsageInfo RangeUsageInfo
	target, details, _, ok := a.RebalanceTarget(
Is details in the wrong position?
pkg/storage/allocator_test.go, line 855 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Same question.
pkg/storage/allocator_test.go, line 873 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Mind checking all of these?
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're doing that now?
Yep, done
We don't need to transfer the lease away from removeTarget?
pkg/storage/replicate_queue.go, line 697 at r12 (raw file):
// rebalance. Attempt to find a rebalancing target.
if !rq.store.TestingKnobs().DisableReplicaRebalancing {
	// Then, not taking into account the replica we're going to remove, see
This comment is strange now, since this is the only step.
Force-pushed from 36bbb63 to 8c66ec7
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't need to transfer the lease away from removeTarget?
Good catch, we do. Added.
Reviewed 2 of 2 files at r13.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 705 at r13 (raw file):
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
	log.VEventf(ctx, 1, "want to remove self, but failed to transfer lease away: %s", err)
} else if !done {
nit: I'd do this like:
if !ok {
...
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
...
} else if done {
// Lease is now elsewhere, so we're not in charge any more.
return false, nil
} else {
...
}
This gives you a place to add a comment.
Force-pushed from 8c66ec7 to 1121c4c
Down from 10s to the usual 3-4s for this type of test. Release note: None
Teach the allocator that there is a concept of "replacing" dead and decommissioning replicas. Add a stub handler for them to the replicate queue that does exactly what AllocatorAdd does, i.e., maintain the previous behavior. A follow-up commit will then teach the replicate queue to actually issue replica replacement operations both for these two new options as well as in AllocatorConsiderRebalance. When that is established, we should be in a place where the replicate queue never puts ranges into fragile intermediate configurations except that it will still transition through even configurations when upreplicating (which can be tackled separately). Release note: None
We'll need this to find replacements when rebalancing. Release note: None
This code repeats too much, and we'll need it in more places soon. Release note: None
Release note: None
Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled. Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).
RebalanceTarget was internally already picking out a replica that could be removed. This is now returned to the caller and actually removed during rebalancing. Release note: None
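Concretely, the rebalance path now hands both changes to a single ChangeReplicas invocation, roughly like this (a sketch with illustrative variable names, not the actual queue code):
chgs := []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: addTarget},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: removeTarget},
}
// Issued as one atomic ChangeReplicas call (when atomic membership changes
// are enabled), so the range never passes through an intermediate
// configuration that violates the zone constraints.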
Force-pushed from 1121c4c to f94a83e
TFTR!
bors r=nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon
In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes #39044
Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg
Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled.
Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten
The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section. This was responsible for 8.15% of mutex contention on a YCSB run. The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO.

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell
These tests perform various clock jumps, then reverse them. The reverse can cause a crash even if the original jump did not. Add some sleeps to make things more deterministic and improve the recovery process at the end of the test. Fixes #38723
Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss
The runTests helper will now cause a panic if a vectorized operator tries to access a part of the selection vector that is out of bounds. This identified bugs in the projection operator.
Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
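For #40300, the idea is the standard lock-free counter update; as a generic illustration (not the actual CockroachDB code, the field name is made up):
// Before: mu.Lock(); m.liveBytes += delta; mu.Unlock()
// After: the update itself is a single atomic add, so no critical section is needed.
atomic.AddInt64(&m.liveBytes, delta)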
Build succeeded
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg
As of #40284, the replicate queue was issuing swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once you flip atomic rebalancing on) that when the replication factor is one, there is no way to perform such an atomic swap because it will necessarily have to remove the leaseholder. To work around this restriction (which, by the way, we dislike - see #40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately since we requeue the range) the range will be over-replicated and hopefully the lease will be transferred over and then the original leaseholder removed. I would be very doubtful that this all works, but it is how things worked until #40284, so this PR really just falls back to the previous behavior in cases where we can't do better.
Release note: None
Co-authored-by: Tobias Schottdorf <[email protected]>
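A rough sketch of the fallback described above (the condition and variable names are illustrative, not the actual code):
chgs := []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: addTarget},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: removeTarget},
}
if len(existingReplicas) == 1 {
	// With a single replica there is no way to swap atomically: the removal
	// would have to hit the leaseholder. Fall back to only adding; a later
	// queue pass transfers the lease and then removes the original replica.
	chgs = chgs[:1]
}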