storage: issue swaps on AllocatorConsiderRebalance #40284

Merged
merged 7 commits into cockroachdb:master from the atomic/replicatequeue branch on Aug 28, 2019

Conversation

tbg
Member

@tbg tbg commented Aug 28, 2019

Change the rebalancing code so that it not only looks up a new replica to
add, but also picks one to remove. Both actions are then given to a
ChangeReplicas invocation which will carry it out atomically as long as
that feature is enabled.

Release note (bug fix): Replicas can now be moved between stores without
entering an intermediate configuration that violates the zone constraints.
Violations may still occur during zone config changes, decommissioning, and
in the presence of dead nodes (NB: the remainder will be addressed in a future
change, so merge the corresponding release note)
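
For a concrete picture, here is a minimal, self-contained sketch of that shape with toy types (illustration only, not the actual CockroachDB API): the rebalance decision now carries both an addition and a removal, and both travel in one slice to the ChangeReplicas-style call.

package main

import "fmt"

// Toy stand-ins for the replication-change types discussed in this PR.
type ChangeType string

const (
	AddReplica    ChangeType = "ADD_REPLICA"
	RemoveReplica ChangeType = "REMOVE_REPLICA"
)

type Target struct{ NodeID, StoreID int }

type ReplicationChange struct {
	ChangeType ChangeType
	Target     Target
}

// changeReplicas stands in for the ChangeReplicas invocation; with atomic
// replication changes enabled, both entries are applied as a single step.
func changeReplicas(chgs []ReplicationChange) {
	for _, c := range chgs {
		fmt.Printf("%s n%d/s%d\n", c.ChangeType, c.Target.NodeID, c.Target.StoreID)
	}
}

func main() {
	addTarget := Target{NodeID: 4, StoreID: 4}    // replica the allocator wants to add
	removeTarget := Target{NodeID: 2, StoreID: 2} // replica it picked to remove
	changeReplicas([]ReplicationChange{
		{ChangeType: AddReplica, Target: addTarget},
		{ChangeType: RemoveReplica, Target: removeTarget},
	})
}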

@tbg tbg requested a review from nvanbenschoten August 28, 2019 14:29
@cockroach-teamcity
Member

This change is Reviewable

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/allocator.go, line 402 at r2 (raw file):

		// The dead replica(s) will be down-replicated later.
		priority := addDeadReplacementPriority
		log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d dead replicas priority=%.2f",

AllocatorReplaceDead


pkg/storage/allocator.go, line 410 at r2 (raw file):

		// Range has decommissioning replica(s), which should be replaced.
		priority := addDecommissioningReplacementPriority
		log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d decommissioning replicas priority=%.2f",

AllocatorReplaceDecommissioning


pkg/storage/allocator_test.go, line 5023 at r2 (raw file):

				action, _ := a.ComputeAction(ctx, zone, &desc)
				require.Equal(t, c.expectedAction.String(), action.String())

Why do we need the .String()?


pkg/storage/replicate_queue.go, line 344 at r2 (raw file):

		existingReplicas := liveVoterReplicas
		// WIP(tbg): pass a slice of replicas that can be replaced in.
		// In ReplaceDead, the it's the dead replicas, in ReplaceDecommissioning

s/the it's/it's/


pkg/storage/replicate_queue.go, line 423 at r5 (raw file):

		ctx,
		repl,
		[]roachpb.ReplicationChange{{Target: newReplica, ChangeType: roachpb.ADD_REPLICA}},

Can we not use roachpb.MakeReplicationChanges here and below?
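
For reference, a rough sketch of that suggestion, assuming the helper takes a change type followed by one or more targets (the exact signature is not shown in this review, so treat this as an assumption):

// hypothetical rewrite of the slice literal quoted above
chgs := roachpb.MakeReplicationChanges(roachpb.ADD_REPLICA, newReplica)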


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

not taking into account the replica we're going to remove

I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?


pkg/storage/replicate_queue_test.go, line 140 at r6 (raw file):

	// Query the range log to see if anything unexpected happened. Concretely,
	// we'll make sure that our tracked ranges never had >=3 replicas.

> 3 replicas

@tbg tbg requested a review from nvanbenschoten August 28, 2019 15:30
Member Author

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/allocator_test.go, line 5023 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why do we need the .String()?

It was printed as %d for some reason.


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
not taking into account the replica we're going to remove

I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?

We're removing one from existingReplicas and are adding one that is not in existingReplicas, hence they're not the same, so I think the comment is true, but looking into this I saw that I'm reinventing the wheel:

removeReplica, removeDetails, err := a.simulateRemoveTarget(
	ctx,
	target.store.StoreID,
	zone,
	replicaCandidates,
	replicaCandidates,
	rangeUsageInfo,
)

I should return both the removal and addition target from `RebalanceTarget`.
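
Illustratively, the call shape that suggests (arguments elided; the exact return order is settled further down in this review):

// sketch: RebalanceTarget hands both sides of the swap back to the caller
addTarget, removeTarget, details, ok := a.RebalanceTarget( /* unchanged arguments */ )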

@tbg tbg force-pushed the atomic/replicatequeue branch from b6d5c9c to 6315bd3 Compare August 28, 2019 15:30
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 5 of 5 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 4 of 4 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

We're removing one from existingReplicas and are adding one that is not in existingReplicas

Yeah, but where are we doing that?

Oh, unless the point is that existingReplicas still contains removeReplica when passed to RebalanceTarget, in which case the "not taking into account the replica we're going to remove" comment is easy to misinterpret as the opposite of what it intends to say.

I should return both the removal and addition target from `RebalanceTarget`.

You're doing that now?

@tbg tbg requested a review from nvanbenschoten August 28, 2019 16:31
Member Author

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

You're doing that now?

Yep, done

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 3 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/allocator_test.go, line 609 at r12 (raw file):

		{
			var rangeUsageInfo RangeUsageInfo
			target, details, _, ok := a.RebalanceTarget(

Is details in the wrong position?


pkg/storage/allocator_test.go, line 855 at r12 (raw file):

	sg.GossipStores(stores, t)
	for i := 0; i < 10; i++ {
		target, details, _, ok := a.RebalanceTarget(

Same question.


pkg/storage/allocator_test.go, line 873 at r12 (raw file):

	sg.GossipStores(stores, t)
	for i := 0; i < 10; i++ {
		target, details, _, ok := a.RebalanceTarget(

Mind checking all of these?


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You're doing that now?

Yep, done

We don't need to transfer the lease away from removeTarget?


pkg/storage/replicate_queue.go, line 697 at r12 (raw file):

	// rebalance. Attempt to find a rebalancing target.
	if !rq.store.TestingKnobs().DisableReplicaRebalancing {
		// Then, not taking into account the replica we're going to remove, see

This comment is strange now, since this is the only step.

@tbg tbg force-pushed the atomic/replicatequeue branch 2 times, most recently from 36bbb63 to 8c66ec7 Compare August 28, 2019 18:39
@tbg tbg requested a review from nvanbenschoten August 28, 2019 18:39
Member Author

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replicate_queue.go, line 717 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We don't need to transfer the lease away from removeTarget?

Good catch, we do. Added.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 2 files at r13.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replicate_queue.go, line 705 at r13 (raw file):

		} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
			log.VEventf(ctx, 1, "want to remove self, but failed to transfer lease away: %s", err)
		} else if !done {

nit: I'd do this like:

if !ok {
    ...
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
    ...
} else if done {
    // Lease is now elsewhere, so we're not in charge any more.
    return false, nil
} else {
    ...
}

This gives you a place to add a comment.

@tbg tbg force-pushed the atomic/replicatequeue branch from 8c66ec7 to 1121c4c Compare August 28, 2019 20:10
tbg added 7 commits August 28, 2019 22:11
Down from 10s to the usual 3-4s for this type of test.

Release note: None
Teach the allocator that there is a concept of "replacing" dead and
decommissioning replicas.
Add a stub handler for them to the replicate queue that does exactly what
AllocatorAdd does, i.e., it maintains the previous behavior.

A follow-up commit will then teach the replicate queue to actually issue
replica replacement operations both for these two new options as well as
in AllocatorConsiderRebalance.

When that is established, we should be in a place where the replicate
queue never puts ranges into fragile intermediate configurations except
that it will still transition through even configurations when
upreplicating (which can be tackled separately).

Release note: None
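
Schematically, and with a hypothetical helper name (a sketch of the idea, not the actual replicate-queue code), the new actions initially share the existing add path:

// sketch: the new allocator actions are stubbed to behave like AllocatorAdd
switch action {
case AllocatorAdd, AllocatorReplaceDead, AllocatorReplaceDecommissioning:
	// For now a "replacement" is handled exactly like a plain addition;
	// a follow-up commit turns it into a real swap (add + remove).
	return rq.addOrReplace(ctx, repl, dryRun) // addOrReplace is hypothetical
}
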
We'll need this to find replacements when rebalancing.

Release note: None
This code repeats too much, and we'll need it in more places soon.

Release note: None
Change the rebalancing code so that it not only looks up a new replica
to add, but also picks one to remove. Both actions are then given to a
ChangeReplicas invocation which will carry it out atomically as long
as that feature is enabled.

Release note (bug fix): Replicas can now be moved between stores without
entering an intermediate configuration that violates the zone
constraints. Violations may still occur during zone config changes,
decommissioning, and in the presence of dead nodes (NB: the remainder will be
addressed in a future change, so merge the corresponding release note)
RebalanceTarget was internally already picking out a replica that could
be removed. This is now returned to the caller and actually removed
during rebalancing.

Release note: None
@tbg tbg force-pushed the atomic/replicatequeue branch from 1121c4c to f94a83e Compare August 28, 2019 20:36
@tbg tbg requested a review from nvanbenschoten August 28, 2019 20:55
Member Author

@tbg tbg left a comment

TFTR!

bors r=nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

craig bot pushed a commit that referenced this pull request Aug 28, 2019
40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon

In lookup joins on partial index keys, there is no limit on how many
rows might be returned by any particular lookup, so the joinreader may
be buffering an unbounded number of rows into memory. I changed
joinreader to use a disk-backed row container rather than just storing
the rows in memory with no accounting.

Fixes #39044

Release note (bug fix): Lookup joins now spill to disk if the index
lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg

Change the rebalancing code so that it not only looks up a new replica to
add, but also picks one to remove. Both actions are then given to a
ChangeReplicas invocation which will carry it out atomically as long as
that feature is enabled.

Release note (bug fix): Replicas can now be moved between stores without
entering an intermediate configuration that violates the zone constraints.
Violations may still occur during zone config changes, decommissioning, and
in the presence of dead nodes (NB: the remainder will be addressed in a future
change, so merge the corresponding release note)

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten

The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section.

This was responsible for 8.15% of mutex contention on a YCSB run.

The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO.
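
A small sketch of the pattern (hypothetical type and field names, not the actual StoreMetrics code): each gauge update is a single atomic add, so no mutex is needed around it.

import "sync/atomic"

// updateMVCCGauges applies the deltas with atomic adds instead of taking
// the store-metrics mutex.
func (sm *storeMetrics) updateMVCCGauges(deltaLiveBytes, deltaKeyBytes int64) {
	atomic.AddInt64(&sm.liveBytes, deltaLiveBytes)
	atomic.AddInt64(&sm.keyBytes, deltaKeyBytes)
}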

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell

These tests perform various clock jumps, then reverse them. The
reverse can cause a crash even if the original jump did not.
Add some sleeps to make things more deterministic and improve the
recovery process at the end of the test.

Fixes #38723

Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss

The runTests helper will now cause a panic if a vectorized operator
tries to access a part of the selection vector that is out of bounds.
This identified bugs in the projection operator.

Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig
Contributor

craig bot commented Aug 28, 2019

Build succeeded

@craig craig bot merged commit f94a83e into cockroachdb:master Aug 28, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Aug 30, 2019
As of cockroachdb#40284, the replicate queue issues swaps (atomic add+remove)
during rebalancing.
TestInitialPartitioning helpfully points out (once you flip atomic
rebalancing on) that when the replication factor is one, there is
no way to perform such an atomic swap because it will necessarily
have to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see
cockroachdb#40333), fall back to just adding a replica in this case without also
removing one. In the next scanner cycle (which should happen immediately
since we requeue the range) the range will be over-replicated and
hopefully the lease will be transferred over and then the original
leaseholder removed. I would be very doubtful that this all works,
but it is how things worked until cockroachdb#40284, so this PR really just
falls back to the previous behavior in cases where we can't do
better.

Release note: None
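
To illustrate the fallback (reusing the toy ReplicationChange types from the sketch in the PR description above; existingVoters and the targets are hypothetical names):

// sketch: with a single (leaseholder) replica we cannot swap, so issue only the ADD
chgs := []ReplicationChange{{ChangeType: AddReplica, Target: addTarget}}
if len(existingVoters) > 1 {
	// Only then can the REMOVE avoid targeting the leaseholder itself.
	chgs = append(chgs, ReplicationChange{ChangeType: RemoveReplica, Target: removeTarget})
}
changeReplicas(chgs)
// Without the REMOVE the range ends up over-replicated and is requeued, so the
// lease can move before the original leaseholder is removed on a later pass.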
craig bot pushed a commit that referenced this pull request Sep 3, 2019
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg

As of #40284, the replicate queue issues swaps (atomic add+remove)
during rebalancing. TestInitialPartitioning helpfully points out (once you
flip atomic rebalancing on) that when the replication factor is one, there
is no way to perform such an atomic swap because it will necessarily have
to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see
#40333), fall back to just adding a replica in this case without also
removing one. In the next scanner cycle (which should happen immediately
since we requeue the range) the range will be over-replicated and hopefully
the lease will be transferred over and then the original leaseholder
removed. I would be very doubtful that this all works, but it is how things
worked until #40284, so this PR really just falls back to the previous
behavior in cases where we can't do better.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>