storage: issue swaps on AllocatorConsiderRebalance #40284
Conversation
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/allocator.go, line 402 at r2 (raw file):
// The dead replica(s) will be down-replicated later.
priority := addDeadReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d dead replicas priority=%.2f",
AllocatorReplaceDead
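In other words, presumably the log tag should change, along the lines of (a sketch; arguments elided as in the quoted line above):
log.VEventf(ctx, 3, "AllocatorReplaceDead - replacement for %d dead replicas priority=%.2f",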
pkg/storage/allocator.go, line 410 at r2 (raw file):
// Range has decommissioning replica(s), which should be replaced.
priority := addDecommissioningReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d decommissioning replicas priority=%.2f",
AllocatorReplaceDecommissioning
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
action, _ := a.ComputeAction(ctx, zone, &desc)
require.Equal(t, c.expectedAction.String(), action.String())
Why do we need the .String()?
pkg/storage/replicate_queue.go, line 344 at r2 (raw file):
existingReplicas := liveVoterReplicas
// WIP(tbg): pass a slice of replicas that can be replaced in.
// In ReplaceDead, the it's the dead replicas, in ReplaceDecommissioning
s/the it's/it's/
pkg/storage/replicate_queue.go, line 423 at r5 (raw file):
ctx, repl, []roachpb.ReplicationChange{{Target: newReplica, ChangeType: roachpb.ADD_REPLICA}},
Can we not use roachpb.MakeReplicationChanges here and below?
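Assuming the helper builds the change slice from a change type plus targets, the call above would collapse to something like:
ctx, repl, roachpb.MakeReplicationChanges(roachpb.ADD_REPLICA, newReplica),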
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
pkg/storage/replicate_queue_test.go, line 140 at r6 (raw file):
// Query the range log to see if anything unexpected happened. Concretely,
// we'll make sure that our tracked ranges never had >=3 replicas.
> 3 replicas
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we need the .String()?
It was printed as %d for some reason.
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
We're removing one from existingReplicas and are adding one that is not in existingReplicas, hence they're not the same, so I think the comment is true -- but looking into this I saw that I'm reinventing the wheel:
cockroach/pkg/storage/allocator.go
Lines 708 to 715 in 8d8e8ff
removeReplica, removeDetails, err := a.simulateRemoveTarget(
	ctx,
	target.store.StoreID,
	zone,
	replicaCandidates,
	replicaCandidates,
	rangeUsageInfo,
)
I should return both the removal and addition targets from `RebalanceTarget`.
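That is, roughly the following shape (a sketch; names, ordering, and the argument list are assumptions, not the final signature):
addTarget, removeTarget, details, ok := a.RebalanceTarget( /* ... */ )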
Force-pushed from b6d5c9c to 6315bd3
Reviewed 5 of 5 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 4 of 4 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
We're removing one from existingReplicas and are adding one that is not in existingReplicas
Yeah, but where are we doing that?
Oh, unless the point is that existingReplicas still contains removeReplica when passed to RebalanceTarget, in which case the "not taking into account the replica we're going to remove" is easy to misinterpret as the opposite of what it's intending to say.
I should return both the removal and addition targets from `RebalanceTarget`.
You're doing that now?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
You're doing that now?
Yep, done
Reviewed 1 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/allocator_test.go, line 609 at r12 (raw file):
{
	var rangeUsageInfo RangeUsageInfo
	target, details, _, ok := a.RebalanceTarget(
Is details in the wrong position?
pkg/storage/allocator_test.go, line 855 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Same question.
pkg/storage/allocator_test.go, line 873 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Mind checking all of these?
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're doing that now?
Yep, done
We don't need to transfer the lease away from removeTarget?
pkg/storage/replicate_queue.go, line 697 at r12 (raw file):
// rebalance. Attempt to find a rebalancing target.
if !rq.store.TestingKnobs().DisableReplicaRebalancing {
	// Then, not taking into account the replica we're going to remove, see
This comment is strange now, since this is the only step.
Force-pushed from 36bbb63 to 8c66ec7
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't need to transfer the lease away from removeTarget?
Good catch, we do. Added.
Reviewed 2 of 2 files at r13.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 705 at r13 (raw file):
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
	log.VEventf(ctx, 1, "want to remove self, but failed to transfer lease away: %s", err)
} else if !done {
nit: I'd do this like:
if !ok {
...
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
...
} else if done {
// Lease is now elsewhere, so we're not in charge any more.
return false, nil
} else {
...
}
This gives you a place to add a comment.
Force-pushed from 8c66ec7 to 1121c4c
Down from 10s to the usual 3-4s for this type of test. Release note: None
Teach the allocator that there is a concept of "replacing" dead and decommissioning replicas. Add a stub handler for them to the replicate queue that does exactly what AllocatorAdd does, i.e., maintain the previous behavior. A follow-up commit will then teach the replicate queue to actually issue replica replacement operations both for these two new options as well as in AllocatorConsiderRebalance. When that is established, we should be in a place where the replicate queue never puts ranges into fragile intermediate configurations except that it will still transition through even configurations when upreplicating (which can be tackled separately). Release note: None
We'll need this to find replacements when rebalancing. Release note: None
This code repeats too much, and we'll need it in more places soon. Release note: None
Release note: None
Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled. Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).
RebalanceTarget was internally already picking out a replica that could be removed. This is now returned to the caller and actually removed during rebalancing. Release note: None
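Concretely, the rebalance path now hands both changes to a single ChangeReplicas invocation, roughly like this (a sketch with illustrative variable names, not the actual queue code):
chgs := []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: addTarget},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: removeTarget},
}
// Issued as one atomic ChangeReplicas call (when atomic membership changes
// are enabled), so the range never passes through an intermediate
// configuration that violates the zone constraints.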
Force-pushed from 1121c4c to f94a83e
TFTR!
bors r=nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon
In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes #39044
Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg
Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled.
Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note).

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten
The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section. This was responsible for 8.15% of mutex contention on a YCSB run. The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO.

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell
These tests perform various clock jumps, then reverse them. The reverse can cause a crash even if the original jump did not. Add some sleeps to make things more deterministic and improve the recovery process at the end of the test. Fixes #38723
Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss
The runTests helper will now cause a panic if a vectorized operator tries to access a part of the selection vector that is out of bounds. This identified bugs in the projection operator.
Release note: None

Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
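For #40300, the idea is the standard lock-free counter update; as a generic illustration (not the actual CockroachDB code, the field name is made up):
// Before: mu.Lock(); m.liveBytes += delta; mu.Unlock()
// After: the update itself is a single atomic add, so no critical section is needed.
atomic.AddInt64(&m.liveBytes, delta)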
Build succeeded
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg
As of #40284, the replicate queue was issuing swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once you flip atomic rebalancing on) that when the replication factor is one, there is no way to perform such an atomic swap because it will necessarily have to remove the leaseholder. To work around this restriction (which, by the way, we dislike - see #40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately since we requeue the range) the range will be over-replicated and hopefully the lease will be transferred over and then the original leaseholder removed. I would be very doubtful that this all works, but it is how things worked until #40284, so this PR really just falls back to the previous behavior in cases where we can't do better.
Release note: None
Co-authored-by: Tobias Schottdorf <[email protected]>
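A rough sketch of the fallback described above (the condition and variable names are illustrative, not the actual code):
chgs := []roachpb.ReplicationChange{
	{ChangeType: roachpb.ADD_REPLICA, Target: addTarget},
	{ChangeType: roachpb.REMOVE_REPLICA, Target: removeTarget},
}
if len(existingReplicas) == 1 {
	// With a single replica there is no way to swap atomically: the removal
	// would have to hit the leaseholder. Fall back to only adding; a later
	// queue pass transfers the lease and then removes the original replica.
	chgs = chgs[:1]
}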