storage: make lease rebalancing decisions at the store level #28340
Conversation
Tested primarily via tpc-c on roachprod and the new roachtest. I should give allocsim a look too, but am going to start integrating replica-rebalancing decisions as a follow-up.
Force-pushed from 05241f9 to b3cbe23.
BramGruneir (Bram Gruneir) reviewed:
But I worry about the test being appropriate with KV.
Reviewed 3 of 3 files at r1, 8 of 8 files at r2, 3 of 3 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/rebalance_load.go, line 16 at r3 (raw file):
// for names of contributors.

package main
So this is still checking with kv to make sure the leases are evenly distributed. But the workload is evenly distributed.
Perhaps KV is not the right load for this; instead, use something that stresses some parts more than others, and check whether QPS is evenly distributed. (This might be tough without load-based rebalancing.)
pkg/storage/allocator.go, line 1037 at r2 (raw file):
// rebalancing is enabled? In happy cases it's nice to keep this working
// to even out the number of leases in addition to the number of replicas,
// but it's certainly a blunt instrument that could undo what we want.
I think this might be a good idea. But of course, requires a lot of testing.
pkg/storage/replicate_queue.go, line 563 at r2 (raw file):
	return false, nil
}
if err := rq.transferLease(ctx, repl, target); err != nil {
I think you can simplify this:

err := rq.transferLease(ctx, repl, target)
return err == nil, err

And reverse the if case above from `target != (roachpb.ReplicaDescriptor{})` to `target == (roachpb.ReplicaDescriptor{})` and early-exit, to avoid nesting.
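A minimal sketch of what that suggested tail could look like (hypothetical rendering; `rq`, `repl`, and `target` come from the surrounding function in `replicate_queue.go`):

```go
// Invert the check and return early so the happy path isn't nested.
if target == (roachpb.ReplicaDescriptor{}) {
	// No suitable transfer target was found.
	return false, nil
}
err := rq.transferLease(ctx, repl, target)
// Report success only if the transfer actually went through.
return err == nil, err
```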
pkg/storage/store_rebalancer.go, line 129 at r2 (raw file):
for localDesc.Capacity.QueriesPerSecond > qpsMaxThreshold {
	log.Infof(ctx, "considering load-based lease transfers for s%d with %.2f qps (mean=%.2f, upperThreshold=%2.f)",
Are these debug messages? Need to be VEvent 3, I think.
pkg/storage/store_rebalancer.go, line 152 at r2 (raw file):
// TODO(a-robinson): Should we take the number of leases on each store into // account here or just continue to let that happen in allocator.go?
How much do lease counts matter if we're balancing on qps anyway?
pkg/storage/store_rebalancer.go, line 187 at r2 (raw file):
}

// TODO: Don't bother moving leases whose QPS is below some fraction of the
nit: format the todo with your name. If we still rebalance leases based on counts, this is needed. If we turn that off, then this is fine.
Force-pushed from b3cbe23 to f1e39e5.
a-robinson (Alex Robinson) commented:
TFTR!
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/cmd/roachtest/rebalance_load.go, line 16 at r3 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
So this is still checking with kv to make sure the leases are evenly distributed. But the workload is evenly distributed.
Perhaps KV is not the right load for this; instead, use something that stresses some parts more than others, and check whether QPS is evenly distributed. (This might be tough without load-based rebalancing.)
Not quite -- this is checking that the 3 leases for the table are evenly distributed, not that all leases are evenly distributed. This was a very simple test that passes with this change and fails most of the time without it, because the total number of ranges in a new cluster greatly exceeds the number of ranges (3) in the table used by `kv`. Without load-based rebalancing, the leases for the `kv` table will typically all just stay on the node that had the initial range lease before the `ALTER TABLE ... SPLIT` command, since the other nodes have enough leases for other various ranges that the cluster is still considered balanced.

I agree that you could get slightly different test coverage with a couple other workloads, but in general all the tests I can think of would be testing that a minority of hot ranges get balanced properly, which this one does as well, just in a very simple way.
pkg/storage/allocator.go, line 1037 at r2 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
I think this might be a good idea. But of course, requires a lot of testing.
Yeah, I don't know that we have the time to fully remove this right now. We'll have to try out a couple antagonistic workloads before sticking with this, though.
pkg/storage/replicate_queue.go, line 563 at r2 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
I think you can simplify this.
err := rq.transferLease(ctx, repl, target)
return err == nil, err

And reverse the if case above from `target != (roachpb.ReplicaDescriptor{})` to `target == (roachpb.ReplicaDescriptor{})` and early-exit, to avoid nesting.
Done.
pkg/storage/store_rebalancer.go, line 129 at r2 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
Are these debug messages? Need to be VEvent 3, I think.
Sorry, I meant to have this outside the loop. Fixed.
I do think it'll be nice to actually log this at info level, since it'll only happen a max of once a minute and even then only on heavily loaded clusters with a significant qps imbalance.
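A sketch of the fix (hypothetical; `qpsMean` is assumed to be in scope, and the `%2.f` verb from the quoted snippet is corrected to `%.2f`):

```go
// Summary line hoisted out of the transfer loop, kept at info level since
// it fires at most about once a minute, and only on imbalanced clusters.
log.Infof(ctx, "considering load-based lease transfers for s%d with %.2f qps (mean=%.2f, upperThreshold=%.2f)",
	localDesc.StoreID, localDesc.Capacity.QueriesPerSecond, qpsMean, qpsMaxThreshold)
for localDesc.Capacity.QueriesPerSecond > qpsMaxThreshold {
	// ... consider transferring individual leases ...
}
```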
pkg/storage/store_rebalancer.go, line 152 at r2 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
How much do lease counts matter if we're balancing on qps anyway?
There's still a (small) fixed cost per lease for things like running the replica through all the various queues. It makes the time it takes to drain a node or recover from a node crash more predictable. Also, it may seem silly, but it looks a lot better to new users if the number of leases on each node is roughly even.
It seems desirable to keep lease count roughly even. I'd be happy to hear counterarguments, though, since that would make things simpler. Maybe simplicity is reason enough.
pkg/storage/store_rebalancer.go, line 187 at r2 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
nit: format the todo with your name. If we still rebalance leases based on counts, this is needed. If we turn that off, then this is fine.
Thanks for calling this out. I create TODOs without a name when I want to make sure they're resolved before merging the code.
Force-pushed from f1e39e5 to a082410.
BramGruneir (Bram Gruneir) reviewed:
This is great, and a big step forward.
Reviewed 11 of 11 files at r4, 9 of 9 files at r5, 3 of 3 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/rebalance_load.go, line 16 at r3 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Not quite -- this is checking that the 3 leases for the table are evenly distributed, not that all leases are evenly distributed. This was a very simple test that passes with this change and fails most of the time without it, because the total number of ranges in a new cluster greatly exceeds the number of ranges (3) in the table used by `kv`. Without load-based rebalancing, the leases for the `kv` table will typically all just stay on the node that had the initial range lease before the `ALTER TABLE ... SPLIT` command, since the other nodes have enough leases for other various ranges that the cluster is still considered balanced.

I agree that you could get slightly different test coverage with a couple other workloads, but in general all the tests I can think of would be testing that a minority of hot ranges get balanced properly, which this one does as well, just in a very simple way.
SGTM!
pkg/storage/store_rebalancer.go, line 129 at r2 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Sorry, I meant to have this outside the loop. Fixed.
I do think it'll be nice to actually log this at info level, since it'll only happen a max of once a minute and even then only on heavily loaded clusters with a significant qps imbalance.
SGTM
pkg/storage/store_rebalancer.go, line 152 at r2 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
There's still a (small) fixed cost per lease for things like running the replica through all the various queues. It makes the time it takes to drain a node or recover from a node crash more predictable. Also, it may seem silly, but it looks a lot better to new users if the number of leases on each node is roughly even.
It seems desirable to keep lease count roughly even. I'd be happy to hear counterarguments, though, since that would make things simpler. Maybe simplicity is reason enough.
Less movement of leases is better overall, which I think is a strong argument on its own.
Anyway, not blocking for this PR. But good for a follow-up.
pkg/storage/store_rebalancer.go, line 198 at r5 (raw file):
// just unnecessary churn with no benefit to move leases responsible for,
// for example, 1 qps on a store with 5000 qps.
const minQPSFraction = .001
Is it worth pulling this const up to highlight it outside of the function?
a-robinson (Alex Robinson) commented:
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/store_rebalancer.go, line 198 at r5 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
Is it worth pulling this const up to highlight it outside of the function?
It'd have to be given a more descriptive name if it was pulled outside, so I'd go with no.
cc @nvanbenschoten in case you wanted to take a look at this before submitting. Feel free to pass on doing so, but I wanted to give you a chance if you're interested.
nvanbenschoten (Nathan VanBenschoten) reviewed:
Nice work @a-robinson! I'm really excited that we're getting this in for 2.1.
Reviewed 2 of 3 files at r1, 11 of 11 files at r4, 9 of 9 files at r5, 3 of 3 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
docs/generated/settings/settings.html, line 21 at r5 (raw file):
<tr><td><code>kv.allocator.load_based_lease_rebalancing.enabled</code></td><td>boolean</td><td><code>true</code></td><td>set to enable rebalancing of range leases based on load and latency</td></tr>
<tr><td><code>kv.allocator.range_rebalance_threshold</code></td><td>float</td><td><code>0.05</code></td><td>minimum fraction away from the mean a store's range count can be before it is considered overfull or underfull</td></tr>
<tr><td><code>kv.allocator.stat_based_rebalancing.enabled</code></td><td>boolean</td><td><code>true</code></td><td>set to enable rebalancing of range replicas based on write load and disk usage</td></tr>
Did this get out of sync with the code?
pkg/cmd/roachtest/rebalance_load.go, line 16 at r3 (raw file):
Previously, BramGruneir (Bram Gruneir) wrote…
SGTM!
Had the same question as @BramGruneir. That reasoning checks out with me.
Probably worth describing what happens without `kv.allocator.stat_based_rebalancing.enabled=true` somewhere in the test.
pkg/storage/allocator.go, line 44 at r5 (raw file):
// baseLoadBasedLeaseRebalanceThreshold is the equivalent of
// leaseRebalanceThreshold for load-based lease rebalance decisions (i.e.
// "follow-the-workload"). Its the base threshold for decisions that gets
decisions that get adjusted
pkg/storage/allocator.go, line 1011 at r5 (raw file):
// TODO(a-robinson): Should this be changed to avoid even thinking about lease
// counts now that we try to spread leases and replicas based on QPS? As is it
// may fight back a little bit against store-level QPS--based rebalancing.
double "-"
pkg/storage/replica_rankings.go, line 62 at r4 (raw file):
rr.mu.Lock()
defer rr.mu.Unlock()
if len(rr.mu.byQPS) == 0 && rr.mu.accumulator.qps.Len() > 0 {
The interaction between `byQPS` and `rrAccumulator` in this function is subtle. It looked very strange to me until I realized why you gave it these semantics (so that all 128 top replicas are considered and we don't end up staring at the top few every time the StoreRebalancer runs, right?). I'd call that out somewhere. Also call out that because of this behavior we have to be careful about increasing `numTopReplicasToTrack`, or we could end up adding large delays between when the hottest few replicas are considered for rebalancing.
pkg/storage/replica_rankings.go, line 75 at r4 (raw file):
func (a *rrAccumulator) addReplica(repl replicaWithStats) {
	heap.Push(&a.qps, repl)
	if a.qps.Len() > numTopReplicasToTrack {
This is almost always going to be the case during steady-state operations, right? In that case, it seems like there's a natural fast-path that falls out where we compare against the head of the priority queue and only pop and push if the new replica will result in the head being evicted.
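A self-contained sketch of that fast path, simplified to bare QPS values rather than the real `replicaWithStats` (`numTopToTrack` stands in for `numTopReplicasToTrack`):

```go
package main

import (
	"container/heap"
	"fmt"
)

// qpsHeap is a min-heap of QPS values, so the head is always the smallest
// of the tracked top K -- i.e., the natural eviction candidate.
type qpsHeap []float64

func (h qpsHeap) Len() int            { return len(h) }
func (h qpsHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h qpsHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *qpsHeap) Push(x interface{}) { *h = append(*h, x.(float64)) }
func (h *qpsHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

const numTopToTrack = 128

// add keeps only the top numTopToTrack values. Fast path: once the heap
// is full, skip the push/pop entirely unless the new value would evict
// the current minimum.
func add(h *qpsHeap, qps float64) {
	if h.Len() >= numTopToTrack {
		if qps <= (*h)[0] {
			return // wouldn't make the top K; nothing to do
		}
		heap.Pop(h) // evict the current minimum first
	}
	heap.Push(h, qps)
}

func main() {
	var h qpsHeap
	for _, qps := range []float64{512, 64, 2048, 128, 1024} {
		add(&h, qps)
	}
	fmt.Printf("tracking %d replicas; coldest tracked qps: %.0f\n", h.Len(), h[0])
}
```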
pkg/storage/replica_rankings.go, line 80 at r4 (raw file):
}

type rrAccumulator struct {
I'd add a comment to this as well because it is used externally of the `replicaRankings` abstraction. That will help justify why we have this extra layer of structure instead of having `replicaRankings` hold an `rrPriorityQueue` directly.
pkg/storage/store_rebalancer.go, line 45 at r5 (raw file):
// compares to the load on other stores in the cluster and transferring leases
// or replicas away if the local store is overloaded.
type StoreRebalancer struct {
There's an interesting reason why this can't be a queue itself. Might be worth adding that to this comment.
pkg/storage/store_rebalancer.go, line 208 at r5 (raw file):
desc := replWithStats.repl.Desc()
log.VEventf(ctx, 3, "considering lease transfer for r%d with %.2f qps", desc.RangeID, replWithStats.qps)
// TODO(a-robinson): Should we sort these to first examine the stores with lower QPS?
Sorting or even just randomizing sounds like a good idea to me. Otherwise, won't there be a system-wide tendency to transfer leases to replicas at the front of their range descriptors? The comment on `RangeDescriptor.Replicas` says the order is arbitrary, but I'm not sure that that's true in practice.
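For the randomizing option, a sketch (assuming `desc.Replicas` is the range's slice of `roachpb.ReplicaDescriptor`s and `math/rand` is imported):

```go
// Copy the replica slice and shuffle the copy so candidate evaluation
// doesn't always start at the front of the range descriptor.
candidates := append([]roachpb.ReplicaDescriptor(nil), desc.Replicas...)
rand.Shuffle(len(candidates), func(i, j int) {
	candidates[i], candidates[j] = candidates[j], candidates[i]
})
for _, candidate := range candidates {
	// ... evaluate candidate as a lease-transfer target ...
}
```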
Force-pushed from a082410 to 87d8d74.
a-robinson (Alex Robinson) commented:
TFTR! I think the only question that may deserve more thought is the naming of the cluster settings. Using the inherited setting names is convenient, but we could do better if we don't mind throwing them out.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/generated/settings/settings.html, line 21 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did this get out of sync with the code?
Good catch, looks like I messed that up. Fixed, although I couldn't figure out how to get the file regenerated automatically.
pkg/cmd/roachtest/rebalance_load.go, line 16 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Had the same question as @BramGruneir. That reasoning checks out with me.
Probably worth describing what happens without `kv.allocator.stat_based_rebalancing.enabled=true` somewhere in the test.
Done.
pkg/storage/allocator.go, line 44 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
decisions that get adjusted
Done. After this settles down I'd like to rename all the "load-based lease rebalance" naming to reference follow-the-workload instead.
pkg/storage/allocator.go, line 1011 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
double "-"
Done.
pkg/storage/replica_rankings.go, line 62 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The interaction between `byQPS` and `rrAccumulator` in this function is subtle. It looked very strange to me until I realized why you gave it these semantics (so that all 128 top replicas are considered and we don't end up staring at the top few every time the StoreRebalancer runs, right?). I'd call that out somewhere. Also call out that because of this behavior we have to be careful about increasing `numTopReplicasToTrack`, or we could end up adding large delays between when the hottest few replicas are considered for rebalancing.
That's one reason, although given that the `storeRebalancer` continues until it either runs out of replicas or determines the store is balanced, I don't think it's a very important one.

The issue that was more on my mind was that we could be (and often will be, given that sufficient rebalancing activity triggers a re-gossiping of the store descriptor) loading a new accumulator while the `storeRebalancer` is partway through iterating over all the replicas, causing us to have to re-inspect a bunch of the same replicas again. I guess that's not a huge deal either, though, as long as it doesn't happen repeatedly.
It seems like maybe the consumer of the rankings should just pull out all the rankings at once (similar to how the producer puts all the rankings in at once via the accumulator).
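One hypothetical shape for that consumer-side change: a method that drains the freshest accumulator into a sorted snapshot and hands the whole thing back under a single lock acquisition (field and type names follow the quoted snippets; the drain logic itself is an assumption):

```go
// topQPS returns all tracked replicas, hottest first, in one call.
func (rr *replicaRankings) topQPS() []replicaWithStats {
	rr.mu.Lock()
	defer rr.mu.Unlock()
	// If the producer installed a fresh accumulator since the last read,
	// drain it into a slice sorted by descending QPS.
	if rr.mu.accumulator.qps.Len() > 0 {
		ranked := make([]replicaWithStats, rr.mu.accumulator.qps.Len())
		// Popping a min-heap yields ascending order, so fill back to front.
		for i := len(ranked) - 1; i >= 0; i-- {
			ranked[i] = heap.Pop(&rr.mu.accumulator.qps).(replicaWithStats)
		}
		rr.mu.byQPS = ranked
	}
	return rr.mu.byQPS
}
```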
pkg/storage/replica_rankings.go, line 75 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is almost always going to be the case during steady-state operations, right? In that case, it seems like there's a natural fast-path that falls out where we compare against the head of the priority queue and only pop and push if the new replica will result in the head being evicted.
Good point, done.
pkg/storage/replica_rankings.go, line 80 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'd add a comment to this as well because it is used externally of the `replicaRankings` abstraction. That will help justify why we have this extra layer of structure instead of having `replicaRankings` hold an `rrPriorityQueue` directly.
Done.
pkg/storage/store_rebalancer.go, line 45 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
There's an interesting reason why this can't be a queue itself. Might be worth adding that to this comment.
Done.
pkg/storage/store_rebalancer.go, line 208 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Sorting or even just randomizing sounds like a good idea to me. Otherwise, won't there be a system-wide tendency to transfer leases to replicas at the front of their range descriptors? The comment on `RangeDescriptor.Replicas` says the order is arbitrary, but I'm not sure that that's true in practice.
Done.
nvanbenschoten (Nathan VanBenschoten) reviewed:
> I think the only question that may deserve more thought is the naming of the cluster settings. Using the inherited setting names is convenient, but we could do better if we don't mind throwing them out.
I don't have any strong opinions about this. @BramGruneir do you?
Reviewed 1 of 9 files at r8, 1 of 3 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/allocator.go, line 44 at r5 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Done. After this settles down I'd like to rename all the "load-based lease rebalance" naming to reference follow-the-workload instead.
"follow-the-workload" SGTM.
pkg/storage/replica_rankings.go, line 62 at r4 (raw file):
> The issue that was more on my mind was that we could be (and often will be, given that sufficient rebalancing activity triggers a re-gossiping of the store descriptor) loading a new accumulator while the storeRebalancer is partway through iterating over all the replicas, causing us to have to re-inspect a bunch of the same replicas again. I guess that's not a huge deal either, though, as long as it doesn't happen repeatedly.

It's relatively cheap to inspect replicas that don't need to be rebalanced though, right?

> It seems like maybe the consumer of the rankings should just pull out all the rankings at once (similar to how the producer puts all the rankings in at once via the accumulator).

That would be easier to understand. This is kind of prescriptive at the moment, so I think it makes sense to push the specialized complexity to the consumer. Feel free to leave a TODO instead of having that slow down the merge.
This makes it easier to pick replicas whose leases should be transferred when a store is overloaded.

Release note: None
Force-pushed from 87d8d74 to 13d5263.
a-robinson (Alex Robinson) commented:
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_rankings.go, line 62 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> The issue that was more on my mind was that we could be (and often will be, given that sufficient rebalancing activity triggers a re-gossiping of the store descriptor) loading a new accumulator while the storeRebalancer is partway through iterating over all the replicas, causing us to have to re-inspect a bunch of the same replicas again. I guess that's not a huge deal either, though, as long as it doesn't happen repeatedly.

It's relatively cheap to inspect replicas that don't need to be rebalanced though, right?

> It seems like maybe the consumer of the rankings should just pull out all the rankings at once (similar to how the producer puts all the rankings in at once via the accumulator).

That would be easier to understand. This is kind of prescriptive at the moment, so I think it makes sense to push the specialized complexity to the consumer. Feel free to leave a TODO instead of having that slow down the merge.
Yeah, done.
In order to better balance the QPS being served by each store to avoid overloaded nodes.

Fixes cockroachdb#21419

Release note (performance improvement): Range leases will be automatically rebalanced throughout the cluster to even out the amount of QPS being handled by each node.
It consistently passes with store-level load-based lease rebalancing, but fails more often than not without it.

Release note: None
Force-pushed from 13d5263 to 094bf14.
I have no problem throwing out the old ones. But we have to be sure that we stop a cluster upgrade if someone is using an old one that's not set to the default. They most likely changed it for a reason.
Is there any precedent for that? Using the stats-based rebalancing feature has been pretty strongly discouraged. People may be using it in test clusters, but I'd be pretty surprised if it's being used in any real production environments. Also, I want to remove the old stats-based rebalancing code as part of #28852, so the old setting isn't gonna be very useful.
How do we know that nobody's using it? There's an option for it, and I'd bet people have toyed with it. I have no idea how we go about deprecating a cluster setting.
Reviewed 13 of 13 files at r10, 9 of 9 files at r11, 3 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
We've removed cluster settings before (https://github.com/cockroachdb/cockroach/blob/master/pkg/settings/registry.go#L36), we've just never done anything like stopping an upgrade due to them. I'd think doing the same as before (retiring the setting and including a release note) is sufficient.
SGTM
bors r+

I'll follow up with the setting changes.
28340: storage: make lease rebalancing decisions at the store level r=a-robinson a=a-robinson

In order to better balance the QPS being served by each store to avoid overloaded nodes.

Fixes #21419

Release note (performance improvement): Range leases will be automatically rebalanced throughout the cluster to even out the amount of QPS being handled by each node.

28848: opt: Fix rule cycle bug r=andy-kimball a=andy-kimball

The PushFilterIntoJoinLeftAndRight and TryDecorrelateSelect rules can cycle with one another in a rare case:

1. Right side of join has outer column due to being un-decorrelatable.
2. Filter conjunct is pushed down to both left and right side by mapping equivalencies in PushFilterIntoJoinLeftAndRight.
3. Left conjunct is pulled back up to join condition by TryDecorrelateSelect.

Steps 2 and 3 will cycle with one another. Cycle detection is not possible in this case, because the left side keeps changing (because a new conjunct is pushed down to it each time). The fix is simple: don't let PushFilterIntoJoinLeftAndRight push down filters if either the left or right side has outer column(s).

This fixes #28818.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Build succeeded
Follow-up to cockroachdb#28340, which did this for just leases.

Fixes cockroachdb#17979

Release note (performance improvement): Range replicas will be automatically rebalanced throughout the cluster to even out the amount of QPS being handled by each node.
28852: storage: make load-based replica rebalancing decisions at the store level r=a-robinson a=a-robinson

Built on top of #28340, which is where the first 3 commits are from.

This is still somewhat incomplete, in that it's missing unit tests and I'm only just now running tpc-c 10k on it. Sending out now to start the discussion of whether to include it in 2.1, since we're obviously very late in the intended development cycle.

Co-authored-by: Alex Robinson <[email protected]>
In order to better balance the QPS being served by each store to avoid
overloaded nodes.
Fixes #21419
Release note (performance improvement): Range leases will be
automatically rebalanced throughout the cluster to even out the amount
of QPS being handled by each node.