-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[prototype] storage: Make rebalance decisions at store-level #26608
[prototype] storage: Make rebalance decisions at store-level #26608
Conversation
Underlying the reasoning to change from the per-replica replicate queue to a new store-level construct is that the replicate queue doesn't have a great sense of how the replica it's looking at compares to the other replicas on the store. We have the 10th/25th/50th/75th/90th percentiles, but in practice those don't appear to be nearly as useful as being able to grab the top N replicas on the store and pick which of them should be moved. And even if the replicate queue could determine that the replica it was looking at was the hottest replica, or the Nth hottest replica, it's going to be important for it to also know whether the other hot replicas are able to be moved. If the top 17 replicas are movable, maybe we don't want to move the 18th hottest replicas. But if the top 17 replicas can't be moved due to zone constraints and leaseholder preferences, then we probably do want to move the 18th hottest replica. It's going to be a lot easier (and will require breaking way fewer abstractions) to do that from something operating at the store-level than at the replica-level. |
Those results look really nice! Did you try performing the experiment on a 30 node cluster as well? I'm sure this will be even more helpful there. One detail that pops out to me is that we're basing everything on qps without making a distinction between read and write qps. This is interesting because I'd expect transferring a lease for a range with high read-qps to be more impactful than transferring the lease for a range with high write-qps, which probably requires a full replica rebalance to alleviate an overloaded node. Do you think it makes sense to track these separately?
I don't necessarily think a full RFC is necessary for any changes here (especially as you're prototyping them), but it would probably be a good idea to eventually go back and update https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161026_leaseholder_rebalancing.md#future-directions with your new findings and plan. Review status: complete! 0 of 0 LGTMs obtained pkg/server/prototype_allocator.go, line 29 at r1 (raw file):
The prototype doesn't actually use this, right? Comments from Reviewable |
As outlined in recent comments on cockroachdb#26059, we need to bring back some form of stats-based rebalancing in order to perform well on TPC-C without manual partitioning and replica placement. This commit contains a prototype that demonstrates the effectiveness of changing our approach to making rebalancing decisions from making them in the replicate queue, which operates on arbitrarily ordered replicas of the ranges on a store, to making them at a higher-level. This prototype makes them at a cluster level by running the logic on only one node, but my real proposal is to make them at the store level. This change in abstraction reflects what a human would do if asked to even out the load on a cluster given perfect information about everything happening in the cluster: 1. First, determine which stores have the most load on them (or overfull -- but for the prototype I only considered the one dimension that affects TPC-C the most) 2. Decide whether the most loaded stores are so overloaded that action needs to be taken. 3. Examine the hottest replicas on the store (maybe not the absolute hottest in practice, since moving that one could disrupt user traffic, but in the prototype this seems to work fine) and attempt to move them to under-utilized stores. If this can be done simply by transferring leases to under-utilized stores, then do so. If moving leases isn't enough, then also rebalance replicas from the hottest store to under-utilized stores. 4. Repeat periodically to handle changes in load or cluster membership. In a real versino of this code, the plan is roughly: 1. Each store will independently run their own control loop like this that is only responsible for moving leases/replicas off itself, not off other stores. This avoids needing a centralized coordinator, and will avoid the need to use the raft debug endpoint as long as we start gossiping QPS per store info, since the store already has details about the replicas on itself. 2. The existing replicate queue will stop making decisions motivated by balance. It will switch to only making decisions based on constraints/diversity/lease preferences, which is still needed since the new store-level logic will only check for store-level balance, not that all replicas' constraints are properly met. 3. The new code will have to avoid violating constraints/diversity/lease preferences. 4. The new code should consider range count, disk fullness, and maybe writes per second as well. 5. In order to avoid making decisions based on bad data, I'd like to extend lease transfers to pass along QPS data to the new leaseholder and preemptive snapshots to pass along WPS data to the new replica. This may not be strictly necessary, as shown by the success of this prototype, but should make for more reliable decision making. I tested this out on TPC-C 5k on 15 nodes and am able to consistently get 94% efficiency, which is the max I've seen using a build of the workload generator that erroneously includes the ramp-up period in its final stats. The first run with this code only got 85% because it took a couple minutes to make all the lease transfers it wanted, but then all subsequent runs got the peak efficiency while making negligibly few lease transfers. Note that I didn't even have to implement replica rebalancing to get these results, which oddly contradicts my previous claims. However, I believe that's because I did the initial split/scatter using a binary containing cockroachdb#26438, so the replicas were already better scattered than by default. I ran TPC-C on that build without these changes a couple times, though, and didn't get better than 65% efficiency, so the scatter wasn't the cause of the good results here. Touches cockroachdb#26059, cockroachdb#17979 Release note: None
With this update, TPC-C 10k on 30 went from overloaded to running at peak efficiency over the course of about 4 hours (the manual partitioning approach takes many hours to move all the replicas as well, for a point of comparison). This is without having to run the replica scatter from cockroachdb#26438. Doing a 5 minute run to get a result that doesn't include all the rebalancing time shows: _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 290.9s 124799.1 97.0% 548.6 486.5 872.4 1140.9 2281.7 10200.5 I think it may have a small bug in it still, since at one point early on one of the replicas from the warehouse table on the node doing the relocating thought that it had 16-17k QPS, which wasn't true by any other metric in the system. Restarting the node fixed it though. I'm not too concerned about the bug, since I assume I just made a code mistake, not that anything about the approach fundamentally leads to a random SQL table replica gets 10s of thousands of QPS. Range 1 is also back to getting a ton of QPS (~3k) even though I raised the range cache size from 1M to 50M. Looking at slow query traces shows a lot of range lookups, way more than I'd expect given that ranges weren't moving around at the time of the traces. Release note: None
61805d0
to
853ac61
Compare
Nice!
Was the rate of rebalancing steady throughout this entire 4 hour period?
Did this eventually settle down? If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios. There might also be other issues lurking as well. Review status: complete! 0 of 0 LGTMs obtained Comments from Reviewable |
Not the entire time. About 2 hours were actually spent primarily doing lease transfers, which makes sense given the code but probably made the rebalancing period take longer than needed. Judging by the logs, that was because of the bug I mentioned where a replica on n1 had a ridiculously high reported QPS. Once that resolved itself, though, the rest of the time was a pretty steady stream of replica adds/removes.
Not that I ever saw. Agreed that it's worth getting to the bottom of. |
That's certainly part of it. But what isn't clear is why the non-meta descriptors are being evicted. For example, check out this partial excerpt of events on an arbitrarily chosen range on one of the nodes. We keep evicting it and re-adding it to the range cache even though it isn't changing at all:
|
Almost all of the evictions in my current tpc-c 10k cluster are due to context cancellations for |
Sorry, I forgot to include the relevant code in my last comment: cockroach/pkg/kv/dist_sender.go Lines 1033 to 1039 in 718f6bd
|
And now that I've added in a patch for that, I'm seeing some other questionable behavior causing a lot of range cache evictions. It appears as though whenever we get a Although it looks like the biggest issue is still our sending of Example logs from logspy grepping for r1:
|
When running TPC-C 10k on a 30 node cluster without partitioning, range 1 was receiving thousands of qps while all other ranges were receiving no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was context cancellations causing range descriptors to be evicted from the range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns being sent for transactions with no anchor key, accounting for thousands of QPS even after cockroachdb#26764 was fixed. This causes the same outcome as the old code without the load, because without this change we'd just send the request and get back a REASON_TXN_NOT_FOUND error, which would cause the function to return true. It's possible that we should instead avoid the heartbeat loop at all for transactions without a key, or that we should put in more effort to prevent such requests from even counting as transactions (a la cockroachdb#26741, which perhaps makes this change unnecessary?). Advice would be great. Release note: None
Sorry, I was mixing up I've sent out #26764 for the context cancellation problem and #26765 for all the |
SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on cockroachdb#26608. Release note: None
That sounds like a mistake. As the comment at The whole notion of range descriptor invalidation could use some reworking. Let's discuss that on #18988 instead of adding to this issue. |
SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on cockroachdb#26608. Release note: None
26764: kv: Don't treat context cancellation as a SendError r=a-robinson a=a-robinson SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on #26608. Release note: None --------------------------- `kv` is probably the package I'm least familiar with all the details of, so this may be a bad way of accomplishing this. If so, please let me know. If it's reasonable, I'll add a test that context cancellations don't affect the contents of the range cache. Co-authored-by: Alex Robinson <[email protected]>
As outlined in recent comments on #26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.
This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.
This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:
-- but for the prototype I only considered the one dimension that
affects TPC-C the most)
needs to be taken.
hottest in practice, since moving that one could disrupt user traffic,
but in the prototype this seems to work fine) and attempt to move them
to under-utilized stores. If this can be done simply by transferring
leases to under-utilized stores, then do so. If moving leases isn't
enough, then also rebalance replicas from the hottest store to
under-utilized stores.
In a real version of this code, the plan is roughly:
that is only responsible for moving leases/replicas off itself, not off
other stores. This avoids needing a centralized coordinator, and will
avoid the need to use the raft debug endpoint as long as we start
gossiping QPS per store info, since the store already has details about
the replicas on itself.
balance. It will switch to only making decisions based on
constraints/diversity/lease preferences, which is still needed since
the new store-level logic will only check for store-level balance,
not that all replicas' constraints are properly met.
preferences.
writes per second as well.
extend lease transfers to pass along QPS data to the new leaseholder
and preemptive snapshots to pass along WPS data to the new replica.
This may not be strictly necessary, as shown by the success of this
prototype, but should make for more reliable decision making.
I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.
Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing #26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.
Touches #26059, #17979
Release note: None
per-node
exec.success
metric over a few runs with this change, showing the per-node KV QPS converging over time:per-node goroutine counts over the same few runs, showing the overloaded outlier node getting healthier and converging to the other nodes:
I can write up a more formal RFC if folks think the write-up above (plus the success of this prototype) is insufficient as a proposal, but it's a good summary of my current plans.
@BramGruneir @nvanbenschoten @petermattis