-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: allocation thrashing prevention and observability #104292
Comments
Are all of 9, 10, 11, 12, 13 voters? Or is this a description of the nodes in the cluster and not necessarily who currently has the replicas? I didn't understand this equivalence class computation. I would have expected the constraint and lease preference to be trivially satisfied since all the voters are in the expected region, and the only thing we would be trying to improve is the diversity score (if we happened to have more than 1 replica in a zone). What thrashing happened in this case? |
No. The setup is 5 voters, with a lease preference on
This is for diversity score only. The equivalence class is based on equivalent diversity score. It helps explain the movements in the snippets for replica rebalancing. If there are only 2 voters in The thrashing we saw looked like this (should have included in issue): Note near the end, rebalancing was disabled which is why the series converged. |
The primary thrashing cause afaict was capacity change triggered gossip updates overwriting local change estimates. #93532 is similar to the sequencing in allocator2 and would be helpful here. |
This is a tracking issue for undesirable behavior found in testing. Many of these issues have been filed and made better already. The purpose is to consolidate discussion for future reference.
We are developing significant changes to the allocation system in #103320. This issue references existing concepts which don't all translate to the changed system; the principles still apply however not all solutions do.
Thrashing
Allocation thrashing is where the cluster continuously reconfigures range replicas and leases without materially improving the long term distribution.
Thrashing is always possible and is usually the result of trading off stability for reactiveness. Here's a list of known thrashing causes in CRDB, which could be actionable or have had recent development to address. The issues came from a specific instance of thrashing in a multi-region CRDB cluster running a uniform workload on
v22.2.9
.Details on cluster configuration
Localities
There is a lease preference set on
us-east-1
and a constraint that 2 voters are required in the region. There are 3 zones inus-east-1
:Given the state of a range's voters, the equivalence classes for nodes:
Allocator storepool / store rebalancer see different states of the cluster.
Details
Bad transfer from perspective of store rebalancer, s9 has higher load than s12.
The store pool is used for most of the candidate selection, it must have had a different view. This can be shown from reconstructing logging on a node (
n12
).qps
is the storepool queries-per-second,sr_qps
is the storepools view.avg ranges=224 leases=48 qps=909.61 ---- 1: ranges=211 leases=8 qps=10.67 b4_qps=10.73 2: ranges=211 leases=8 qps=0.07 b4_qps=0.07 3: ranges=212 leases=9 qps=3.98 b4_qps=3.99 4: ranges=228 leases=3 qps=6.00 b4_qps=6.00 5: ranges=228 leases=4 qps=16.66 b4_qps=16.66 6: ranges=228 leases=6 qps=7.72 b4_qps=7.73 7: ranges=227 leases=5 qps=6.02 b4_qps=6.02 8: ranges=229 leases=7 qps=6.11 b4_qps=6.11 9: ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04 10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22 11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82 12: ranges=232 leases=81 qps=2813.95 b4_qps=2813.65 sr_qps=1138.21* 13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43
There is too much obviously useless rebalancing, which has greater cost than benefit.
Details
Here's a summary of replica rebalances performed by the store rebalancer during a rebalance pass. Note the duration between changes. The goal is QPS convergence, however without also transferring a lease, the action isn't that useful - especially when the replica only has <10 QPS.
Lease transfers use the cluster domain (mean) for creating thresholds, lease stores were over-eager to transfer a lease as a result.
Too many actions occur in a short period for how confident we are the actions would be judged as good in near future.
There's nothing stopping a store from shedding 128 leases during the rebalance loop, in the matter of a few seconds or less. This exacerbates the gossip capacity change issue mentioned below. Also, it doesn't seem like a good idea to be as aggressive as this unless our information is similarly accurate on the same timescale. There's a 5s period before a store will account a replica's load - so other stores wouldn't be aware of the capacity change until after then. The source store would know, however it was likely overwritten (see capacity change triggered gossip mentions).
Related to the above point, lease and replica change triggered gossip will overwrite QPS estimates. Leading to a more consistent but incorrect view of the world across storepools in the cluster.
Details
The state of the world post rebalance loop on
n12
@17:39:11
:avg ranges=224 leases=48 qps=909.61 ---- 1: ranges=211 leases=8 qps=10.67 b4_qps=10.73 2: ranges=211 leases=8 qps=0.07 b4_qps=0.07 3: ranges=212 leases=9 qps=3.98 b4_qps=3.99 4: ranges=228 leases=3 qps=6.00 b4_qps=6.00 5: ranges=228 leases=4 qps=16.66 b4_qps=16.66 6: ranges=228 leases=6 qps=7.72 b4_qps=7.73 7: ranges=227 leases=5 qps=6.02 b4_qps=6.02 8: ranges=229 leases=7 qps=6.11 b4_qps=6.11 9: ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04 10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22 11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82 12: ranges=232 leases=81 qps=2813.95 b4_qps=2813.65 sr_qps=1138.21* 13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43
It seems likely that
n12
got store descriptor updates between17:39:{06-11}
which were prior to the impact of rebalancing being realized (>5s + gossip min). The update blind writes and would overwrite any previous estimates.The stores gossip on lease/qps changes.
What is odd, is that the lease count changes are reflected, but not the QPS changes. When we eagerly gossip on range/lease count changes, we use a cached descriptor - with the updated lease/range count applied. No other change would be applied. This could lead to the inconsistent state between leases/qps seen below. The effect is also symmetric, both the source and target store would have any estimate overwritten with an identical QPS to prior to the
transfer but an updated lease count.
The state of the world on
n9
@17:39:11
for reference:Note how it is fairly consistent with
n12
's storepool state but not the store rebalancer onn12
.There's no concept of cluster level, per-store perfect number based on constraints and availability.
i.e. Synthesize all the constraints, localities (inc diversity) and load into a target number for each store. The number represents what the load should be on that store if the placement were optimal.
Observability
It is tedious to reconstruct a node's view of the world and recent "impact" it expects from its changes
There's no easy thrashing or stability observability
Jira issue: CRDB-28445
The text was updated successfully, but these errors were encountered: