
kvserver: allocation thrashing prevention and observability #104292

Open
kvoli opened this issue Jun 2, 2023 · 3 comments
Labels
A-kv-distribution (Relating to rebalancing and leasing), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-kv (KV Team)

Comments

@kvoli
Collaborator

kvoli commented Jun 2, 2023

This is a tracking issue for undesirable behavior found in testing. Many of these issues have already been filed and improved; the purpose here is to consolidate the discussion for future reference.

We are developing significant changes to the allocation system in #103320. This issue references existing concepts which don't all translate to the changed system; the principles still apply, but not all of the solutions do.

Thrashing

Allocation thrashing occurs when the cluster continuously reconfigures range replicas and leases without materially improving the long-term distribution.

Thrashing is always possible and is usually the result of trading off stability for reactiveness. Here's a list of known thrashing causes in CRDB; some are actionable, and some have seen recent development to address them. The issues came from a specific instance of thrashing in a multi-region CRDB cluster running a uniform workload on v22.2.9.

Details on cluster configuration

Localities

There is a lease preference set on us-east-1 and a constraint that 2 voters are required in the region. There are 3 zones in us-east-1:

us-east-1a: 9  10
us-east-1b: 11 12
us-east-1c: 13

Given the state of a range's voters, the equivalence classes for nodes:

equivalent other_voter
11|12|13   (9|10)
9|10|13    (11|12)
9|10|11|12 (13)
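
As a toy illustration (not the allocator's actual code), here is a Go sketch of how those classes fall out of the diversity score alone, restricted to the us-east-1 stores above; the zone map and scoring are simplifications of the real locality-diversity computation.

```go
package main

import "fmt"

// Zones of the us-east-1 stores from the example above.
var zoneOf = map[int]string{9: "a", 10: "a", 11: "b", 12: "b", 13: "c"}

// diversity counts distinct zones covered by a pair of voters; higher is
// better. This is a simplification of the allocator's diversity score.
func diversity(a, b int) int {
	if zoneOf[a] == zoneOf[b] {
		return 1
	}
	return 2
}

func main() {
	stores := []int{9, 10, 11, 12, 13}
	// Hold one in-region voter fixed ("other_voter" in the table above) and
	// collect the candidates that yield a maximally diverse pair; those
	// candidates are interchangeable from a diversity standpoint.
	for _, fixed := range []int{9, 11, 13} {
		var equivalent []int
		for _, cand := range stores {
			if cand != fixed && diversity(fixed, cand) == 2 {
				equivalent = append(equivalent, cand)
			}
		}
		fmt.Printf("other_voter=s%d equivalent=%v\n", fixed, equivalent)
	}
}
```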

The allocator's store pool and the store rebalancer see different states of the cluster.

Details

A bad transfer from the store rebalancer's perspective: s9 has higher load than s12.

transferring lease for r372 (qps=35.46) to store s9 (qps=2095.87) from local store s12 (qps=2036.49)

The store pool is used for most of the candidate selection, so it must have had a different view. This can be shown by reconstructing the logging on a node (n12): qps is the store pool's queries-per-second view, sr_qps is the store rebalancer's view.

avg ranges=224 leases=48  qps=909.61
----
1:  ranges=211 leases=8   qps=10.67   b4_qps=10.73
2:  ranges=211 leases=8   qps=0.07    b4_qps=0.07 
3:  ranges=212 leases=9   qps=3.98    b4_qps=3.99 
4:  ranges=228 leases=3   qps=6.00    b4_qps=6.00 
5:  ranges=228 leases=4   qps=16.66   b4_qps=16.66 
6:  ranges=228 leases=6   qps=7.72    b4_qps=7.73
7:  ranges=227 leases=5   qps=6.02    b4_qps=6.02
8:  ranges=229 leases=7   qps=6.11    b4_qps=6.11
9:  ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04
10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22
11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82
12: ranges=232 leases=81  qps=2813.95 b4_qps=2813.65 sr_qps=1138.21*
13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43

There is too much obviously useless rebalancing, which has greater cost than benefit.

Details

Here's a summary of replica rebalances performed by the store rebalancer during a rebalance pass. Note the duration between changes. The goal is QPS convergence, but without also transferring a lease the action isn't that useful, especially when the replica only has <10 QPS.

17:39:00  10 -> 9  qps=36.66 r=1101
17:39:14  10 -> 12 qps=36.53 r=527
17:39:27  10 -> 12 qps=36.44 r=1098
17:39:47  10 -> 12 qps=34.79 r=336
17:40:06  10 -> 9  qps=34.71 r=204
17:40:19  10 -> 9  qps=34.49 r=1089
17:40:37  10 -> 9  qps=7.64  r=440
# lease also transfers here from 13 -> 12
17:41:16  13 -> 10 qps=7.35  r=355
17:41:44  11 -> 10 qps=7.34  r=398

Lease transfers use the cluster domain (mean) to create thresholds; as a result, stores were over-eager to transfer leases.
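
A minimal sketch of why a cluster-wide mean is problematic here (the 10% threshold fraction is an assumption, not CockroachDB's exact constant): against the cluster mean, every store that satisfies the lease preference looks overfull, even though leases can only move between those stores.

```go
package main

import "fmt"

// thresholds derives under/overfull bounds from a mean, the way the lease
// rebalancing described above does from the cluster-wide mean. The 10%
// fraction is illustrative only.
func thresholds(meanQPS float64) (underfull, overfull float64) {
	const frac = 0.10
	return meanQPS * (1 - frac), meanQPS * (1 + frac)
}

func main() {
	// From the snapshot above: cluster-wide mean ~910 QPS, while the stores
	// satisfying the lease preference (s9-s13) each carry 1200-3400 QPS.
	_, over := thresholds(909.61)
	for _, s := range []struct {
		id  int
		qps float64
	}{{9, 1730.95}, {12, 2813.95}, {13, 1252.24}} {
		fmt.Printf("s%d qps=%.0f overfull=%v\n", s.id, s.qps, s.qps > over)
	}
}
```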

Too many actions occur in a short period relative to how confident we are that those actions will still look good in the near future.

There's nothing stopping a store from shedding 128 leases during the rebalance loop, in a matter of a few seconds or less. This exacerbates the gossip capacity-change issue mentioned below. It also doesn't seem like a good idea to be this aggressive unless our information is similarly accurate on the same timescale. There's a 5s period before a store will account for a replica's load, so other stores wouldn't be aware of the capacity change until after then. The source store would know, however its estimate was likely overwritten (see the capacity-change-triggered gossip points).

Related to the above point, lease- and replica-change-triggered gossip will overwrite QPS estimates, leading to a more consistent but incorrect view of the world across store pools in the cluster.

  • TBD: We could constrain the number of actions per period, or the % of load churn (a rough sketch follows this list).
  • 23.1+: Use delta rather than absolute # of capacity changes to trigger gossip. asim: rework gossip component #93945
  • 23.1+: The lease/range count change required also got bumped earlier in kvserver: gossip less aggressively on capacity +/- #93555.
  • TBD: Is this enough? It is now less likely but not impossible. We could apply the load delta to the cached capacity which is gossiped. See the point above: other stores wouldn't know about the receiver's +delta, only the sender's -delta.
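
A minimal sketch of the first bullet's idea, capping how much load a store may shed per period; the period length and budget fraction are illustrative assumptions, not proposed values.

```go
package main

import (
	"fmt"
	"time"
)

// churnLimiter caps the fraction of a store's load that lease/replica
// transfers may shed per period. Sketch only; not CockroachDB code.
type churnLimiter struct {
	period      time.Duration
	maxFraction float64 // max fraction of the store's QPS shed per period

	windowStart time.Time
	shedQPS     float64
}

// allow returns true if an action shedding actionQPS fits within this
// period's churn budget, accounting for it if so.
func (c *churnLimiter) allow(now time.Time, storeQPS, actionQPS float64) bool {
	if now.Sub(c.windowStart) > c.period {
		c.windowStart, c.shedQPS = now, 0
	}
	if c.shedQPS+actionQPS > c.maxFraction*storeQPS {
		return false // defer to a later rebalance pass
	}
	c.shedQPS += actionQPS
	return true
}

func main() {
	lim := &churnLimiter{period: time.Minute, maxFraction: 0.2}
	now := time.Now()
	// A store at ~2800 QPS may shed at most ~560 QPS of leases per minute,
	// rather than 128 leases in a few seconds.
	for i, qps := range []float64{300, 200, 150} {
		fmt.Printf("transfer %d (%.0f qps): allowed=%v\n", i, qps, lim.allow(now, 2813.95, qps))
	}
}
```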
Details

The state of the world post rebalance loop on n12 @ 17:39:11:

avg ranges=224 leases=48  qps=909.61
----
1:  ranges=211 leases=8   qps=10.67   b4_qps=10.73
2:  ranges=211 leases=8   qps=0.07    b4_qps=0.07 
3:  ranges=212 leases=9   qps=3.98    b4_qps=3.99 
4:  ranges=228 leases=3   qps=6.00    b4_qps=6.00 
5:  ranges=228 leases=4   qps=16.66   b4_qps=16.66 
6:  ranges=228 leases=6   qps=7.72    b4_qps=7.73
7:  ranges=227 leases=5   qps=6.02    b4_qps=6.02
8:  ranges=229 leases=7   qps=6.11    b4_qps=6.11
9:  ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04
10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22
11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82
12: ranges=232 leases=81  qps=2813.95 b4_qps=2813.65 sr_qps=1138.21*
13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43

It seems likely that n12 got store descriptor updates between 17:39:{06-11}, prior to the impact of the rebalancing being realized (>5s + gossip min). The update is a blind write and would overwrite any previous estimates.

The stores gossip on lease/qps changes.

  • The change source store knows it now has less load. This should be reflected locally.
  • The change target store doesn't know it now has more load, just that someone has less.

What is odd is that the lease count changes are reflected, but not the QPS changes. When we eagerly gossip on range/lease count changes, we use a cached descriptor with the updated lease/range count applied; no other change is applied. This could lead to the inconsistent state between leases and QPS seen below. The effect is also symmetric: both the source and target store would have any estimate overwritten with a QPS identical to before the transfer but an updated lease count.
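
A toy illustration of the asymmetry described above, with made-up field names rather than the actual capacity proto: the count-triggered gossip bumps the lease count on a cached copy and blind-writes it, whereas the alternative floated in the list further up would also fold the load delta in before gossiping.

```go
package main

import "fmt"

// capacity stands in for the gossiped store capacity (illustrative fields,
// not the actual proto).
type capacity struct {
	leases int
	qps    float64
}

func main() {
	// Cached descriptor on the source store before a lease transfer.
	cached := capacity{leases: 81, qps: 2813.95}
	transferredQPS := 35.46 // load carried by the transferred lease

	// What the eager, count-triggered gossip does per the analysis above:
	// bump the lease count on the cached copy and blind-write it, clobbering
	// any QPS estimate on every store pool that receives it.
	blind := cached
	blind.leases--
	fmt.Printf("gossiped today:      %+v\n", blind)

	// Alternative: also apply the load delta to the cached capacity, so the
	// gossiped QPS reflects the transfer that the lease count already does.
	withDelta := blind
	withDelta.qps -= transferredQPS
	fmt.Printf("gossiped with delta: %+v\n", withDelta)
}
```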

The state of the world on n9 @ 17:39:11 for reference:

avg ranges=224 leases=48  qps=906.93
----
1:  ranges=211 leases=8   qps=10.67
2:  ranges=211 leases=8   qps=0.07 
3:  ranges=212 leases=9   qps=3.98 
4:  ranges=228 leases=3   qps=6.00 
5:  ranges=228 leases=4   qps=16.66
6:  ranges=228 leases=6   qps=7.72 
7:  ranges=227 leases=5   qps=6.02 
8:  ranges=229 leases=7   qps=6.11 
9:  ranges=229 leases=111 qps=1724.67
10: ranges=217 leases=128 qps=3381.77
11: ranges=231 leases=122 qps=2560.35
12: ranges=232 leases=81  qps=2813.95
13: ranges=237 leases=138 qps=1252.24

Note how it is fairly consistent with n12's store pool state, but not with the store rebalancer's view on n12.

There's no concept of a cluster-level, per-store ideal number based on constraints and availability.

i.e., synthesize all the constraints, localities (including diversity), and load into a target number for each store. The number represents what the load on that store should be if the placement were optimal (a deliberately naive sketch follows the points below).

  • Likely too expensive to compute the optimal number on every input change or at decision time.
  • It isn't clear whether this is desirable either. There's more load than just replicas and leases in the cluster and these aren't always highly correlated. Users want predictable, explainable latency + throughput. Ignoring this load has been the strategy up to 23.1.
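
As referenced above, a deliberately naive Go sketch of such a target for this example: all leaseholder QPS has to land on the stores satisfying the lease preference, so the per-store target over those stores differs sharply from the cluster-wide mean. The constraint handling is hard-coded here; synthesizing arbitrary constraints, diversity, and load is the expensive part.

```go
package main

import "fmt"

func main() {
	// Store QPS from the snapshot above.
	qps := map[int]float64{
		1: 10.67, 2: 0.07, 3: 3.98, 4: 6.00, 5: 16.66, 6: 7.72, 7: 6.02, 8: 6.11,
		9: 1730.95, 10: 3381.77, 11: 2588.82, 12: 2813.95, 13: 1252.24,
	}
	// Stores satisfying the lease preference (hard-coded for the example).
	preferred := []int{9, 10, 11, 12, 13}

	var total float64
	for _, q := range qps {
		total += q
	}
	fmt.Printf("cluster-wide mean:          %.0f qps\n", total/float64(len(qps)))
	fmt.Printf("target per preferred store: %.0f qps\n", total/float64(len(preferred)))
}
```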

Observability

It is tedious to reconstruct a node's view of the world and the recent "impact" it expects from its changes.

  • Periodic logging of where a store sent replicas and leases (aggregated per period), broken down by reason and including the load deltas. Rebalancing/transfers appear useful; unsure about other reasons. A rough sketch follows.
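
The sketch below shows what such an aggregate could look like; the key shape and reason strings are hypothetical, not existing CRDB log fields.

```go
package main

import "fmt"

// transferKey identifies one bucket of the per-period summary suggested in
// the bullet above (hypothetical shape, not an existing CRDB structure).
type transferKey struct {
	target int    // destination store
	reason string // e.g. "load-rebalance", "lease-preference"
}

type bucket struct {
	count    int
	qpsDelta float64
}

type transferSummary map[transferKey]bucket

// record accumulates one transfer into the period's summary.
func (s transferSummary) record(target int, reason string, qps float64) {
	b := s[transferKey{target, reason}]
	b.count++
	b.qpsDelta += qps
	s[transferKey{target, reason}] = b
}

func main() {
	s := transferSummary{}
	// Transfers from the rebalance-pass log above, recorded as they happen.
	s.record(9, "load-rebalance", 36.66)
	s.record(12, "load-rebalance", 36.53)
	s.record(9, "load-rebalance", 34.71)
	// Emitted once per period instead of one log line per action.
	for k, v := range s {
		fmt.Printf("sent %d replicas to s%d (%s), qps delta %.1f\n",
			v.count, k.target, k.reason, v.qpsDelta)
	}
}
```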

There's no easy way to observe thrashing or stability.

Jira issue: CRDB-28445

@kvoli added the C-enhancement and A-kv-distribution labels Jun 2, 2023
@blathers-crl bot added the T-kv label Jun 2, 2023
@kvoli kvoli self-assigned this Jun 5, 2023
@sumeerbhola
Collaborator

Given the state of a range's voters, the equivalence classes for nodes:
equivalent other_voter
11|12|13 (9|10)
9|10|13 (11|12)
9|10|11|12 (13)

Are all of 9, 10, 11, 12, 13 voters? Or is this a description of the nodes in the cluster and not necessarily who currently has the replicas? I didn't understand this equivalence class computation. I would have expected the constraint and lease preference to be trivially satisfied since all the voters are in the expected region, and the only thing we would be trying to improve is the diversity score (if we happened to have more than 1 replica in a zone). What thrashing happened in this case?

@kvoli
Collaborator Author

kvoli commented Jun 5, 2023

Are all of 9, 10, 11, 12, 13 voters?

No. The setup is 5 voters, with a lease preference on us-east-1 (stores 9-13) and a constraint of 2 voters in us-east-1.

[...] and the only thing we would be trying to improve is the diversity score (if we happened to have more than 1 replica in a zone)

This is for diversity score only. The equivalence class is based on equivalent diversity score. It helps explain the movements in the snippets for replica rebalancing.

If there are only 2 voters in us-east-1, then for 11|12|13 (9|10), (9|10) is the voter not being touched, and 11|12|13 is the set containing the voter being replaced and its equally diverse alternatives.

The thrashing we saw looked like this (it should have been included in the issue):

[image: time series showing the thrashing]

Note that near the end rebalancing was disabled, which is why the series converged.

@kvoli
Collaborator Author

kvoli commented Jun 5, 2023

The primary thrashing cause, afaict, was capacity-change-triggered gossip updates overwriting local change estimates.

#93532 is similar to the sequencing in allocator2 and would be helpful here.
