kvserver: capacity change triggered gossip interacts poorly with local storepool estimates #104552

Open
kvoli opened this issue Jun 7, 2023 · 0 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team


kvoli commented Jun 7, 2023

Describe the problem

In #104292 we observed allocation thrashing, where stores cyclically shifted leases and replicas around with no medium-term (>1m) improvement in load distribution. See the screenshot below.

[screenshot: leases and QPS cyclically shifting between stores]

While there were multiple contributors, we suspect capacity-change-triggered gossip was the largest.

The two relevant mechanisms which are interacting poorly are:

  1. When an allocation change is applied, the store which initiated the change updates its local storepool to reflect the estimated impact of the change. The local store overwrites the existing store descriptor of each affected store to include the estimate.
  2. When a store adds or removes either a lease or a replica (a capacity change), it checks whether the delta between the last gossiped lease/replica count and the current count exceeds some threshold; if so, it gossips the last (cached) capacity it has, with only the lease/replica count updated.

The writes in (1) are blind: the estimates last until the next gossiped capacity for that store arrives, at which point they too are blindly overwritten. The capacity gossiped in (2) is stale in every field other than the lease/replica count, so when it arrives and overwrites (1), the storepool state loses the estimated load (QPS, CPU) changes. This combination propagates stale information around the cluster, and allocation decisions are made with it.
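
To make this interaction concrete, below is a minimal Go sketch of the two blind writes. The types, fields, and function names are illustrative stand-ins, not CockroachDB's actual storepool API.

package main

import "fmt"

// capacity is an illustrative stand-in for a gossiped store capacity.
type capacity struct {
    Leases int
    QPS    float64
}

// storepool maps store ID -> the last known capacity for that store.
type storepool map[int]capacity

// applyEstimatedImpact models mechanism (1): after initiating a change, the
// local store blindly overwrites its cached descriptor for the affected
// store with the estimated post-change capacity.
func (sp storepool) applyEstimatedImpact(storeID, leaseDelta int, qpsDelta float64) {
    c := sp[storeID]
    c.Leases += leaseDelta
    c.QPS += qpsDelta
    sp[storeID] = c // blind write: no merge with concurrently gossiped state
}

// onGossipedCapacity models mechanism (2): a capacity-change-triggered gossip
// carries a fresh lease count but an otherwise stale (cached) capacity, and
// it too blindly overwrites whatever estimate is in the storepool.
func (sp storepool) onGossipedCapacity(storeID int, gossiped capacity) {
    sp[storeID] = gossiped
}

func main() {
    sp := storepool{12: {Leases: 122, QPS: 2813.65}}

    // (1) Local estimate after shedding roughly 40 leases worth of load.
    sp.applyEstimatedImpact(12, -41, -1675.44)
    fmt.Printf("estimate:     %+v\n", sp[12]) // leases 81, QPS ~1138 (the est_qps)

    // (2) The capacity-change gossip arrives: the lease count is current, but
    // the QPS is the stale cached value from before the transfers.
    sp.onGossipedCapacity(12, capacity{Leases: 81, QPS: 2813.95})
    fmt.Printf("after gossip: %+v\n", sp[12]) // leases 81, QPS back to ~2814; the estimate is lost
}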

Expected behavior

Thrashing does not occur when a cluster has a low range count and uniformly loaded ranges under moderate-to-high load.

Additional data / screenshots

Some important clarifications:

It is hard to create a scenario where this occurs organically. In practice we haven't seen it outside of synthetic workloads, though it certainly could happen. The reason is that the delta required to trigger gossip in (2) is 5% relative and at least 5¹, so unless a store rebalances more than 5% of its leases in under 10 seconds, the trigger won't fire. In clusters with a large amount of data, the per-range load distribution is usually highly skewed, with ~95% of replicas carrying minimal load, so those replicas wouldn't be used as rebalance targets. Put differently, once each store holds at least 2560 leases and replicas, rebalancing alone cannot trigger this, since the store rebalancer performs at most 128 actions per loop and 128 is 5% of 2560.
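
For reference, here is a hedged sketch of the 23.1-style trigger check; the real logic lives in kvserver, and the constants below simply mirror the thresholds described above and in footnote 1.

package main

import "fmt"

// shouldGossipOnCapacityChange approximates the trigger: gossip fires when the
// lease (or replica) count has drifted from the last gossiped value by more
// than 5% relative and by at least 5 in absolute terms. Illustrative only.
func shouldGossipOnCapacityChange(lastGossiped, current int) bool {
    delta := current - lastGossiped
    if delta < 0 {
        delta = -delta
    }
    return float64(delta) > 0.05*float64(lastGossiped) && delta >= 5
}

func main() {
    // The store rebalancer performs at most 128 actions per loop. With 2560
    // leases, shedding all 128 is exactly 5% and does not exceed the bound.
    fmt.Println(shouldGossipOnCapacityChange(2560, 2560-128)) // false
    // With few, hot leases (as in the scenario below), a handful of transfers
    // is enough to fire the trigger.
    fmt.Println(shouldGossipOnCapacityChange(122, 122-7)) // true
}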

This situation is most likely to occur when there are relatively few, uniformly loaded ranges. The store rebalancer will quickly shed more than 5% of a store's leases and trigger the capacity-change gossip (potentially multiple times), which in turn overwrites the local estimated impact as described above.

Using #104292 as an example:

The state of the world prior to the rebalance loop on n12 @ 17:39:06

avg ranges=224  leases=45  qps=915.90
----
1:  ranges=211  leases=8   qps=10.73
2:  ranges=211  leases=8   qps=0.07
3:  ranges=212  leases=9   qps=3.99
4:  ranges=228  leases=3   qps=6.00
5:  ranges=228  leases=4   qps=16.66
6:  ranges=228  leases=6   qps=7.73
7:  ranges=227  leases=5   qps=6.02
8:  ranges=229  leases=7   qps=6.11
9:  ranges=229  leases=84  qps=1731.13
10: ranges=218  leases=130 qps=3465.22
11: ranges=231  leases=122 qps=2588.82
12: ranges=232  leases=122 qps=2813.65
13: ranges=236  leases=79  qps=1250.63

n12 transfers roughly 40 leases, mostly from the rebalance loop. The lease transfers have an estimated impact of -1675 QPS on n12, meaning its QPS after the rebalance loop should be about 1138, down from 2813.

First few transfers
17:39:06  12 -> 13 qps=48.79
17:39:06  12 -> 13 qps=37.78
17:39:06  12 -> 13 qps=37.64
17:39:06  12 ->  9 qps=37.38
17:39:06  12 ->  9 qps=37.27
17:39:06  12 ->  9 qps=36.97
17:39:06  12 -> 13 qps=36.78
... (more lease transfers follow)
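
As a rough worked example (hypothetical code that just redoes the arithmetic from the log above), the initiating store builds its estimate by subtracting each transferred lease's QPS from its cached QPS:

package main

import "fmt"

func main() {
    // QPS of the first few lease transfers shown above; the full set of ~40
    // transfers sums to roughly 1675 QPS.
    transferQPS := []float64{48.79, 37.78, 37.64, 37.38, 37.27, 36.97, 36.78}

    est := 2813.65 // n12's storepool QPS before the rebalance loop
    for _, q := range transferQPS {
        est -= q // each transfer lowers n12's local estimate of its own QPS
    }
    fmt.Printf("after %d transfers: est=%.2f\n", len(transferQPS), est)
    // With all ~40 transfers applied, the estimate lands near 1138 QPS,
    // matching the est_qps shown for s12 below, while the value that ends up
    // gossiped stays at ~2813 QPS.
}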

However, after the rebalance loop on n12 @ 17:39:11, we see that the storepool QPS for s12 hasn't actually changed (2813.65 -> 2813.95), while the lease count has decreased in line with the transfers (122 -> 81).

avg ranges=224 leases=48  qps=909.61
----
1:  ranges=211 leases=8   qps=10.67
2:  ranges=211 leases=8   qps=0.07
3:  ranges=212 leases=9   qps=3.98
4:  ranges=228 leases=3   qps=6.00
5:  ranges=228 leases=4   qps=16.66
6:  ranges=228 leases=6   qps=7.72
7:  ranges=227 leases=5   qps=6.02
8:  ranges=229 leases=7   qps=6.11
9:  ranges=229 leases=112 qps=1730.95
10: ranges=218 leases=128 qps=3381.77
11: ranges=231 leases=122 qps=2588.82
12: ranges=232 leases=81  qps=2813.95 est_qps=1138.21
13: ranges=236 leases=138 qps=1252.24

I've focused on how a store's local view of its own capacity becomes incoherent, but this also affects the other stores receiving the shed load. These recipients could also be rebalancing at the same time², but regardless, they would gossip on capacity changes if they received a sufficient number of the shed leases.

The delta between the before and after storepool state on n12 shows that other stores' state also became incoherent on n12: they should show more load after the rebalancing, yet their QPS barely moves. Note the lease deltas don't add up correctly either, since the starting state wasn't coherent to begin with.

# delta before vs after rebalance loop
avg ranges=0   leases=+3  qps=-6.28
----
1:  ranges=0   leases=0   qps=-0.60
2:  ranges=0   leases=0   qps= 0.00
3:  ranges=0   leases=0   qps=-0.01
4:  ranges=0   leases=0   qps= 0.00
5:  ranges=0   leases=0   qps= 0.00
6:  ranges=0   leases=0   qps=-0.01
7:  ranges=0   leases=0   qps= 0.00
8:  ranges=0   leases=0   qps= 0.00
9:  ranges=0   leases=+28 qps=-0.18
10: ranges=0   leases=-2  qps=-83.45
11: ranges=0   leases=0   qps= 0.00
12: ranges=0   leases=-41 qps=+0.30
13: ranges=0   leases=+59 qps=+1.61

In summary, the quality of the information used in allocation decisions worsens, resulting in bad decisions, which in turn generate more bad information. The effect is similar to setting the gossip delay to 20s, as an example simulator test does.

[ASCII chart: per-store QPS over time from the simulator example with delayed gossip; y-axis 0-7000 QPS, showing sustained thrashing.]

Environment:

  • CockroachDB version 22.2.7

Additional context

The thrashing reduced max throughput by up to 15%, as resources were wasted on rebalancing and stores were sometimes transiently overloaded.


Jira issue: CRDB-28598

Footnotes

  1. In 23.1, the delta must exceed both 5% (relative) and a minimum of 5 (absolute). In versions before 23.1, the absolute number of changes is used rather than a delta: the number of changes (lease/replica +/-) must exceed 1% for leases, and 1% or 3 for replicas.

  2. However, it's unlikely that many other transfers interleaved here, as n12's store rebalancer loop completes within 1s @ 17:39:07; n12's view could only be verified as of 17:39:11.

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. labels Jun 7, 2023
@kvoli kvoli self-assigned this Jun 7, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 7, 2023
@kvoli kvoli removed their assignment Aug 3, 2023