kvserver: capacity change triggered gossip interacts poorly with local storepool estimates #104552
Labels
- A-kv-distribution: Relating to rebalancing and leasing.
- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- T-kv: KV Team
Describe the problem
In #104292 we observed allocation thrashing, where stores cyclically shifted leases/replicas around with no medium-term (>1m) improvement in load distribution. See screenshot below.
Whilst there were multiple contributors, we suspect capacity change triggered gossip to be the largest.
The two relevant mechanisms which are interacting poorly are:

1. Capacity change triggered gossip: when a store's lease or replica count changes by more than a threshold, the store eagerly gossips its capacity.
2. Local storepool estimates: when the store rebalancer transfers a lease or replica, it updates its local storepool with the estimated impact of that change so that subsequent decisions account for it.

The local estimates in (2) are blind writes, and they last until the next gossiped capacity for the store arrives, after which they are also blindly overwritten. The capacity gossiped in (1) will have stale information aside from the lease/replica count, meaning that when it arrives and overwrites (2), the storepool state won't include the estimated load (QPS, CPU) changes. This combination propagates stale information around the cluster, with which allocation decisions are made.
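To make the interaction concrete, here is a minimal sketch (not CockroachDB source) that models the two mechanisms with hypothetical types (`localStorePool`, `storeDetail`) and made-up numbers: the rebalancer's estimated impact is applied locally, then a gossiped capacity with a fresh lease count but stale QPS blindly overwrites it.

```go
// Sketch only: hypothetical types modeling the interaction described above; not
// the actual kvserver/storepool implementation.
package main

import "fmt"

// storeDetail is a simplified stand-in for one store's entry in the local storepool.
type storeDetail struct {
	Leases int
	QPS    float64
}

// localStorePool is this node's view of every store in the cluster.
type localStorePool struct {
	details map[int]*storeDetail
}

// applyEstimatedLeaseTransfer models mechanism (2): after deciding to move a
// lease, the rebalancer blindly writes its estimated impact into the local
// storepool so subsequent decisions in the same loop account for it.
func (sp *localStorePool) applyEstimatedLeaseTransfer(from, to int, qps float64) {
	sp.details[from].Leases--
	sp.details[from].QPS -= qps
	sp.details[to].Leases++
	sp.details[to].QPS += qps
}

// updateFromGossip models mechanism (1): a gossiped capacity arrives and blindly
// replaces the existing entry. The lease count is fresh (the change triggered the
// gossip), but the QPS was measured before the transfers and is stale, so the
// local estimates are lost.
func (sp *localStorePool) updateFromGossip(storeID, leases int, qps float64) {
	sp.details[storeID] = &storeDetail{Leases: leases, QPS: qps}
}

func main() {
	sp := &localStorePool{details: map[int]*storeDetail{
		1: {Leases: 100, QPS: 1000},
		2: {Leases: 60, QPS: 600},
	}}

	// The store rebalancer sheds 10 leases worth ~100 QPS from s1 to s2.
	for i := 0; i < 10; i++ {
		sp.applyEstimatedLeaseTransfer(1, 2, 10)
	}
	fmt.Printf("local estimate for s1:  %+v\n", *sp.details[1]) // {Leases:90 QPS:900}

	// The 10% lease-count change triggers capacity gossip, which carries the new
	// lease count but a QPS figure measured before the transfers.
	sp.updateFromGossip(1, 90, 1000)
	fmt.Printf("after gossip overwrite: %+v\n", *sp.details[1]) // {Leases:90 QPS:1000}: estimate lost
}
```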
Expected behavior
Thrashing does not occur as a result of a low range count with uniformly loaded ranges under moderate-to-high load.
Additional data / screenshots
Some important clarifications:
It is hard to create a scenario where this occurs organically. In practice, we haven't seen this occur outside of synthetic workloads, but it certainly could. The reason is that the delta required to trigger gossip in (1) is 5% relative and at least 5[^1]. So unless a store rebalances away 5% of its leases in under 10 seconds, it wouldn't trigger. In clusters with a large amount of data, the distribution of load per range is normally highly skewed, with 95% of replicas having minimal load, so they wouldn't be used as rebalance targets. In other words, once each store has at least 2560 leases and replicas, this cannot occur due to rebalancing alone, since the store rebalancer performs at most 128 actions per loop (128 / 0.05 = 2560).
This situation is most likely to occur when there are relatively few, uniformly loaded ranges. The store rebalancer will quickly shed more than 5% of its leases and trigger the capacity change gossip (potentially multiple times), which in turn overwrites the local estimated impact as described above.
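To make the trigger condition concrete, here is a minimal sketch of the 23.1-style check described in the footnote; `shouldGossipOnLeaseChange` is a hypothetical helper, not the actual kvserver code, and the exact comparison operators are assumptions.

```go
// Sketch (not the actual kvserver code): an illustration of the 23.1-style
// capacity-change gossip trigger, where the lease-count delta must exceed both
// 5% of the last gossiped count and an absolute minimum of 5.
package main

import "fmt"

// shouldGossipOnLeaseChange is a hypothetical helper mirroring the described rule.
func shouldGossipOnLeaseChange(lastGossiped, current int) bool {
	delta := current - lastGossiped
	if delta < 0 {
		delta = -delta
	}
	relative := float64(delta) / float64(lastGossiped)
	return relative > 0.05 && delta >= 5
}

func main() {
	// n12's case below: dropping from 122 to 81 leases is a ~34% change, so it triggers.
	fmt.Println(shouldGossipOnLeaseChange(122, 81)) // true

	// With >= 2560 leases, even the maximum of 128 store rebalancer actions per
	// loop is at most 128/2560 = 5%, so rebalancing alone no longer triggers it.
	fmt.Println(shouldGossipOnLeaseChange(2560, 2560-128)) // false (exactly 5%, not >5%)
}
```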
Using #104292 as an example:
The state of the world prior to the rebalance loop on `n12` @ 17:39:06 showed `n12` at roughly 2813 QPS with 122 leases (full storepool dump omitted here).

`n12` transfers somewhere around 40 leases, mostly from the rebalance loop. The lease transfers have an estimated impact of -1675 QPS on `n12`, meaning its resulting QPS after the rebalance loop should be 1138, down from 2813.
However, after the rebalance loop on `n12` @ 17:39:11 we see that the storepool QPS hasn't actually changed (2813.65 -> 2813.95), while the lease count has decreased corresponding to the transfers (122 -> 81).
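Plugging in the numbers above shows how far the post-gossip storepool state sits from the local estimate; the snippet below is just arithmetic on the values reported in #104292.

```go
// Compare n12's locally estimated QPS against the post-gossip storepool value,
// using the figures quoted above.
package main

import "fmt"

func main() {
	const (
		qpsBefore      = 2813.65 // n12 storepool QPS before the rebalance loop
		estimatedDelta = -1675.0 // estimated impact of the ~41 lease transfers
		qpsAfterGossip = 2813.95 // n12 storepool QPS observed after the loop
	)
	fmt.Printf("expected QPS after transfers: %.2f\n", qpsBefore+estimatedDelta) // ~1138.65
	fmt.Printf("observed QPS after gossip:    %.2f\n", qpsAfterGossip)           // 2813.95: estimate overwritten
	fmt.Printf("lost estimated load:          %.2f\n", qpsAfterGossip-(qpsBefore+estimatedDelta))
}
```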
I've focused on how a store's local view of its own capacity became incoherent, but this also affects other stores receiving the shed load. These recipients could also be rebalancing at the same time[^2], but regardless they would gossip on capacity changes if they received a sufficient number of the shed leases.
The delta between the before and after storepool state on `n12` demonstrates that other stores' state also became incoherent on `n12`, where they should show more load after the rebalancing. Note the lease deltas don't actually add up to be correct, since the starting state isn't coherent either:
```
# delta before vs after rebalance loop
avg ranges=0 leases=+3 qps=-6.28
----
 1: ranges=0 leases=0   qps= -0.60
 2: ranges=0 leases=0   qps=  0.00
 3: ranges=0 leases=0   qps= -0.01
 4: ranges=0 leases=0   qps=  0.00
 5: ranges=0 leases=0   qps=  0.00
 6: ranges=0 leases=0   qps= -0.01
 7: ranges=0 leases=0   qps=  0.00
 8: ranges=0 leases=0   qps=  0.00
 9: ranges=0 leases=+28 qps= -0.18
10: ranges=0 leases=-2  qps=-83.45
11: ranges=0 leases=0   qps=  0.00
12: ranges=0 leases=-41 qps= +0.30
13: ranges=0 leases=+59 qps= +1.61
```
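As a quick check of the claim that the lease deltas don't add up: lease transfers only move leases between stores (and the range counts above didn't change), so a coherent before/after pair should sum to roughly zero across stores, but the dump above doesn't.

```go
// Sum the per-store lease deltas from the dump above (stores 1 through 13).
package main

import "fmt"

func main() {
	leaseDeltas := []int{0, 0, 0, 0, 0, 0, 0, 0, +28, -2, 0, -41, +59}
	sum := 0
	for _, d := range leaseDeltas {
		sum += d
	}
	fmt.Println(sum) // 44, not 0
}
```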
The summary of this problem is that the quality of information used in allocation decisions worsens, resulting in bad decisions which in turn spur on more bad information. This is similar to setting the gossip delay to 20s, as an example simulator test does (cockroach/pkg/kv/kvserver/asim/tests/testdata/example_rebalancing, lines 91 to 106 at 618a893).
Environment:
Additional context
This reduced max throughput by up to 15%, as resources were wasted on rebalancing and stores were sometimes transiently overloaded.
Jira issue: CRDB-28598
Footnotes
[^1]: In 23.1 the delta must exceed both 5% and a minimum of 5. In versions before 23.1, the absolute number of changes is used rather than a delta: the number of changes (lease/replica +/-) must exceed 1% for leases and 1% or 3 for replicas.
[^2]: However, it's unlikely that many other transfers interleaved here, as `n12`'s store rebalancer loop completes within 1s @ 17:39:07; `n12`'s view could only be verified as of 17:39:11.