kvserver: leases thrash on ycsb/b #93540
Comments
This was introduced in #91633, although the issue existed before, just not to the same extent. Gossip is triggered on capacity changes, including changes in lease count. When there are few enough ranges, this triggers on every lease transfer. After a lease transfer, the store_rebalancer also locally updates its own store descriptor and the target's with the QPS of the range. These two mechanisms race: the gossip occurs with stale values immediately following the transfer, and soon after the store_pool is also updated locally. This leads to a weird end state where the store_pool is inconsistent with respect to the actual load on the stores. This is visible with additional logging.
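To make the race concrete, here is a minimal toy model in Python. It is not the kvserver code: `StorePool`, `apply_gossip`, `apply_local_transfer`, and the QPS figures are all hypothetical, and it assumes that a gossiped descriptor overwrites the local entry while the local update shifts the transferred range's QPS as a delta. Under those assumptions, the final view depends on the order in which the two updates land.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorePool:
    """Toy view of per-store QPS, keyed by store ID (hypothetical model)."""
    qps: Dict[int, float] = field(default_factory=dict)

    def apply_gossip(self, store_id: int, gossiped_qps: float) -> None:
        # Assumption: a gossiped descriptor overwrites the local entry.
        self.qps[store_id] = gossiped_qps

    def apply_local_transfer(self, src: int, dst: int, range_qps: float) -> None:
        # Assumption: the local update moves the range's QPS from src to dst.
        self.qps[src] -= range_qps
        self.qps[dst] += range_qps


def run(order: List[str]) -> Dict[int, float]:
    """Apply the updates in the given order and return the resulting view."""
    pool = StorePool({1: 1000.0, 2: 500.0})
    for step in order:
        if step == "local":
            # A range serving 200 QPS had its lease moved from store 1 to 2.
            pool.apply_local_transfer(src=1, dst=2, range_qps=200.0)
        elif step == "stale_gossip":
            # Gossip carrying pre-transfer values.
            pool.apply_gossip(1, 1000.0)
            pool.apply_gossip(2, 500.0)
        elif step == "fresh_gossip":
            # Gossip that already reflects the transfer.
            pool.apply_gossip(1, 800.0)
            pool.apply_gossip(2, 700.0)
    return pool.qps


print(run(["local"]))                  # intended estimate
print(run(["local", "stale_gossip"]))  # local estimate overwritten
print(run(["fresh_gossip", "local"]))  # transfer counted twice
```

In this model only the first ordering yields the intended view (800 and 700 QPS). The other two leave the pool believing either that no load moved or that twice as much moved, which is the kind of inconsistency that can prompt a further, unnecessary transfer.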
This issue previously existed in the store pool; however, the store rebalancer was unaffected because it kept a local copy of its own QPS and max threshold when considering rebalancing. It would, however, pick suboptimal targets because the store pool was inconsistent. The robust resolution to this class of inconsistency issues in the state used for allocation decisions is #93532. A shorter-term solution is to increase the capacity change gossip countdowns to a more reasonable number than 1%.
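As a rough illustration of why a 1% threshold fires on every transfer when there are few leases, here is a minimal Python sketch. It assumes the trigger is a relative change against the last gossiped value; `should_gossip_early` and the 25% alternative are hypothetical and are not taken from the kvserver code or the patch.

```python
def should_gossip_early(last_gossiped: float, current: float, threshold: float) -> bool:
    """Return True if the relative change since the last gossip meets the threshold.

    Hypothetical helper: models a capacity-change trigger as a simple
    relative-delta check against the last gossiped value.
    """
    if last_gossiped == 0:
        # Any change away from zero counts as a full change.
        return current != 0
    return abs(current - last_gossiped) / abs(last_gossiped) >= threshold


# A store holding 50 leases loses one lease: a 2% change.
print(should_gossip_early(50, 49, threshold=0.01))  # True: 1% fires on a single transfer
print(should_gossip_early(50, 49, threshold=0.25))  # False: a larger threshold waits
print(should_gossip_early(50, 30, threshold=0.25))  # True: a 40% change still triggers
```

With a larger threshold, small changes are left to the periodic gossip instead of triggering an immediate one, so there are fewer opportunities for a gossiped value to collide with a local store pool update.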
Have a patch, #93555, which resolves this issue.
93555: kvserver: gossip less aggressively on capacity +/- r=shralex a=kvoli

Gossip occurs periodically and on capacity changes, when the lease count, range count, queries per second, or writes per second change by more than some threshold since the last gossiped value. However, this causes issues with the store pool state when there are frequent capacity changes due to rebalancing, as the store pool state becomes inconsistent when gossip and local updates race. This induces thrashing in high-load clusters.

This patch reduces the likelihood of store pool races by increasing the threshold that capacity changes must exceed in order to trigger re-gossiping earlier than the default interval (10s).

resolves #93540

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Describe the problem
Load-based lease rebalancing is causing leases to thrash when running ycsb/b.
This is causing a perf regression of 15-20%.
To Reproduce
Expected behavior
Leases don't thrash.
Jira issue: CRDB-22387