kv: decommissioning is slower when adding nodes concurrently #79560

Closed
nvanbenschoten opened this issue Apr 7, 2022 · 4 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@nvanbenschoten (Member) commented Apr 7, 2022

When building out #77458, we added a variant of the benchmark that adds a new node to the cluster while simultaneously decommissioning another node. This is a common combination, since it amounts to "replacing a sick node with a new healthy node".

With interactions like #79249 in mind, I had a hunch this would behave poorly. Sure enough, decommissioning a node while adding another node to the cluster at the same time increases the time before the decommissioning completes. In the following example with 32 nodes, 8 stores each, and a 256MB/s snapshot rate, decommissioning proceeded at about 1/3 the speed while the new node was catching up. Once the upreplication to the new node completed, decommissioning sped up.

Timeline:
01:14 — decommissioning and upreplication to new node begins
01:33 — upreplication to new node completes
01:39 — decommissioning completes

[Screenshots omitted: two metrics graphs captured 2022-04-06, 9:38 PM and 9:39 PM]

Explanation

When upreplicating to the new node, all decommissioning-driven rebalancing decides to rebalance from the decommissioning node to the new node. These correlated decisions bottleneck the decommission. Recall that each store can accept one snapshot at a time. Without the new node, each of the S stores in the cluster serves as the destination for roughly 1/S of the replicas from the decommissioning node. With the concurrent upreplication, all replicas from the decommissioning node attempt to rebalance to the new node, forming a long queue.

Worse, these decommissioning decisions also get mixed in and queue along with other forms of rebalancing, which further slows down the rebalancing of replicas off the decommissioning node. This explains why the upreplication completes faster than the decommission.
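As a rough back-of-envelope sketch of why the funneling hurts (all numbers below are hypothetical, and real factors like sender-side limits and competing rebalance traffic are ignored):

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers, chosen only to illustrate the shape of the problem
	// (32 nodes x 8 stores, minus the decommissioning node's 8 stores).
	const (
		replicasToMove  = 2048 // replicas on the decommissioning node
		numStores       = 248  // stores outside the decommissioning node
		newNodeStores   = 8    // stores on the single new node
		snapshotSeconds = 2.0  // per-snapshot time, e.g. a ~512MB range at 256MB/s
	)

	// Spread case: destinations are spread evenly across the cluster. Each store
	// accepts one snapshot at a time, but different stores proceed in parallel.
	spread := float64(replicasToMove) / float64(numStores) * snapshotSeconds

	// Funnel case: every decommissioning-driven rebalance targets one of the few
	// stores on the new node, so their one-snapshot-at-a-time limits serialize
	// most of the queue.
	funnel := float64(replicasToMove) / float64(newNodeStores) * snapshotSeconds

	fmt.Printf("spread across stores: ~%.0fs, funneled to the new node: ~%.0fs\n", spread, funnel)
}
```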

Immediate takeaway

Running decommissioning and upreplication concurrently is probably still faster end-to-end than running them sequentially.

However, in a "sick node" scenario where the primary goal is to drain the decommissioning node and get it out of the cluster ASAP, operators should decommission first and only replace the decommissioned node after the decommissioning completes.

Potential fix (needs iteration)

This issue and #79249 both hint at a general problem where "optimal" local rebalance decisions may lead to underutilization if the execution of those decisions is delayed due to queueing. In such cases, there may often be a "close second best" choice for a rebalance which would not require any queuing.

Methods for breaking this herd behavior, like the Power of Two Random Choices, come to mind as possible changes to allocator ranking that could avoid correlated decision-making across a large cluster.
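For illustration, a minimal two-choices selection might look like the following (the `storeLoad` type and `pickPowerOfTwo` helper are hypothetical, not the allocator's real types):

```go
package sketch

import "math/rand"

type storeLoad struct {
	ID         int
	RangeCount int
}

// pickPowerOfTwo samples two random candidates and keeps the less loaded one.
// Two independent samples are enough to break the "everyone picks the same
// best store" herd while keeping load reasonably balanced. Illustrative only;
// this is not the allocator's actual ranking code.
func pickPowerOfTwo(rng *rand.Rand, candidates []storeLoad) storeLoad {
	a := candidates[rng.Intn(len(candidates))]
	b := candidates[rng.Intn(len(candidates))]
	if b.RangeCount < a.RangeCount {
		return b
	}
	return a
}
```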

In the specific case of rebalancing away from a decommissioned node (i.e. acting on an AllocatorReplaceDecommissioningVoter), we could do something simpler. When rebalancing away from a decommissioning node, we don't need to rebalance to the optimal store; we'd just like to rebalance to any reasonably valid destination. So instead of picking the store with the fewest replicas as the best candidate, we could pick a random store that matches all constraints and has less than the mean range count. Or we could come up with some other way to define a set of "good enough" candidates and pick randomly from them. This is similar to how we fixed AdminScatter by adding jitter to allocation decisions.
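A minimal sketch of that "good enough" selection, with hypothetical types that bear no relation to the real allocator interfaces:

```go
package sketch

import "math/rand"

type candidateStore struct {
	ID               int
	RangeCount       int
	MeetsConstraints bool // satisfies the range's zone/locality constraints
}

// pickGoodEnough returns a random store that satisfies constraints and sits
// below the mean range count, instead of always returning the single
// least-loaded store. Hypothetical helper, not the CockroachDB allocator API.
func pickGoodEnough(rng *rand.Rand, stores []candidateStore) (candidateStore, bool) {
	var sum int
	for _, s := range stores {
		sum += s.RangeCount
	}
	mean := float64(sum) / float64(len(stores))

	var goodEnough []candidateStore
	for _, s := range stores {
		if s.MeetsConstraints && float64(s.RangeCount) < mean {
			goodEnough = append(goodEnough, s)
		}
	}
	if len(goodEnough) == 0 {
		return candidateStore{}, false
	}
	return goodEnough[rng.Intn(len(goodEnough))], true
}
```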

An emergent shared theme between this issue and the AdminScatter change is that there seem to be two kinds of rebalancing:

  1. those that care a lot about the destination and less about the source (e.g. "upreplicate to node 12")
  2. those that care a lot about the source and less about the destination (e.g. "move replicas off node 16")

In the latter category, we would benefit from being less precise during costing to avoid herd behavior and to balance decisions more evenly across sufficiently comparable candidate stores.
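One way to picture that distinction in the candidate-selection step (everything here is hypothetical and is not a proposal for Allocator.AllocateVoter's actual signature):

```go
package sketch

import (
	"math/rand"
	"sort"
)

type scoredStore struct {
	ID    int
	Score float64 // higher means a better rebalance target
}

type rebalanceKind int

const (
	destinationDriven rebalanceKind = iota // e.g. "upreplicate to node 12"
	sourceDriven                           // e.g. "move replicas off node 16"
)

// pickTarget returns the strictly best candidate for destination-driven
// rebalances, but for source-driven rebalances it picks uniformly at random
// among all candidates whose score is within `slack` of the best, trading a
// little precision for decorrelated decisions across the cluster.
func pickTarget(rng *rand.Rand, kind rebalanceKind, scored []scoredStore, slack float64) scoredStore {
	sort.Slice(scored, func(i, j int) bool { return scored[i].Score > scored[j].Score })
	if kind == destinationDriven {
		return scored[0]
	}
	cutoff := scored[0].Score - slack
	n := 1
	for n < len(scored) && scored[n].Score >= cutoff {
		n++
	}
	return scored[rng.Intn(n)]
}
```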

Jira issue: CRDB-14905

Epic: CRDB-14621

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Apr 7, 2022
@nvanbenschoten nvanbenschoten changed the title kv: decommissioning slower when adding nodes concurrently kv: decommissioning is slower when adding nodes concurrently Apr 7, 2022
@nvanbenschoten (Member, Author) commented:

@aayushshah15 do you buy the part about there being two kinds of rebalancing and that Allocator.AllocateVoter should be taught about the distinction and cost candidates accordingly?

@rail (Member) commented May 25, 2022

Manually synced with Jira

@AlexTalks (Contributor) commented:

Should this be closed now that #86265 is done?

@lidorcarmel (Contributor) commented:

I think yes, thanks Alex.
