kv: decommissioning is slower when adding nodes concurrently #79560
Labels: A-kv-distribution (rebalancing and leasing), C-bug (code not up to spec/doc; solution expected to change code/behavior), T-kv (KV Team)
When building out #77458, we added a variant to the benchmark suite that adds a new node to the cluster at the same time that it decommissions a node. This is a common combination, as it can be seen as "replacing a sick node with a new healthy node".
With interactions like #79249 in mind, I had a hunch this would behave poorly. Sure enough, decommissioning a node while simultaneously adding another node to the cluster increases the time it takes for the decommission to complete. In the following example with 32 nodes, 8 stores each, and a 256 MB/s snapshot rate, decommissioning proceeded at roughly a third of its usual speed while the new node was catching up. Once upreplication to the new node completed, decommissioning sped up.
Explanation
When upreplicating to the new node, all decommissioning-driven rebalancing decides to move replicas from the decommissioning node to the new node. These correlated decisions bottleneck the decommission. Recall that each store can accept only one snapshot at a time. Without the new node, each of the cluster's S stores serves as a destination for roughly 1/S of the replicas on the decommissioning node. With the concurrent upreplication, every replica on the decommissioning node attempts to rebalance to the new node, forming a long queue.
Worse, these decommissioning-driven decisions get mixed in and queue alongside other forms of rebalancing, which further slows the movement of replicas off the decommissioning node. This also explains why the upreplication finishes before the decommission does.
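To put rough numbers on this, here is a back-of-the-envelope sketch. Only the one-snapshot-per-store limit and the 256 MB/s snapshot rate come from the experiment above; the replica size and count are hypothetical.

```go
package main

import "fmt"

func main() {
	const (
		snapshotRateMBps = 256.0  // per-store snapshot rate from the experiment above
		replicaSizeMB    = 512.0  // hypothetical average replica size
		replicaCount     = 2000.0 // hypothetical replica count on the decommissioning node
		fanOutStores     = 248.0  // 32 nodes * 8 stores, minus the 8 stores being drained
		newNodeStores    = 8.0    // stores on the newly added node
	)

	perReplicaSecs := replicaSizeMB / snapshotRateMBps

	// Fan-out case: each remaining store receives roughly 1/S of the replicas,
	// and since each store applies one snapshot at a time, queues stay short.
	fanOut := replicaCount / fanOutStores * perReplicaSecs

	// Herd case: every decommissioning-driven rebalance targets the new node,
	// so all replicas queue behind its handful of stores.
	herd := replicaCount / newNodeStores * perReplicaSecs

	fmt.Printf("fan-out drain time: ~%.0fs\n", fanOut) // ~16s
	fmt.Printf("herd drain time:    ~%.0fs\n", herd)   // ~500s
}
```

The exact numbers don't matter; the point is the large gap between fanning snapshots out across every store and queueing them all behind the new node's few stores.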
Immediate takeaway
Running decommissioning and upreplication concurrently is probably still faster end-to-end than running them one after the other.
However, in a "sick node" scenario where the primary goal is to drain the decommissioning node and get it out of the cluster ASAP, operators should decommission first and only replace the decommissioned node after the decommissioning completes.
Potential fix (needs iteration)
This issue and #79249 both hint at a general problem where "optimal" local rebalance decisions may lead to underutilization if the execution of those decisions is delayed due to queueing. In such cases, there may often be a "close second best" choice for a rebalance which would not require any queuing.
Techniques for breaking this herd behavior, such as the power of two random choices, come to mind as possible changes to allocator ranking that could avoid correlated decision-making across a large cluster. The sketch below illustrates the idea.
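A minimal sketch of power-of-two-choices selection over rebalance candidates (`candidateStore` and its fields are hypothetical stand-ins, not the actual allocator types):

```go
package main

import (
	"fmt"
	"math/rand"
)

// candidateStore is a stand-in for an allocator candidate; only the fields
// needed for this sketch are included.
type candidateStore struct {
	storeID    int
	rangeCount int
}

// pickPowerOfTwo samples two candidates uniformly at random and keeps the one
// with the lower range count. Compared with always picking the global minimum,
// this spreads decisions across many near-best stores instead of herding onto
// a single "optimal" one.
func pickPowerOfTwo(rng *rand.Rand, candidates []candidateStore) candidateStore {
	a := candidates[rng.Intn(len(candidates))]
	b := candidates[rng.Intn(len(candidates))]
	if b.rangeCount < a.rangeCount {
		return b
	}
	return a
}

func main() {
	rng := rand.New(rand.NewSource(42))
	candidates := []candidateStore{
		{storeID: 1, rangeCount: 120},
		{storeID: 2, rangeCount: 95},
		{storeID: 3, rangeCount: 5}, // the new, empty node: optimal, but a herd magnet
		{storeID: 4, rangeCount: 110},
	}

	// Simulate many independent rebalance decisions and count how often each
	// store is chosen; the empty store no longer absorbs every decision.
	counts := map[int]int{}
	for i := 0; i < 1000; i++ {
		counts[pickPowerOfTwo(rng, candidates).storeID]++
	}
	fmt.Println(counts)
}
```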
In the specific case of rebalancing away from a decommissioning node (i.e. acting on an `AllocatorReplaceDecommissioningVoter`), we could do something simpler. When rebalancing away from a decommissioning node, we don't need to rebalance to the optimal store; we'd just like to rebalance to any reasonably valid destination. So instead of picking the store with the fewest replicas as the best candidate, we could pick a random store that matches all constraints and has less than the mean range count. Or we could come up with some other way to define a set of "good enough" candidates and pick randomly from them. This is similar to how we fixed `AdminScatter` by adding jitter to allocation decisions.

An emergent shared theme between this issue and the `AdminScatter` change is that there seem to be two kinds of rebalancing:

- rebalancing whose goal is to place a replica on the best possible store, where precise ranking matters, and
- rebalancing whose goal is simply to move a replica off a particular store (decommissioning, scatter), where any reasonably good destination will do.
In the latter category, we would benefit from being less precise during costing to avoid herd behavior and to balance decisions more evenly across sufficiently comparable candidate stores.
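As an illustration of the "good enough candidates, pick randomly" idea above, a minimal sketch (constraint matching and the mean range count check are simplified; all names are hypothetical, not the actual allocator API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// candidateStore is a hypothetical stand-in for an allocator candidate.
type candidateStore struct {
	storeID            int
	rangeCount         int
	matchesConstraints bool
}

// pickGoodEnough returns a random store that satisfies the constraints and has
// a below-mean range count, rather than the single store with the fewest
// replicas. Any such store is an acceptable destination for a replica leaving
// a decommissioning node.
func pickGoodEnough(rng *rand.Rand, stores []candidateStore) (candidateStore, bool) {
	total := 0
	for _, s := range stores {
		total += s.rangeCount
	}
	mean := float64(total) / float64(len(stores))

	var goodEnough []candidateStore
	for _, s := range stores {
		if s.matchesConstraints && float64(s.rangeCount) < mean {
			goodEnough = append(goodEnough, s)
		}
	}
	if len(goodEnough) == 0 {
		return candidateStore{}, false
	}
	return goodEnough[rng.Intn(len(goodEnough))], true
}

func main() {
	rng := rand.New(rand.NewSource(1))
	stores := []candidateStore{
		{1, 60, true}, {2, 75, true}, {3, 5, true}, {4, 110, true}, {5, 150, false},
	}
	// Three stores qualify as "good enough"; the choice is spread among them
	// instead of always landing on the emptiest store.
	if s, ok := pickGoodEnough(rng, stores); ok {
		fmt.Printf("chose store %d\n", s.storeID)
	}
}
```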
Jira issue: CRDB-14905
Epic: CRDB-14621