kv: decommissioning is slower when adding nodes concurrently #79560
Labels: A-kv-distribution (rebalancing and leasing), C-bug (code not up to spec/doc; solution expected to change code/behavior), T-kv (KV Team)
When building out #77458, we added a variant to the benchmark suite that adds a new node to the cluster at the same time that it decommissions a node. This is a common combination, as it can be seen as "replacing a sick node with a new healthy node".
With interactions like #79249 in mind, I had a hunch this would behave poorly. Sure enough, decommissioning a node while simultaneously adding another node to the cluster increases the time it takes for the decommission to complete. In the following example with 32 nodes, 8 stores each, and a 256 MB/s snapshot rate, decommissioning proceeded at roughly a third of its usual speed while the new node was catching up. Once upreplication to the new node completed, decommissioning sped up.
Explanation
When upreplicating to the new node, all decommissioning-driven rebalancing decides to move replicas from the decommissioning node to the new node. These correlated decisions bottleneck the decommission. Recall that each store can accept only one snapshot at a time. Without the new node, each of the cluster's S stores serves as a destination for roughly 1/S of the replicas on the decommissioning node. With the concurrent upreplication, every replica on the decommissioning node attempts to rebalance to the new node, forming a long queue.
Worse, these decommissioning-driven decisions get mixed in and queue alongside other forms of rebalancing, which further slows the movement of replicas off the decommissioning node. This also explains why the upreplication finishes before the decommission does.
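To put rough numbers on this, here is a back-of-the-envelope sketch. Only the one-snapshot-per-store limit and the 256 MB/s snapshot rate come from the experiment above; the replica size and count are hypothetical.

```go
package main

import "fmt"

func main() {
	const (
		snapshotRateMBps = 256.0  // per-store snapshot rate from the experiment above
		replicaSizeMB    = 512.0  // hypothetical average replica size
		replicaCount     = 2000.0 // hypothetical replica count on the decommissioning node
		fanOutStores     = 248.0  // 32 nodes * 8 stores, minus the 8 stores being drained
		newNodeStores    = 8.0    // stores on the newly added node
	)

	perReplicaSecs := replicaSizeMB / snapshotRateMBps

	// Fan-out case: each remaining store receives roughly 1/S of the replicas,
	// and since each store applies one snapshot at a time, queues stay short.
	fanOut := replicaCount / fanOutStores * perReplicaSecs

	// Herd case: every decommissioning-driven rebalance targets the new node,
	// so all replicas queue behind its handful of stores.
	herd := replicaCount / newNodeStores * perReplicaSecs

	fmt.Printf("fan-out drain time: ~%.0fs\n", fanOut) // ~16s
	fmt.Printf("herd drain time:    ~%.0fs\n", herd)   // ~500s
}
```

The exact numbers don't matter; the point is the large gap between fanning snapshots out across every store and queueing them all behind the new node's few stores.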
Immediate takeaway
Running decommissioning and upreplication concurrently is probably still faster end-to-end than running them one after the other.
However, in a "sick node" scenario where the primary goal is to drain the decommissioning node and get it out of the cluster ASAP, operators should decommission first and only replace the decommissioned node after the decommissioning completes.
Potential fix (needs iteration)
This issue and #79249 both hint at a general problem where "optimal" local rebalance decisions may lead to underutilization if the execution of those decisions is delayed due to queueing. In such cases, there may often be a "close second best" choice for a rebalance which would not require any queuing.
Techniques for breaking this herd behavior, such as the power of two random choices, come to mind as possible changes to allocator ranking that could avoid correlated decision-making across a large cluster. The sketch below illustrates the idea.
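A minimal sketch of power-of-two-choices selection over rebalance candidates (`candidateStore` and its fields are hypothetical stand-ins, not the actual allocator types):

```go
package main

import (
	"fmt"
	"math/rand"
)

// candidateStore is a stand-in for an allocator candidate; only the fields
// needed for this sketch are included.
type candidateStore struct {
	storeID    int
	rangeCount int
}

// pickPowerOfTwo samples two candidates uniformly at random and keeps the one
// with the lower range count. Compared with always picking the global minimum,
// this spreads decisions across many near-best stores instead of herding onto
// a single "optimal" one.
func pickPowerOfTwo(rng *rand.Rand, candidates []candidateStore) candidateStore {
	a := candidates[rng.Intn(len(candidates))]
	b := candidates[rng.Intn(len(candidates))]
	if b.rangeCount < a.rangeCount {
		return b
	}
	return a
}

func main() {
	rng := rand.New(rand.NewSource(42))
	candidates := []candidateStore{
		{storeID: 1, rangeCount: 120},
		{storeID: 2, rangeCount: 95},
		{storeID: 3, rangeCount: 5}, // the new, empty node: optimal, but a herd magnet
		{storeID: 4, rangeCount: 110},
	}

	// Simulate many independent rebalance decisions and count how often each
	// store is chosen; the empty store no longer absorbs every decision.
	counts := map[int]int{}
	for i := 0; i < 1000; i++ {
		counts[pickPowerOfTwo(rng, candidates).storeID]++
	}
	fmt.Println(counts)
}
```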
In the specific case of rebalancing away from a decommissioning node (i.e. acting on an `AllocatorReplaceDecommissioningVoter`), we could do something simpler. When rebalancing away from a decommissioning node, we don't need to rebalance to the optimal store; we'd just like to rebalance to any reasonably valid destination. So instead of picking the store with the fewest replicas as the best candidate, we could pick a random store that matches all constraints and has less than the mean range count. Or we could come up with some other way to define a set of "good enough" candidates and pick randomly from them. This is similar to how we fixed `AdminScatter` by adding jitter to allocation decisions.

An emergent shared theme between this issue and the `AdminScatter` change is that there seem to be two kinds of rebalancing:

- rebalancing whose goal is to place a replica on the best possible store, where precise ranking matters, and
- rebalancing whose goal is simply to move a replica off a particular store (decommissioning, scatter), where any reasonably good destination will do.
In the latter category, we would benefit from being less precise during costing to avoid herd behavior and to balance decisions more evenly across sufficiently comparable candidate stores.
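As an illustration of the "good enough candidates, pick randomly" idea above, a minimal sketch (constraint matching and the mean range count check are simplified; all names are hypothetical, not the actual allocator API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// candidateStore is a hypothetical stand-in for an allocator candidate.
type candidateStore struct {
	storeID            int
	rangeCount         int
	matchesConstraints bool
}

// pickGoodEnough returns a random store that satisfies the constraints and has
// a below-mean range count, rather than the single store with the fewest
// replicas. Any such store is an acceptable destination for a replica leaving
// a decommissioning node.
func pickGoodEnough(rng *rand.Rand, stores []candidateStore) (candidateStore, bool) {
	total := 0
	for _, s := range stores {
		total += s.rangeCount
	}
	mean := float64(total) / float64(len(stores))

	var goodEnough []candidateStore
	for _, s := range stores {
		if s.matchesConstraints && float64(s.rangeCount) < mean {
			goodEnough = append(goodEnough, s)
		}
	}
	if len(goodEnough) == 0 {
		return candidateStore{}, false
	}
	return goodEnough[rng.Intn(len(goodEnough))], true
}

func main() {
	rng := rand.New(rand.NewSource(1))
	stores := []candidateStore{
		{1, 60, true}, {2, 75, true}, {3, 5, true}, {4, 110, true}, {5, 150, false},
	}
	// Three stores qualify as "good enough"; the choice is spread among them
	// instead of always landing on the emptiest store.
	if s, ok := pickGoodEnough(rng, stores); ok {
		fmt.Printf("chose store %d\n", s.storeID)
	}
}
```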
Jira issue: CRDB-14905
Epic: CRDB-14621