kv: model decommissioning/upreplication as a global queue of work #82475
Labels: A-kv-decom-rolling-restart, A-kv-distribution, C-enhancement, T-kv
Is your feature request related to a problem? Please describe.
Decommissioning is slow.
Background
We're introducing system-wide benchmarks (#81565) and improving per-store queueing behaviour (#80993 + #81005), which should help identify bottlenecks and address one of them. One likely bottleneck is our conservative snapshot rates (#14768 + #63728), introduced before admission control and chosen conservatively so as not to overwhelm storage nodes; here too we have ideas for making these rates more dynamic while still preserving store health (#80607 + #75066). Another recent body of work is generating snapshots from followers (#42491), which for us presents as more potential sources/choices to upreplicate from during decommissions.
Current structure
High-level view of how decommissioning works:
Step (2), upreplicating ranges away from the decommissioning node, is the slowest part. To try and formalize how long it's going to take: let `R0` be the set of ranges with replicas on the decommissioning node, and partition it as `R0 = R0_S1 + R0_S2 + …`, where `R0_SN` is the subset of ranges whose snapshot sender (not necessarily the leaseholder) is on node `N`. Then:

```
time to send all snapshots = max(bytes(R0_S1), …, bytes(R0_SN)) / snapshot send rate
```

(We could also have per-`R0_SN` send rates.) This tells us that to go as fast as possible, we want to minimize the snapshot bytes generated by the node sending the maximum number of bytes. For completeness, the receiver-side behaviour:
Partition the same `R0` (ranges with replicas on the decommissioning node, snapshots for which need to be received somewhere) as `R0 = R0_R1 + R0_R2 + …`, where `R0_RN` is the subset of ranges whose replica will be moved to node `N` because of the decommission. Then:

```
time to receive all snapshots = max(bytes(R0_R1), …, bytes(R0_RN)) / snapshot receive rate
```

(We could also have per-`R0_RN` receive rates.) This tells us we want to minimize the number of bytes received by the node receiving the maximum number of bytes. The overall decommissioning time is then `max(time to receive all snapshots, time to send all snapshots)`.
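To make the arithmetic concrete, here's a small back-of-the-envelope sketch in Go (standalone, not CockroachDB code; the node names, byte counts, and 32 MiB/s rates are made-up numbers) that evaluates both bounds and takes their max:

```go
package main

import (
	"fmt"
	"time"
)

// maxBytes returns the largest per-node byte count.
func maxBytes(byNode map[string]int64) int64 {
	var m int64
	for _, b := range byNode {
		if b > m {
			m = b
		}
	}
	return m
}

// estimate returns max(time to send all snapshots, time to receive all
// snapshots), i.e. the overall decommissioning time under the model above.
func estimate(bytesBySender, bytesByReceiver map[string]int64, sendRate, recvRate int64) time.Duration {
	sendSecs := float64(maxBytes(bytesBySender)) / float64(sendRate)
	recvSecs := float64(maxBytes(bytesByReceiver)) / float64(recvRate)
	secs := sendSecs
	if recvSecs > secs {
		secs = recvSecs
	}
	return time.Duration(secs * float64(time.Second))
}

func main() {
	const MiB, GiB = int64(1) << 20, int64(1) << 30
	// Hypothetical numbers: three senders, two receivers, 32 MiB/s both ways.
	send := map[string]int64{"n2": 300 * GiB, "n3": 200 * GiB, "n4": 100 * GiB}
	recv := map[string]int64{"n5": 350 * GiB, "n6": 250 * GiB}
	// Dominated by n5 receiving 350 GiB: prints 3h6m40s.
	fmt.Println(estimate(send, recv, 32*MiB, 32*MiB))
}
```

With these made-up numbers the receive side dominates: 350 GiB at 32 MiB/s is roughly 3.1 hours, no matter how quickly the other nodes finish their share.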
Proposed structure / the solution you'd like
Looking at the above, we're relying on uncoordinated, per-store snapshot generation that targets whatever destination each store picks, with little visibility into receiver-side snapshot queueing. This can have bad tail properties (something #81565 may help confirm). I wonder if basic load-balancer ideas apply here: we have a global queue of work to be done (send some snapshot from the set `R0` to the least-utilized receiver) that every sender can pull from, instead of each sender trying to coordinate independently. I assume this becomes more pressing once we have more sources for snapshots (i.e. followers). A rough sketch of what such a queue could look like is below.
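This is a minimal sketch, assuming a single coordinator owning the queue; the names and types are illustrative, not an existing CockroachDB API:

```go
package snapqueue

import "sync"

// SnapshotTask is one unit of work: send a snapshot for a range in R0.
type SnapshotTask struct {
	RangeID int64
	Bytes   int64
	Target  string // chosen receiver, filled in when the task is handed out
}

// WorkQueue is the global queue of ranges in R0 that still need a snapshot.
// Senders pull from it instead of deciding independently; each pull is paired
// with the currently least-utilized receiver.
type WorkQueue struct {
	mu       sync.Mutex
	pending  []SnapshotTask
	inflight map[string]int64 // receiver -> snapshot bytes currently headed its way
}

// NewWorkQueue seeds the queue with the pending tasks and candidate receivers.
func NewWorkQueue(tasks []SnapshotTask, receivers []string) *WorkQueue {
	q := &WorkQueue{pending: tasks, inflight: make(map[string]int64)}
	for _, r := range receivers {
		q.inflight[r] = 0
	}
	return q
}

// Next hands out the next task, targeting whichever receiver has the fewest
// in-flight snapshot bytes. A real version would also restrict the task to
// ranges the calling store can actually send (follower snapshots widen that
// set). Returns false once R0 is drained.
func (q *WorkQueue) Next() (SnapshotTask, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return SnapshotTask{}, false
	}
	task := q.pending[0]
	q.pending = q.pending[1:]
	best, bestBytes := "", int64(-1)
	for r, b := range q.inflight {
		if bestBytes < 0 || b < bestBytes {
			best, bestBytes = r, b
		}
	}
	task.Target = best
	q.inflight[best] += task.Bytes
	return task, true
}

// Done releases the receiver's in-flight budget once a snapshot finishes (or
// fails), so later pulls see up-to-date utilization.
func (q *WorkQueue) Done(task SnapshotTask) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inflight[task.Target] -= task.Bytes
}
```

Each sender would loop on `Next`/`Done`; because receiver utilization is tracked in one place, an overloaded receiver stops being an accident of whichever sender happened to pick it.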
Additional context
See the linked issues in the Background section. We're also interested in improving observability (#74158). One idea there is to structure decommissioning as a job: #74158 (comment). In addition to other benefits, a job gives us a place to maintain this global queue and to orchestrate; a rough sketch of what that could look like follows below.
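A rough, hedged sketch of what such a job body might own, assuming a queue abstraction like the one above (the names, the `Queue` interface, and the 10s cadence are hypothetical; this is not the actual jobs API):

```go
package snapqueue

import (
	"context"
	"fmt"
	"time"
)

// Queue is whatever global work queue the job maintains, e.g. a thin wrapper
// around the WorkQueue sketched above.
type Queue interface {
	Add(rangeIDs []int64)
	Remaining() int
}

// RunDecommission is roughly what a decommissioning job's body could look
// like: seed the queue with R0 and report progress until it drains. Senders
// elsewhere pull and execute the actual snapshot work.
func RunDecommission(ctx context.Context, q Queue, nodeID int, r0 []int64) error {
	total := len(r0)
	q.Add(r0)
	ticker := time.NewTicker(10 * time.Second) // made-up cadence
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		remaining := q.Remaining()
		// Natural hook for the observability asked for in #74158: progress is
		// tracked in one place instead of being inferred from per-store metrics.
		fmt.Printf("decommission n%d: %d/%d ranges remaining\n", nodeID, remaining, total)
		if remaining == 0 {
			return nil
		}
	}
}
```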
Jira issue: CRDB-16412