kv: model decommissioning/upreplication as a global queue of work #82475
Labels: A-kv-decom-rolling-restart, A-kv-distribution, C-enhancement, T-kv
Is your feature request related to a problem? Please describe.
Decommissioning is slow.
Background
We're introducing system-wide benchmarks (#81565) and improving per-store queueing behaviour (#80993 + #81005), which should help identify bottlenecks and address one of them. One likely bottleneck is our conservative snapshot rates (#14768 + #63728), introduced before admission control and chosen conservatively so as not to overwhelm storage nodes; here too we have ideas for making these rates more dynamic while still preserving store health (#80607 + #75066). Another recent body of work is generating snapshots from followers (#42491), which for us presents as more potential sources/choices to upreplicate from during decommissions.
Current structure
High-level view of how decommissioning works:
Step (2), upreplicating ranges away from the decommissioning node, is the slowest part. To try and formalize how long it's going to take: let `R0` be the set of ranges with replicas on the decommissioning node, and partition it as `R0 = R0_S1 + R0_S2 + …`, where `R0_SN` is the subset of ranges whose snapshot sender (not necessarily the leaseholder) is on node `N`. Then:

```
time to send all snapshots = max(bytes(R0_S1), …, bytes(R0_SN)) / snapshot send rate
```

(We could also have per-`R0_SN` send rates.) This tells us that to go as fast as possible, we want to minimize the snapshot bytes generated by the node sending the maximum number of bytes. For completeness, the receiver-side behaviour:
Partition the same `R0` (ranges with replicas on the decommissioning node, snapshots for which need to be received somewhere) as `R0 = R0_R1 + R0_R2 + …`, where `R0_RN` is the subset of ranges whose replica will be moved to node `N` because of the decommission. Then:

```
time to receive all snapshots = max(bytes(R0_R1), …, bytes(R0_RN)) / snapshot receive rate
```

(We could also have per-`R0_RN` receive rates.) This tells us we want to minimize the number of bytes received by the node receiving the maximum number of bytes. The overall decommissioning time is then `max(time to receive all snapshots, time to send all snapshots)`.
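To make the arithmetic concrete, here's a small back-of-the-envelope sketch in Go (standalone, not CockroachDB code; the node names, byte counts, and 32 MiB/s rates are made-up numbers) that evaluates both bounds and takes their max:

```go
package main

import (
	"fmt"
	"time"
)

// maxBytes returns the largest per-node byte count.
func maxBytes(byNode map[string]int64) int64 {
	var m int64
	for _, b := range byNode {
		if b > m {
			m = b
		}
	}
	return m
}

// estimate returns max(time to send all snapshots, time to receive all
// snapshots), i.e. the overall decommissioning time under the model above.
func estimate(bytesBySender, bytesByReceiver map[string]int64, sendRate, recvRate int64) time.Duration {
	sendSecs := float64(maxBytes(bytesBySender)) / float64(sendRate)
	recvSecs := float64(maxBytes(bytesByReceiver)) / float64(recvRate)
	secs := sendSecs
	if recvSecs > secs {
		secs = recvSecs
	}
	return time.Duration(secs * float64(time.Second))
}

func main() {
	const MiB, GiB = int64(1) << 20, int64(1) << 30
	// Hypothetical numbers: three senders, two receivers, 32 MiB/s both ways.
	send := map[string]int64{"n2": 300 * GiB, "n3": 200 * GiB, "n4": 100 * GiB}
	recv := map[string]int64{"n5": 350 * GiB, "n6": 250 * GiB}
	// Dominated by n5 receiving 350 GiB: prints 3h6m40s.
	fmt.Println(estimate(send, recv, 32*MiB, 32*MiB))
}
```

With these made-up numbers the receive side dominates: 350 GiB at 32 MiB/s is roughly 3.1 hours, no matter how quickly the other nodes finish their share.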
Proposed structure / the solution you'd like
Looking at the above, we're relying on uncoordinated, per-store snapshot generation that targets whatever destination each store picks, with little visibility into receiver-side snapshot queueing. This can have bad tail properties (something #81565 may help confirm). I wonder if basic load-balancer ideas apply here: we have a global queue of work to be done (send some snapshot from the set `R0` to the least-utilized receiver) that every sender can pull from, instead of each sender trying to coordinate independently. I assume this becomes more pressing once we have more sources for snapshots (i.e. followers). A rough sketch of what such a queue could look like is below.
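This is a minimal sketch, assuming a single coordinator owning the queue; the names and types are illustrative, not an existing CockroachDB API:

```go
package snapqueue

import "sync"

// SnapshotTask is one unit of work: send a snapshot for a range in R0.
type SnapshotTask struct {
	RangeID int64
	Bytes   int64
	Target  string // chosen receiver, filled in when the task is handed out
}

// WorkQueue is the global queue of ranges in R0 that still need a snapshot.
// Senders pull from it instead of deciding independently; each pull is paired
// with the currently least-utilized receiver.
type WorkQueue struct {
	mu       sync.Mutex
	pending  []SnapshotTask
	inflight map[string]int64 // receiver -> snapshot bytes currently headed its way
}

// NewWorkQueue seeds the queue with the pending tasks and candidate receivers.
func NewWorkQueue(tasks []SnapshotTask, receivers []string) *WorkQueue {
	q := &WorkQueue{pending: tasks, inflight: make(map[string]int64)}
	for _, r := range receivers {
		q.inflight[r] = 0
	}
	return q
}

// Next hands out the next task, targeting whichever receiver has the fewest
// in-flight snapshot bytes. A real version would also restrict the task to
// ranges the calling store can actually send (follower snapshots widen that
// set). Returns false once R0 is drained.
func (q *WorkQueue) Next() (SnapshotTask, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return SnapshotTask{}, false
	}
	task := q.pending[0]
	q.pending = q.pending[1:]
	best, bestBytes := "", int64(-1)
	for r, b := range q.inflight {
		if bestBytes < 0 || b < bestBytes {
			best, bestBytes = r, b
		}
	}
	task.Target = best
	q.inflight[best] += task.Bytes
	return task, true
}

// Done releases the receiver's in-flight budget once a snapshot finishes (or
// fails), so later pulls see up-to-date utilization.
func (q *WorkQueue) Done(task SnapshotTask) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inflight[task.Target] -= task.Bytes
}
```

Each sender would loop on `Next`/`Done`; because receiver utilization is tracked in one place, an overloaded receiver stops being an accident of whichever sender happened to pick it.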
Additional context
See the linked issues in the Background section. We're also interested in improving observability (#74158). One idea there is to structure decommissioning as a job: #74158 (comment). In addition to other benefits, a job gives us a place to maintain this global queue and to orchestrate; a rough sketch of what that could look like follows below.
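A rough, hedged sketch of what such a job body might own, assuming a queue abstraction like the one above (the names, the `Queue` interface, and the 10s cadence are hypothetical; this is not the actual jobs API):

```go
package snapqueue

import (
	"context"
	"fmt"
	"time"
)

// Queue is whatever global work queue the job maintains, e.g. a thin wrapper
// around the WorkQueue sketched above.
type Queue interface {
	Add(rangeIDs []int64)
	Remaining() int
}

// RunDecommission is roughly what a decommissioning job's body could look
// like: seed the queue with R0 and report progress until it drains. Senders
// elsewhere pull and execute the actual snapshot work.
func RunDecommission(ctx context.Context, q Queue, nodeID int, r0 []int64) error {
	total := len(r0)
	q.Add(r0)
	ticker := time.NewTicker(10 * time.Second) // made-up cadence
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		remaining := q.Remaining()
		// Natural hook for the observability asked for in #74158: progress is
		// tracked in one place instead of being inferred from per-store metrics.
		fmt.Printf("decommission n%d: %d/%d ranges remaining\n", nodeID, remaining, total)
		if remaining == 0 {
			return nil
		}
	}
}
```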
Jira issue: CRDB-16412