kv: model decommissioning/upreplication as a global queue of work #82475

Open
irfansharif opened this issue Jun 6, 2022 · 2 comments

Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@irfansharif
Contributor

irfansharif commented Jun 6, 2022

Is your feature request related to a problem? Please describe.

Decommissioning is slow.

Background

We're introducing system-wide benchmarks (#81565) and improving per-store queueing behaviour (#80993 + #81005), which will help identify bottlenecks and address one of them. One likely bottleneck is conservative snapshot rates (#14768 + #63728), introduced pre-admission control and chosen conservatively so as not to overwhelm storage nodes; here too we have ideas for making these rates more dynamic while still preserving store health (#80607 + #75066). Another recent body of work has been around generating snapshots from followers (#42491), which for us presents more potential sources/choices to upreplicate from during decommissions.

Current structure

High-level view of how decommissioning works (a rough code sketch follows the list):

  1. We flip a bit on the liveness record, marking the node as decommission-ing;
  2. Individual stores learn about this bit flip (the record is gossiped) and, for ranges where they hold the lease, attempt to move a replica away from the decommissioning node to another node;
  3. Once the node being decommissioned has no more replicas, we mark it as fully decommission-ed, thus excluding it from further participation in the cluster.
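
Roughly, as a hedged sketch in code (the names and interfaces below are invented for illustration; they're not the actual code paths, which live in the liveness and replica-rebalancing machinery):

```go
// A rough sketch of the three-step flow above; the names and interfaces
// are illustrative placeholders, not CockroachDB's actual APIs.
package decomsketch

import "time"

type NodeID int

type Cluster interface {
	// Step 1: flip the decommissioning bit on the node's liveness record.
	MarkDecommissioning(n NodeID) error
	// Replicas still present on the node. The liveness record is gossiped,
	// so each store's replicate queue independently moves these away (step 2).
	ReplicaCount(n NodeID) (int, error)
	// Step 3: flip the terminal bit once the node holds no replicas.
	MarkDecommissioned(n NodeID) error
}

func decommission(c Cluster, n NodeID) error {
	if err := c.MarkDecommissioning(n); err != nil {
		return err
	}
	for {
		count, err := c.ReplicaCount(n)
		if err != nil {
			return err
		}
		if count == 0 {
			break // the node is empty; safe to finalize
		}
		// Step 2 happens asynchronously on whichever stores hold the leases
		// for the affected ranges; here we just poll until it's done.
		time.Sleep(time.Second)
	}
	return c.MarkDecommissioned(n)
}
```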

Step (2) is the slowest part. To try to formalize how long it's going to take:

  • Let R0 be the set of ranges with replicas on the decommissioning node
  • R0 = R0_S1 + R0_S2 + … where R0_SN is the subset of ranges in R0 whose snapshot sender (not necessarily the leaseholder) is on node N
  • Assuming maximal sender-side parallelism + no receiver-side queuing, time to send all snapshots = max(bytes(R0_S1), …, bytes(R0_SN))/snapshot send rate (could also have per-R0_SN send rates).

This tells us that to go as fast as possible, we want to minimize the snapshot bytes generated by the node sending the maximum number of bytes. For completeness, to understand receiver-side behaviour:

  • Let R0 be the set of ranges with replicas on the decommissioning node, snapshots for which need to be received
  • R0 = R0_R1 + R0_R2 + … where R0_RN is the subset of ranges in R0 whose replica will be moved to node N because of the decommission
  • Assuming maximal receiver-side parallelism + no sender-side queuing, time to receive all snapshots = max(bytes(R0_R1), …, bytes(R0_RN))/snapshot receive rate (could also have per-R0_RN receive rates)

This tells us we want to minimize the number of bytes received by the node receiving the maximum number of bytes. The overall decommissioning time is then max(time to receive all snapshots, time to send all snapshots).
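
To make these bounds concrete, here's a back-of-the-envelope with entirely made-up byte counts and rates; it just evaluates the max(...)/rate expressions above:

```go
// Back-of-the-envelope for the send/receive bounds above; all numbers
// are made up.
package main

import (
	"fmt"
	"math"
)

func main() {
	// bytes(R0_S1..S3): snapshot bytes each sender has to ship;
	// bytes(R0_R1..R3): snapshot bytes each receiver has to ingest.
	senderBytes := []float64{40e9, 25e9, 10e9}
	receiverBytes := []float64{30e9, 30e9, 15e9}
	const sendRate, recvRate = 32e6, 32e6 // bytes/sec per node (hypothetical)

	maxOf := func(xs []float64) float64 {
		m := xs[0]
		for _, x := range xs[1:] {
			m = math.Max(m, x)
		}
		return m
	}

	sendTime := maxOf(senderBytes) / sendRate // the slowest sender dominates
	recvTime := maxOf(receiverBytes) / recvRate
	fmt.Printf("send-bound: %.0fs, receive-bound: %.0fs, overall: %.0fs\n",
		sendTime, recvTime, math.Max(sendTime, recvTime))
}
```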

Proposed structure / the solution you'd like

Looking at the above, we're relying on uncoordinated, per-store snapshot generation targeting whatever destination each store picks, with little visibility into receiver-side snapshot queuing. This can have bad tail properties (something #81565 perhaps helps confirm). I wonder if basic load-balancer ideas apply here: we'd have a global queue of work to be done (send some snapshot from the set R0 to the least-utilized receiver) that every sender can pull from, instead of trying to coordinate independently. I assume this becomes more pressing once we have more sources for snapshots (i.e. followers).
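
As a very rough sketch of what such a global queue could look like (all types and names here are hypothetical; this says nothing about where the queue would actually live):

```go
// Sketch of a global queue of snapshot work that senders pull from, with
// each snapshot assigned to the currently least-utilized eligible receiver.
// Everything here is hypothetical and invented for illustration.
package queuesketch

import "sync"

type RangeID int
type StoreID int

type snapshotTask struct {
	rng  RangeID
	dest StoreID // chosen as the least-utilized eligible receiver
}

type globalQueue struct {
	mu       sync.Mutex
	pending  []RangeID             // ranges in R0 still needing a snapshot
	inflight map[StoreID]int       // per-receiver queue depth, a crude utilization proxy
	eligible map[RangeID][]StoreID // valid rebalance targets per range (assumed non-empty)
}

// next hands the calling sender (any store with a snapshot source for the
// range, not necessarily the leaseholder) its next unit of work.
func (q *globalQueue) next() (snapshotTask, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return snapshotTask{}, false
	}
	rng := q.pending[0]
	q.pending = q.pending[1:]

	// Pick the least-loaded eligible receiver rather than letting each
	// sender choose a destination blindly.
	dest := q.eligible[rng][0]
	for _, s := range q.eligible[rng][1:] {
		if q.inflight[s] < q.inflight[dest] {
			dest = s
		}
	}
	q.inflight[dest]++
	return snapshotTask{rng: rng, dest: dest}, true
}

// done is called by the sender once the snapshot has been sent and applied.
func (q *globalQueue) done(t snapshotTask) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inflight[t.dest]--
}
```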

Additional context

See linked issues in the Background section. We're also interested in improving observability (#74158). One idea there is to structure decommissioning as a job: #74158 (comment). In addition to other benefits, that gives us a place to maintain this global queue + orchestrate from.

Jira issue: CRDB-16412

@irfansharif irfansharif changed the title kv: model decommissioning as a global queue of work kv: model decommissioning/upreplication as a global queue of work Jul 13, 2022
@andrewbaptist
Collaborator

One thing to note is that the number of snapshots each node sends is not uniform. The decommissioning node must send R0/C, while each of the other nodes sends R0 · ((C − 1)/C)/N (C is the number of replicas per range, and N is the number of nodes in the system). The intuition behind this is that only the ranges that overlap with the decommissioning node need to be moved. Since the node being decommissioned by definition overlaps with all of the ranges on it, and the other nodes overlap with fewer, each of them sends a lot less.

This can be addressed by first running a drain command. After a drain, each node other than the decommissioned node will have more replicas to move; however, they will all have a similar number, R0/(N − 1), which is generally going to be much less than the R0/C that would otherwise need to be moved per node.
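
To make the arithmetic concrete, a toy example with made-up numbers (reading "drain" as moving leases, and with them snapshot-sending duty, off the node first):

```go
// Toy numbers for the per-node send counts above; the figures are made up
// purely to illustrate the ratios.
package main

import "fmt"

func main() {
	const (
		r0 = 3000.0 // |R0|: ranges with a replica on the decommissioning node
		c  = 3.0    // C: replicas per range
		n  = 10.0   // N: nodes in the system
	)

	// Without a prior drain: the decommissioning node is the snapshot source
	// for R0/C of its ranges, while the remaining work is spread thinly.
	fromDecomNode := r0 / c                   // 1000 snapshots
	fromEachOtherNode := r0 * (c - 1) / c / n // 200 snapshots

	// Drain first: the decommissioning node sends nothing, and the work is
	// split roughly evenly across the remaining nodes.
	afterDrainPerNode := r0 / (n - 1) // ~333 snapshots; no 1000-snapshot straggler

	fmt.Println(fromDecomNode, fromEachOtherNode, afterDrainPerNode)
}
```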

