server: proactively rebalance decommissioning nodes' replicas #80836

Conversation

@aayushshah15 (Contributor) commented May 1, 2022

Previously, when a node was marked DECOMMISSIONING, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their replicaScanner to gradually
run into the decommissioning node's ranges and rely on their
replicateQueue to then rebalance them.

This had a few issues:

  1. It meant that even when decommissioning a mostly empty node, our
    worst case lower bound for marking that node fully decommissioned was
    one full scanner interval (which is 10 minutes by default).
  2. If the replicateQueue ran into an error while rebalancing a
    decommissioning replica (see #79266, "kvserver: nodes flapping on their
    liveness can stall cluster recovery operations", for instance), it would
    only retry that replica after either one full scanner interval or after
    the purgatory interval. This meant that decommissioning could take
    excessively long towards the tail end of the process.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be DECOMMISSIONING.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's replicateQueues. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All this is managed by a lightweight decommissionMonitor, which is
responsible for managing the lifecycle of these async tasks.
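
For illustration only, here is a minimal, self-contained Go sketch of the flow described above. Apart from the decommissionMonitor name, every identifier, signature, and interval below is a hypothetical stand-in rather than the actual CockroachDB implementation:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// NodeID and Replica are stand-ins for the real types.
type NodeID int

type Replica struct{ RangeID int }

// decommissionMonitor owns one nudger task per decommissioning node.
type decommissionMonitor struct {
    mu      sync.Mutex
    cancels map[NodeID]context.CancelFunc

    // replicasOn returns the local replicas whose range also has a replica
    // on the given (decommissioning) node.
    replicasOn func(NodeID) []Replica
    // enqueue hands a replica to the local replicateQueue for rebalancing.
    enqueue func(Replica)
}

// onDecommissioning is the idempotent callback invoked whenever a node is
// observed (e.g. via gossip) to be DECOMMISSIONING. Repeat invocations for
// the same node are no-ops while its nudger task is still running.
func (m *decommissionMonitor) onDecommissioning(ctx context.Context, n NodeID) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if _, ok := m.cancels[n]; ok {
        return
    }
    ctx, cancel := context.WithCancel(ctx)
    m.cancels[n] = cancel
    go m.nudge(ctx, n)
}

// nudge proactively enqueues the decommissioning node's replicas right away,
// then periodically re-enqueues any stragglers until cancelled.
func (m *decommissionMonitor) nudge(ctx context.Context, n NodeID) {
    ticker := time.NewTicker(time.Minute) // illustrative nudge interval
    defer ticker.Stop()
    for {
        for _, r := range m.replicasOn(n) {
            m.enqueue(r)
        }
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}

// onDecommissioned tears the nudger task down once the node is fully
// decommissioned (or recommissioned).
func (m *decommissionMonitor) onDecommissioned(n NodeID) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if cancel, ok := m.cancels[n]; ok {
        cancel()
        delete(m.cancels, n)
    }
}

func main() {
    m := &decommissionMonitor{
        cancels:    map[NodeID]context.CancelFunc{},
        replicasOn: func(NodeID) []Replica { return []Replica{{RangeID: 7}} },
        enqueue:    func(r Replica) { fmt.Println("enqueued range", r.RangeID) },
    }
    m.onDecommissioning(context.Background(), 4) // n4 is DECOMMISSIONING
    time.Sleep(10 * time.Millisecond)            // let the first pass run
    m.onDecommissioned(4)
}

The map of cancel functions is what makes the callback idempotent: a repeat notification for a node whose task is still running returns immediately instead of spawning another goroutine.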

Note: This PR is an alternative to, but subsumes, #80695. The main
difference between this patch and #80695 is that, here, every node that
learns about a decommissioning node will spin up a local nudger goroutine
for it, whereas in #80695, only the node that sent the DecommissionRequest
spins up a nudger goroutine, which then sends RPCs to other nodes in the
system to enqueue the decommissioning node's replicas.

Resolves #79453

Release note: None

@cockroach-teamcity (Member): This change is Reviewable

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 3 times, most recently from 88ac70e to d54993c on May 2, 2022 17:34
@aayushshah15 (Contributor, Author) commented May 2, 2022

Comparing master vs. this patch on a 6-node cluster with the TPC-C 200-warehouse dataset:

master

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";
0.17s user 0.15s system 0% cpu 3:47.34 total

this patch

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";
0.07s user 0.02s system 0% cpu 48.124 total
Repro steps (`rp` is an alias for `roachprod`):
USER=aayushs
CLUSTER=${USER}-decommission-test;
rp destroy $CLUSTER
rp create $CLUSTER --nodes 6
./build/builder.sh mkrelease
rp put $CLUSTER ./cockroach-linux-2.6.32-gnu-amd64 cockroach
rp start $CLUSTER

rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='1GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='1GiB'";
rp ssh $CLUSTER:1 -- './cockroach workload fixtures import tpcc --warehouses=200 {pgurl:1} --checks=false'

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 2 times, most recently from 1a7b3d2 to 9a81a3b on May 2, 2022 17:56
@aayushshah15 marked this pull request as ready for review May 2, 2022 17:58
@aayushshah15 requested reviews from teams as code owners May 2, 2022 17:58
@aayushshah15 requested reviews from shermanCRL, stevendanna, nvanbenschoten, AlexTalks and shralex and removed the request for a team, shermanCRL and stevendanna May 2, 2022 17:58
@aayushshah15 (Contributor, Author) commented:

Hey @AlexTalks, this is the PR that came up in today's retro. Since you're going to be leading a lot of the decommissioning work next release, I'd appreciate any thoughts you have on this. Please let me know if any of the motivations behind this patch are unclear.

@aayushshah15 (Contributor, Author) commented:

I ran a slightly larger test comparing decommissioning a node from a cluster running this patch vs. master, and the results look promising.

Running:

USER=aayushs
CLUSTER=${USER}-decommission-test;
rp destroy $CLUSTER
rp create $CLUSTER --nodes 7 --gce-machine-type=n1-standard-16
rp stage $CLUSTER cockroach
rp start $CLUSTER

rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='1GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='1GiB'";
rp ssh $CLUSTER:7 -- './cockroach workload fixtures import tpcc --warehouses=4000 {pgurl:1} --checks=false'

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";

On master:

roachprod ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self"  0.76s user 0.48s system 0% cpu 13:00.18 total

On this patch:

roachprod ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self"  0.31s user 0.45s system 0% cpu 6:34.53 total

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 2 times, most recently from 323e720 to 3eb3179 on May 3, 2022 15:57
@aayushshah15 (Contributor, Author) commented:

Closing in favor of #80993

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 4, 2022
… replicas

Note: This patch implements a subset of cockroachdb#80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst
case lower bound for marking that node fully decommissioned was _one
full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
When it is run, the callback enqueues all the replicas on the local
stores that are on ranges that also have replicas on the decommissioning
node.

Release note: None
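
For illustration, here is a minimal Go sketch of the enqueue step this commit message describes: walk the local replicas and hand any range that also has a replica on the decommissioning node to the replicate queue. Every identifier below is a hypothetical stand-in, not the real CockroachDB types or API:

package main

import "fmt"

type NodeID int

// RangeDesc is a stand-in for a range descriptor: the range ID plus the
// nodes currently holding a replica of it.
type RangeDesc struct {
    RangeID int
    Nodes   []NodeID
}

// enqueueReplicasFor hands every local range that also has a replica on the
// decommissioning node to the (stubbed) replicate queue.
func enqueueReplicasFor(decommissioning NodeID, local []RangeDesc, enqueue func(RangeDesc)) {
    for _, desc := range local {
        for _, n := range desc.Nodes {
            if n == decommissioning {
                enqueue(desc)
                break
            }
        }
    }
}

func main() {
    local := []RangeDesc{
        {RangeID: 1, Nodes: []NodeID{1, 2, 4}}, // overlaps the decommissioning node
        {RangeID: 2, Nodes: []NodeID{1, 2, 3}}, // untouched
    }
    // n4 is the DECOMMISSIONING node; only range 1 gets enqueued.
    enqueueReplicasFor(4, local, func(d RangeDesc) {
        fmt.Println("enqueued range", d.RangeID)
    })
}

The stubbed enqueue callback stands in for handing the replica to the local store's replicateQueue, as described above.
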
aayushshah15 added commits with the same message (later revisions updating only the release note) that referenced this pull request May 5, May 17, Jun 7, and Jun 8, 2022
craig bot pushed a commit that referenced this pull request Jun 8, 2022
80993: server: react to decommissioning nodes by proactively enqueuing their replicas r=aayushshah15 a=aayushshah15

Note: This patch implements a subset of #80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst
case lower bound for marking that node fully decommissioned was _one
full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
When it is run, the callback enqueues all the replicas on the local
stores that are on ranges that also have replicas on the decommissioning
node. Note that when nodes in the system restart, they'll re-invoke this callback
for any already `DECOMMISSIONING` node. 

Resolves #79453

Release note (performance improvement): Decommissioning should now be
substantially faster, particularly for small to moderately loaded nodes.


Co-authored-by: Aayush Shah <[email protected]>
aayushshah15 added a commit with the same message that referenced this pull request Jun 9, 2022