server: proactively rebalance decommissioning nodes' replicas #80836

Conversation

@aayushshah15 (Contributor) commented May 1, 2022

Previously, when a node was marked DECOMMISSIONING, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their replicaScanner to gradually
run into the decommissioning node's ranges and rely on their
replicateQueue to then rebalance them.

This had a few issues:

  1. It meant that even when decommissioning a mostly empty node, our
    worst case lower bound for marking that node fully decommissioned was
    one full scanner interval (which is 10 minutes by default).
  2. If the replicateQueue ran into an error while rebalancing a
    decommissioning replica (see #79266, "kvserver: nodes flapping on their
    liveness can stall cluster recovery operations", for instance), it would
    only retry that replica after either one full scanner interval or after
    the purgatory interval. This meant that decommissioning could take
    excessively long towards the tail end of the process.

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be DECOMMISSIONING.
This callback spins up an async task that will first proactively enqueue
all of the decommissioning node's ranges (that have a replica on the
local node) into the local node's replicateQueues. Then, this task will
periodically nudge the decommissioning node's straggling replicas in
order to requeue them (to alleviate (2) from above).

All this is managed by a lightweight decommissionMonitor, which is
responsible for managing the lifecycle of these async tasks.
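
For illustration only, here is a minimal, self-contained Go sketch of the flow described above. Apart from the decommissionMonitor name, every identifier, signature, and interval below is a hypothetical stand-in rather than the actual CockroachDB implementation:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// NodeID and Replica are stand-ins for the real types.
type NodeID int

type Replica struct{ RangeID int }

// decommissionMonitor owns one nudger task per decommissioning node.
type decommissionMonitor struct {
    mu      sync.Mutex
    cancels map[NodeID]context.CancelFunc

    // replicasOn returns the local replicas whose range also has a replica
    // on the given (decommissioning) node.
    replicasOn func(NodeID) []Replica
    // enqueue hands a replica to the local replicateQueue for rebalancing.
    enqueue func(Replica)
}

// onDecommissioning is the idempotent callback invoked whenever a node is
// observed (e.g. via gossip) to be DECOMMISSIONING. Repeat invocations for
// the same node are no-ops while its nudger task is still running.
func (m *decommissionMonitor) onDecommissioning(ctx context.Context, n NodeID) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if _, ok := m.cancels[n]; ok {
        return
    }
    ctx, cancel := context.WithCancel(ctx)
    m.cancels[n] = cancel
    go m.nudge(ctx, n)
}

// nudge proactively enqueues the decommissioning node's replicas right away,
// then periodically re-enqueues any stragglers until cancelled.
func (m *decommissionMonitor) nudge(ctx context.Context, n NodeID) {
    ticker := time.NewTicker(time.Minute) // illustrative nudge interval
    defer ticker.Stop()
    for {
        for _, r := range m.replicasOn(n) {
            m.enqueue(r)
        }
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}

// onDecommissioned tears the nudger task down once the node is fully
// decommissioned (or recommissioned).
func (m *decommissionMonitor) onDecommissioned(n NodeID) {
    m.mu.Lock()
    defer m.mu.Unlock()
    if cancel, ok := m.cancels[n]; ok {
        cancel()
        delete(m.cancels, n)
    }
}

func main() {
    m := &decommissionMonitor{
        cancels:    map[NodeID]context.CancelFunc{},
        replicasOn: func(NodeID) []Replica { return []Replica{{RangeID: 7}} },
        enqueue:    func(r Replica) { fmt.Println("enqueued range", r.RangeID) },
    }
    m.onDecommissioning(context.Background(), 4) // n4 is DECOMMISSIONING
    time.Sleep(10 * time.Millisecond)            // let the first pass run
    m.onDecommissioned(4)
}

The map of cancel functions is what makes the callback idempotent: a repeat notification for a node whose task is still running returns immediately instead of spawning another goroutine.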

Note: This PR is an alternative to, but subsumes, #80695. The main
difference between this patch and #80695 is that, here, every node that
learns about a decommissioning node will spin up a local nudger goroutine
for it, whereas in #80695, only the node that sent the DecommissionRequest
spins up a nudger goroutine, which then sends RPCs to other nodes in the
system to enqueue the decommissioning node's replicas.

Resolves #79453

Release note: None

@cockroach-teamcity (Member): This change is Reviewable

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 3 times, most recently from 88ac70e to d54993c on May 2, 2022 17:34
@aayushshah15 (Contributor, Author) commented May 2, 2022

Comparing master vs. this patch on a 6-node cluster with the TPC-C 200-warehouse dataset:

master

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";
0.17s user 0.15s system 0% cpu 3:47.34 total

this patch

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";
0.07s user 0.02s system 0% cpu 48.124 total
Repro steps (`rp` is an alias for `roachprod`):
USER=aayushs
CLUSTER=${USER}-decommission-test;
rp destroy $CLUSTER
rp create $CLUSTER --nodes 6
./build/builder.sh mkrelease
rp put $CLUSTER ./cockroach-linux-2.6.32-gnu-amd64 cockroach
rp start $CLUSTER

rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='1GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='1GiB'";
rp ssh $CLUSTER:1 -- './cockroach workload fixtures import tpcc --warehouses=200 {pgurl:1} --checks=false'

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 2 times, most recently from 1a7b3d2 to 9a81a3b on May 2, 2022 17:56
@aayushshah15 marked this pull request as ready for review May 2, 2022 17:58
@aayushshah15 requested reviews from teams as code owners May 2, 2022 17:58
@aayushshah15 requested reviews from shermanCRL, stevendanna, nvanbenschoten, AlexTalks and shralex and removed the request for a team, shermanCRL and stevendanna May 2, 2022 17:58
@aayushshah15 (Contributor, Author) commented:

Hey @AlexTalks, this is the PR that came up in today's retro. Since you're going to be leading a lot of the decommissioning work next release, I'd appreciate any thoughts you have on this. Please let me know if any of the motivations behind this patch are unclear.

@aayushshah15 (Contributor, Author) commented:

I ran a slightly larger test comparing decommissioning a node from a cluster running this patch vs. master, and the results look promising.

Running:

USER=aayushs
CLUSTER=${USER}-decommission-test;
rp destroy $CLUSTER
rp create $CLUSTER --nodes 7 --gce-machine-type=n1-standard-16
rp stage $CLUSTER cockroach
rp start $CLUSTER

rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='1GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='1GiB'";
rp ssh $CLUSTER:7 -- './cockroach workload fixtures import tpcc --warehouses=4000 {pgurl:1} --checks=false'

time rp ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self";

On master:

roachprod ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self"  0.76s user 0.48s system 0% cpu 13:00.18 total

On this patch:

roachprod ssh $CLUSTER:1 -- "./cockroach node decommission --insecure --self"  0.31s user 0.45s system 0% cpu 6:34.53 total

@aayushshah15 force-pushed the 20220429_reactToDecommissioningNode branch 2 times, most recently from 323e720 to 3eb3179 on May 3, 2022 15:57
@aayushshah15 (Contributor, Author) commented:

Closing in favor of #80993

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 4, 2022
… replicas

Note: This patch implements a subset of cockroachdb#80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst
case lower bound for marking that node fully decommissioned was _one
full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
When it is run, the callback enqueues all the replicas on the local
stores that are on ranges that also have replicas on the decommissioning
node.

Release note: None
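
For illustration, here is a minimal Go sketch of the enqueue step this commit message describes: walk the local replicas and hand any range that also has a replica on the decommissioning node to the replicate queue. Every identifier below is a hypothetical stand-in, not the real CockroachDB types or API:

package main

import "fmt"

type NodeID int

// RangeDesc is a stand-in for a range descriptor: the range ID plus the
// nodes currently holding a replica of it.
type RangeDesc struct {
    RangeID int
    Nodes   []NodeID
}

// enqueueReplicasFor hands every local range that also has a replica on the
// decommissioning node to the (stubbed) replicate queue.
func enqueueReplicasFor(decommissioning NodeID, local []RangeDesc, enqueue func(RangeDesc)) {
    for _, desc := range local {
        for _, n := range desc.Nodes {
            if n == decommissioning {
                enqueue(desc)
                break
            }
        }
    }
}

func main() {
    local := []RangeDesc{
        {RangeID: 1, Nodes: []NodeID{1, 2, 4}}, // overlaps the decommissioning node
        {RangeID: 2, Nodes: []NodeID{1, 2, 3}}, // untouched
    }
    // n4 is the DECOMMISSIONING node; only range 1 gets enqueued.
    enqueueReplicasFor(4, local, func(d RangeDesc) {
        fmt.Println("enqueued range", d.RangeID)
    })
}

The stubbed enqueue callback stands in for handing the replica to the local store's replicateQueue, as described above.
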
aayushshah15 added commits with the same message (later revisions updating only the release note) that referenced this pull request May 5, May 17, Jun 7, and Jun 8, 2022
craig bot pushed a commit that referenced this pull request Jun 8, 2022
80993: server: react to decommissioning nodes by proactively enqueuing their replicas r=aayushshah15 a=aayushshah15

Note: This patch implements a subset of #80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their `replicaScanner` to gradually
run into the decommissioning node's ranges and rely on their
`replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst
case lower bound for marking that node fully decommissioned was _one
full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be `DECOMMISSIONING`.
When it is run, the callback enqueues all the replicas on the local
stores that are on ranges that also have replicas on the decommissioning
node. Note that when nodes in the system restart, they'll re-invoke this callback
for any already `DECOMMISSIONING` node. 

Resolves #79453

Release note (performance improvement): Decommissioning should now be
substantially faster, particularly for small to moderately loaded nodes.


Co-authored-by: Aayush Shah <[email protected]>
aayushshah15 added a commit with the same message that referenced this pull request Jun 9, 2022