server: proactively rebalance decommissioning nodes' replicas #80836
Conversation
Force-pushed from 88ac70e to d54993c.
Force-pushed from 1a7b3d2 to 9a81a3b.
Hey @AlexTalks, this is the PR that came up in today's retro. Since you're going to be leading a lot of the decommissioning work next release, I'd appreciate any thoughts you have on this. Please let me know if any of the motivations behind this patch are unclear.
I ran a slightly larger test comparing decommissioning a node from a cluster containing this patch vs. one running master. Running:
On master:
On this patch:
Force-pushed from 323e720 to 3eb3179.
Force-pushed from 3eb3179 to 6bb3696.
Closing in favor of #80993
80993: server: react to decommissioning nodes by proactively enqueuing their replicas r=aayushshah15 a=aayushshah15

Note: This patch implements a subset of #80836.

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them. This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When it is run, the callback enqueues all the replicas on the local stores that are on ranges that also have replicas on the decommissioning node. Note that when nodes in the system restart, they'll re-invoke this callback for any already `DECOMMISSIONING` node.

Resolves #79453

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.

Co-authored-by: Aayush Shah <[email protected]>
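For illustration, here is a minimal sketch of the callback described above. Every type and function name below (`NodeID`, `Replica`, `Store`, `OnNodeDecommissioning`, etc.) is a hypothetical stand-in rather than CockroachDB's actual kvserver API:

```go
package decommissionsketch

import "context"

// Hypothetical stand-ins for the real kvserver types; none of these are
// CockroachDB's actual interfaces.
type NodeID int32

type Replica interface {
	// ReplicaNodeIDs returns the IDs of the nodes holding a replica of this
	// replica's range.
	ReplicaNodeIDs() []NodeID
}

type Store interface {
	// VisitReplicas calls fn for every replica on this store.
	VisitReplicas(fn func(Replica))
	// EnqueueToReplicateQueue asks the store's replicate queue to process r soon.
	EnqueueToReplicateQueue(ctx context.Context, r Replica)
}

// OnNodeDecommissioning sketches the callback: whenever a node is observed
// (via gossip/liveness) to be DECOMMISSIONING, enqueue every local replica
// whose range also has a replica on that node.
func OnNodeDecommissioning(ctx context.Context, decommissioning NodeID, stores []Store) {
	for _, s := range stores {
		s.VisitReplicas(func(r Replica) {
			for _, n := range r.ReplicaNodeIDs() {
				if n == decommissioning {
					s.EnqueueToReplicateQueue(ctx, r)
					return // this replica is enqueued; move on to the next one
				}
			}
		})
	}
}
```

Because the callback only enqueues work into the existing replicate queues, running it more than once for the same node (e.g. after a restart re-detects an already-`DECOMMISSIONING` node) is harmless, which is what makes it idempotent.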
Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them. This had a few issues:

1. It meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (which is 10 minutes by default).
2. If the `replicateQueue` ran into an error while rebalancing a decommissioning replica (see #79266, "kvserver: nodes flapping on their liveness can stall cluster recovery operations", for instance), it would only retry that replica after either one full scanner interval or after the purgatory interval. This meant that decommissioning could take excessively long towards the tail end of the process.
This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. This callback spins up an async task that will first proactively enqueue all of the decommissioning node's ranges (that have a replica on the local node) into the local node's replicateQueues. Then, this task will periodically nudge the decommissioning node's straggling replicas in order to requeue them (to alleviate issue (2) above).
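As a rough illustration of that "enqueue once, then periodically nudge" loop, here is a sketch in which `enqueueOverlapping` and `numStragglers` are hypothetical hooks standing in for the real replicateQueue plumbing (this is not the PR's actual code):

```go
package decommissionsketch

import (
	"context"
	"time"
)

// nudgeDecommissioning enqueues all overlapping local replicas up front and
// then periodically re-enqueues them for as long as any stragglers remain,
// or until the context is canceled.
func nudgeDecommissioning(
	ctx context.Context,
	interval time.Duration,
	enqueueOverlapping func(context.Context), // enqueue every local replica overlapping the decommissioning node
	numStragglers func(context.Context) int, // overlapping replicas still present locally
) {
	enqueueOverlapping(ctx) // first pass: proactively enqueue everything

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if numStragglers(ctx) == 0 {
				return // nothing left to move off the decommissioning node
			}
			enqueueOverlapping(ctx) // nudge the stragglers back into the queue
		}
	}
}
```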
All this is managed by a lightweight `decommissionMonitor`, which is responsible for managing the lifecycle of these async tasks.
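And a sketch of how such a lightweight monitor might keep the callback idempotent by tracking one nudger task per decommissioning node. Again, these names are illustrative assumptions rather than the PR's actual implementation (`NodeID` is the alias from the first sketch):

```go
package decommissionsketch

import (
	"context"
	"sync"
)

// decommissionMonitor tracks one nudger task per decommissioning node so that
// repeated callback invocations for the same node are no-ops.
type decommissionMonitor struct {
	mu      sync.Mutex
	cancels map[NodeID]context.CancelFunc
}

// onDecommissioning is safe to call any number of times for the same node;
// only the first call spins up a task.
func (m *decommissionMonitor) onDecommissioning(
	ctx context.Context, id NodeID, run func(context.Context),
) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.cancels == nil {
		m.cancels = make(map[NodeID]context.CancelFunc)
	}
	if _, ok := m.cancels[id]; ok {
		return // a task for this node is already running
	}
	taskCtx, cancel := context.WithCancel(ctx)
	m.cancels[id] = cancel
	go func() {
		defer m.finish(id)
		run(taskCtx)
	}()
}

// finish cancels and forgets the task for a node, e.g. once the node has
// fully decommissioned or has been recommissioned.
func (m *decommissionMonitor) finish(id NodeID) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cancel, ok := m.cancels[id]; ok {
		cancel()
		delete(m.cancels, id)
	}
}
```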
Note: This PR is an alternative to, but subsumes, #80695. The main difference between this patch and #80695 is that, here, every node that learns about a decommissioning node will spin up a local nudger goroutine for it, whereas in #80695, only the node that sent the `DecommissionRequest` will spin up a nudger goroutine that then sends RPCs to other nodes in the system to enqueue the decommissioning node's replicas.
Resolves #79453
Release note: None