server: react to decommissioning nodes by proactively enqueuing their replicas #80993
Conversation
Force-pushed from d9c9583 to a6237dc.
Force-pushed from a6237dc to cb65cf0.
Force-pushed from 99477cf to 9af9db5.
Force-pushed from 68dfd78 to 6a5a854.
Thanks for doing this! This mostly looks good but I would like to take another quick glance tomorrow - will stamp then.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @kvoli, and @nvanbenschoten)
pkg/kv/kvserver/store.go
line 3432 at r1 (raw file):
```go
// to ensure that these replicas are priority-ordered first.
if skipShouldQueue {
	queue.AddAsync(ctx, repl, 1e6 /* prio */)
```
Nit: can we make this a constant if we don't have one already?
Force-pushed from 6a5a854 to e2f8dc2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)
pkg/kv/kvserver/store.go
line 3432 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Nit: can we make this a constant if we don't have one already?
Done.
Nice, looks good to me! Just one comment on adding observability if it doesn't already exist - perhaps a separate patch.
Reviewed 15 of 15 files at r1, 6 of 7 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/server/decommission.go
line 71 at r3 (raw file):
```go
	return true /* wantMore */
}
_, processErr, enqueueErr := store.Enqueue(
```
It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store and incremented when queued, decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were any remainders.
You could also tag it with the NodeID that is being decommissioned, in case there are multiple and it becomes necessary to distinguish between them.
I think a gauge would work here, backed by an atomic counter?
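To make the suggestion concrete, here is a minimal sketch of such a gauge backed by an atomic counter. It is kept self-contained rather than wired into CockroachDB's actual `metric` package, and all type and method names are hypothetical, not what this patch (or a follow-up) would necessarily use:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decommissioningReplicasGauge is a hypothetical per-store gauge counting
// replicas this store has enqueued on behalf of a decommissioning node.
type decommissioningReplicasGauge struct {
	count int64
}

// onEnqueued is called when a replica is queued because of a
// decommissioning node.
func (g *decommissioningReplicasGauge) onEnqueued() { atomic.AddInt64(&g.count, 1) }

// onProcessed is called when the replicate queue successfully rebalances
// one of those replicas away.
func (g *decommissioningReplicasGauge) onProcessed() { atomic.AddInt64(&g.count, -1) }

// onNodeDecommissioned hard-clears any remainder once the node is fully
// decommissioned, as suggested above.
func (g *decommissioningReplicasGauge) onNodeDecommissioned() { atomic.StoreInt64(&g.count, 0) }

func (g *decommissioningReplicasGauge) value() int64 { return atomic.LoadInt64(&g.count) }

func main() {
	var g decommissioningReplicasGauge
	g.onEnqueued()
	g.onEnqueued()
	g.onProcessed()
	fmt.Println("in-flight decommissioning replicas:", g.value()) // 1
	g.onNodeDecommissioned()
	fmt.Println("after node decommissioned:", g.value()) // 0
}
```

The atomic counter keeps the enqueue path lock-free, and the hard clear on completion is what keeps a best-effort counter from drifting permanently.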
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)
pkg/server/decommission.go
line 71 at r3 (raw file):
Previously, kvoli (Austen) wrote…
It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store and incremented when queued, decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were any remainders.
You could also tag it with the NodeID that is being decommissioned, in case there are multiple and it becomes necessary to distinguish between them.
I think a gauge would work here, backed by an atomic counter?
I do think it would be really nice to have store-level metrics for how many ranges they're supposed to rebalance for any given decommissioning node, but I don't think this callback should be the place where we do that. It would be easy for a gauge maintained here to become inaccurate, for at least the following 2 reasons:
- This callback doesn't "cover all cases" -- i.e. we're not guaranteed to enqueue all ranges that have a replica on the decommissioning node because any of the ranges enqueued here could get split by the time we get around to processing them. Similarly, any ranges enqueued here could get their lease transferred away to a different store by the time this store gets around to processing them.
- This callback also only enqueues these replicas async, and doesn't wait for them to be processed.
We should think about the sort of metrics collection you're referring to inside the replicateQueue itself, and I think it deserves its own patch.
What do you think?
Force-pushed from 7a08906 to 66d9f96.
Force-pushed from 8bddc39 to c848e3d.
Merging this to close this out and not affect @AlexTalks' benchmarking + future work. TFTRs!
bors r+
bors r-
Canceled.
Force-pushed from c848e3d to a1d4d6f.
… replicas

Note: This patch implements a subset of cockroachdb#80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When it is run, the callback enqueues all the replicas on the local stores that are on ranges that also have replicas on the decommissioning node.

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.
Force-pushed from a1d4d6f to eeb7236.
bors r+
Build failed (retrying...)
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
- You might need to create your backport manually using the backport tool.

error creating merge commit from ca59db4 to blathers/backport-release-22.1-80993: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This commit fixes a bug from cockroachdb#80993. Without this commit, nodes might re-run the callback to enqueue a decommissioning node's ranges into their replicate queues if they received a gossip update from that decommissioning node that was perceived to be newer. Re-running this callback on every newer gossip update from a decommissioning node will be too expensive for nodes with a lot of replicas. Release note: None
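A rough illustration of the guard that follow-up fix describes (this is not the actual cockroachdb code): remember which node IDs the callback has already fired for, so later gossip updates from the same decommissioning node don't re-run the expensive enqueue pass. All identifiers below are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

type nodeID int

// decommissionNotifier invokes its callback at most once per decommissioning
// node, no matter how many gossip updates arrive for that node.
type decommissionNotifier struct {
	mu                sync.Mutex
	notified          map[nodeID]bool
	onDecommissioning func(nodeID)
}

func (n *decommissionNotifier) maybeNotify(id nodeID) {
	n.mu.Lock()
	already := n.notified[id]
	if !already {
		n.notified[id] = true
	}
	n.mu.Unlock()
	if already {
		// Gossip re-delivered a newer record for a node we've already
		// reacted to; skip the expensive enqueue pass.
		return
	}
	n.onDecommissioning(id)
}

func main() {
	n := &decommissionNotifier{
		notified: map[nodeID]bool{},
		onDecommissioning: func(id nodeID) {
			fmt.Printf("enqueuing replicas overlapping n%d\n", id)
		},
	}
	// Repeated gossip updates for the same decommissioning node trigger
	// the enqueue pass only once.
	n.maybeNotify(4)
	n.maybeNotify(4)
	n.maybeNotify(4)
}
```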
82555: sql: fix CREATE TABLE LIKE with implicit pk r=jasonmchan a=jasonmchan

Previously, `CREATE TABLE LIKE` copied implicitly created columns (e.g. for the rowid default primary key and hash-sharded index). Defaults for some of these columns were not properly copied over in some cases, causing unexpected constraint violations to surface. This commit fixes this by skipping copying such columns; instead, they will be freshly created. Follow-up work is needed for REGIONAL BY ROW.

Fixes #82401

Release note: None

82569: sql/schemachanger/rel,scplan/rules: add support for rules, _; adopt r=ajwerner a=ajwerner

The first commit extends the `rel` language with support for rules and `_` and adopts it for the dep rules. The second commit contains further cleanup and adopts it in the op rules.

Release note: None

82652: ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak r=JeffSwenson a=jaylim-crl

Previously, there was a possibility where a processor could return from resuming because the client's connection was closed _before_ waitResumed even had the chance to wake up to check on the resumed field. When that happened, the connection goroutine would be blocked forever, and the CurConnCount metric would never be decremented, even if the connection had already been terminated. When the client's connection was closed, the forwarder's context was cancelled as well. The ideal behavior would be to terminate all waiters when that happens, but the current code did not do that. This commit fixes that issue by adding a new closed state to the processors, and ensuring that the processor is closed whenever resume returns with an error. waitResumed can then check on this state before going back to wait.

Release note: None

82683: server: don't re-run node decommissioning callback r=aayushshah15 a=aayushshah15

This commit fixes a bug from #80993. Without this commit, nodes might re-run the callback to enqueue a decommissioning node's ranges into their replicate queues if they received a gossip update from that decommissioning node that was perceived to be newer. Re-running this callback on every newer gossip update from a decommissioning node will be too expensive for nodes with a lot of replicas.

Release note: None

Co-authored-by: Jason Chan <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
81005: kvserver: retry failures to rebalance decommissioning replicas r=aayushshah15 a=aayushshah15

Related to #80993
Relates to #79453

This commit makes it such that failures to rebalance replicas on decommissioning nodes no longer move the replica out of the replicateQueue as they previously used to. Instead, these failures now put these replicas into the replicateQueue's purgatory, which will retry these replicas every minute.

All this is intended to improve the speed of decommissioning towards its tail end, since previously, failures to rebalance these replicas meant that they were only retried after about 10 minutes.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
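The purgatory idea described above, reduced to a toy sketch (this is not the replicateQueue's actual purgatory code, and every name here is illustrative): items whose processing fails are parked and retried on a short timer instead of being forgotten until the next full scan.

```go
package main

import (
	"fmt"
	"time"
)

// toyQueue parks items whose processing failed and retries them on a fixed
// interval, loosely mirroring the purgatory behavior described above.
type toyQueue struct {
	process   func(item string) error
	purgatory map[string]error
}

func (q *toyQueue) maybeAdd(item string) {
	if err := q.process(item); err != nil {
		// Instead of dropping the item until the next scanner pass,
		// park it so the retry loop picks it up again soon.
		q.purgatory[item] = err
	}
}

func (q *toyQueue) retryLoop(interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		time.Sleep(interval)
		for item := range q.purgatory {
			if err := q.process(item); err == nil {
				delete(q.purgatory, item)
			}
		}
	}
}

func main() {
	attempts := 0
	q := &toyQueue{
		purgatory: map[string]error{},
		process: func(item string) error {
			attempts++
			if attempts < 3 {
				return fmt.Errorf("no rebalance target for %s yet", item)
			}
			fmt.Println("rebalanced", item)
			return nil
		},
	}
	q.maybeAdd("r42") // fails, lands in purgatory
	q.retryLoop(10*time.Millisecond, 5)
}
```

The point of the change is the retry cadence: a parked replica is reconsidered every minute rather than waiting on the ~10 minute scanner interval.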
Note: This patch implements a subset of #80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was one full scanner interval (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When it is run, the callback enqueues all the replicas on the local stores that are on ranges that also have replicas on the decommissioning node. Note that when nodes in the system restart, they'll re-invoke this callback for any already-`DECOMMISSIONING` node.

Resolves #79453

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.
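For readers skimming the description, here is a heavily simplified sketch of the enqueue pass it describes. The types, field names, and the print-instead-of-enqueue body are illustrative stand-ins, not the kvserver store API this patch actually uses:

```go
package main

import "fmt"

type nodeID int

// replica is a stand-in for a local replica and the set of nodes that hold
// replicas of its range.
type replica struct {
	rangeID      int
	replicaNodes []nodeID
}

type store struct {
	name     string
	replicas []replica
}

// enqueueReplicasForDecommission mirrors the described behavior: for each
// local store, enqueue every replica whose range also has a replica on the
// decommissioning node, so the replicate queue rebalances it promptly
// instead of waiting for the next full scanner interval.
func enqueueReplicasForDecommission(stores []store, decommissioning nodeID) {
	for _, s := range stores {
		for _, r := range s.replicas {
			for _, n := range r.replicaNodes {
				if n == decommissioning {
					// In the real patch this is an async, high-priority
					// add to the replicate queue.
					fmt.Printf("%s: enqueue r%d\n", s.name, r.rangeID)
					break
				}
			}
		}
	}
}

func main() {
	stores := []store{
		{name: "s1", replicas: []replica{
			{rangeID: 1, replicaNodes: []nodeID{1, 2, 4}},
			{rangeID: 7, replicaNodes: []nodeID{1, 3, 5}},
		}},
	}
	// Node 4 was just marked DECOMMISSIONING: only r1 overlaps it.
	enqueueReplicasForDecommission(stores, 4)
}
```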