server: react to decommissioning nodes by proactively enqueuing their replicas #80993
Conversation
Force-pushed from d9c9583 to a6237dc.
Force-pushed from a6237dc to cb65cf0.
Force-pushed from 99477cf to 9af9db5.
Force-pushed from 68dfd78 to 6a5a854.
Thanks for doing this! This mostly looks good but I would like to take another quick glance tomorrow - will stamp then.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @kvoli, and @nvanbenschoten)
pkg/kv/kvserver/store.go
line 3432 at r1 (raw file):
```go
// to ensure that these replicas are priority-ordered first.
if skipShouldQueue {
	queue.AddAsync(ctx, repl, 1e6 /* prio */)
```
Nit: can we make this a constant if we don't have one already?
Force-pushed from 6a5a854 to e2f8dc2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)
pkg/kv/kvserver/store.go
line 3432 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Nit: can we make this a constant if we don't have one already?
Done.
Nice, looks good to me! Just one comment on adding observability if it doesn't already exist - perhaps a separate patch.
Reviewed 15 of 15 files at r1, 6 of 7 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/server/decommission.go
line 71 at r3 (raw file):
```go
	return true /* wantMore */
}
_, processErr, enqueueErr := store.Enqueue(
```
It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store and incremented when queued, decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were any remainders.
You could also tag it with the NodeID that is being decommissioned, in case there are multiple and it becomes necessary to distinguish between them.
I think a gauge would work here, backed by an atomic counter?
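To make the suggestion concrete, here is a minimal sketch of such a gauge backed by an atomic counter. It is kept self-contained rather than wired into CockroachDB's actual `metric` package, and all type and method names are hypothetical, not what this patch (or a follow-up) would necessarily use:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decommissioningReplicasGauge is a hypothetical per-store gauge counting
// replicas this store has enqueued on behalf of a decommissioning node.
type decommissioningReplicasGauge struct {
	count int64
}

// onEnqueued is called when a replica is queued because of a
// decommissioning node.
func (g *decommissioningReplicasGauge) onEnqueued() { atomic.AddInt64(&g.count, 1) }

// onProcessed is called when the replicate queue successfully rebalances
// one of those replicas away.
func (g *decommissioningReplicasGauge) onProcessed() { atomic.AddInt64(&g.count, -1) }

// onNodeDecommissioned hard-clears any remainder once the node is fully
// decommissioned, as suggested above.
func (g *decommissioningReplicasGauge) onNodeDecommissioned() { atomic.StoreInt64(&g.count, 0) }

func (g *decommissioningReplicasGauge) value() int64 { return atomic.LoadInt64(&g.count) }

func main() {
	var g decommissioningReplicasGauge
	g.onEnqueued()
	g.onEnqueued()
	g.onProcessed()
	fmt.Println("in-flight decommissioning replicas:", g.value()) // 1
	g.onNodeDecommissioned()
	fmt.Println("after node decommissioned:", g.value()) // 0
}
```

The atomic counter keeps the enqueue path lock-free, and the hard clear on completion is what keeps a best-effort counter from drifting permanently.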
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)
pkg/server/decommission.go
line 71 at r3 (raw file):
Previously, kvoli (Austen) wrote…
It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store and incremented when queued, decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were any remainders.
You could also tag it with the NodeID that is being decommissioned, in case there are multiple and it becomes necessary to distinguish between them.
I think a gauge would work here, backed by an atomic counter?
I do think it would be really nice to have store-level metrics for how many ranges they're supposed to rebalance for any given decommissioning node, but I don't think this callback should be the place where we do that. It would be easy for a gauge maintained here to become inaccurate, for at least the following 2 reasons:
- This callback doesn't "cover all cases" -- i.e. we're not guaranteed to enqueue all ranges that have a replica on the decommissioning node because any of the ranges enqueued here could get split by the time we get around to processing them. Similarly, any ranges enqueued here could get their lease transferred away to a different store by the time this store gets around to processing them.
- This callback also only enqueues these replicas async, and doesn't wait for them to be processed.
We should think about the sort of metrics collection you're referring to inside the replicateQueue itself, and I think it deserves its own patch.
What do you think?
Force-pushed from 7a08906 to 66d9f96.
Force-pushed from 8bddc39 to c848e3d.
Merging this to close this out and not affect @AlexTalks' benchmarking + future work. TFTRs!
bors r+
bors r-
Canceled.
Force-pushed from c848e3d to a1d4d6f.
… replicas

Note: This patch implements a subset of cockroachdb#80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was _one full scanner interval_ (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When it is run, the callback enqueues all the replicas on the local stores that are on ranges that also have replicas on the decommissioning node.

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.
Force-pushed from a1d4d6f to eeb7236.
bors r+
Build failed (retrying...)
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
- You might need to create your backport manually using the backport tool.

error creating merge commit from ca59db4 to blathers/backport-release-22.1-80993: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This commit fixes a bug from cockroachdb#80993. Without this commit, nodes might re-run the callback to enqueue a decommissioning node's ranges into their replicate queues if they received a gossip update from that decommissioning node that was perceived to be newer. Re-running this callback on every newer gossip update from a decommissioning node will be too expensive for nodes with a lot of replicas. Release note: None
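A rough illustration of the guard that follow-up fix describes (this is not the actual cockroachdb code): remember which node IDs the callback has already fired for, so later gossip updates from the same decommissioning node don't re-run the expensive enqueue pass. All identifiers below are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

type nodeID int

// decommissionNotifier invokes its callback at most once per decommissioning
// node, no matter how many gossip updates arrive for that node.
type decommissionNotifier struct {
	mu                sync.Mutex
	notified          map[nodeID]bool
	onDecommissioning func(nodeID)
}

func (n *decommissionNotifier) maybeNotify(id nodeID) {
	n.mu.Lock()
	already := n.notified[id]
	if !already {
		n.notified[id] = true
	}
	n.mu.Unlock()
	if already {
		// Gossip re-delivered a newer record for a node we've already
		// reacted to; skip the expensive enqueue pass.
		return
	}
	n.onDecommissioning(id)
}

func main() {
	n := &decommissionNotifier{
		notified: map[nodeID]bool{},
		onDecommissioning: func(id nodeID) {
			fmt.Printf("enqueuing replicas overlapping n%d\n", id)
		},
	}
	// Repeated gossip updates for the same decommissioning node trigger
	// the enqueue pass only once.
	n.maybeNotify(4)
	n.maybeNotify(4)
	n.maybeNotify(4)
}
```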
82555: sql: fix CREATE TABLE LIKE with implicit pk r=jasonmchan a=jasonmchan

Previously, `CREATE TABLE LIKE` copied implicitly created columns (e.g. for the rowid default primary key and hash-sharded index). Defaults for some of these columns were not properly copied over in some cases, causing unexpected constraint violations to surface. This commit fixes this by skipping copying such columns; instead, they will be freshly created. Follow-up work is needed for REGIONAL BY ROW.

Fixes #82401

Release note: None

82569: sql/schemachanger/rel,scplan/rules: add support for rules, _; adopt r=ajwerner a=ajwerner

The first commit extends the `rel` language with support for rules and `_` and adopts it for the dep rules. The second commit contains further cleanup and adopts it in the op rules.

Release note: None

82652: ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak r=JeffSwenson a=jaylim-crl

Previously, there was a possibility where a processor could return from resuming because the client's connection was closed _before_ waitResumed even had the chance to wake up to check on the resumed field. When that happened, the connection goroutine would be blocked forever, and the CurConnCount metric would never be decremented, even if the connection had already been terminated. When the client's connection was closed, the forwarder's context was cancelled as well. The ideal behavior would be to terminate all waiters when that happens, but the current code did not do that. This commit fixes that issue by adding a new closed state to the processors, and ensuring that the processor is closed whenever resume returns with an error. waitResumed can then check on this state before going back to wait.

Release note: None

82683: server: don't re-run node decommissioning callback r=aayushshah15 a=aayushshah15

This commit fixes a bug from #80993. Without this commit, nodes might re-run the callback to enqueue a decommissioning node's ranges into their replicate queues if they received a gossip update from that decommissioning node that was perceived to be newer. Re-running this callback on every newer gossip update from a decommissioning node will be too expensive for nodes with a lot of replicas.

Release note: None

Co-authored-by: Jason Chan <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
81005: kvserver: retry failures to rebalance decommissioning replicas r=aayushshah15 a=aayushshah15

Related to #80993
Relates to #79453

This commit makes it such that failures to rebalance replicas on decommissioning nodes no longer move the replica out of the replicateQueue as they previously used to. Instead, these failures now put these replicas into the replicateQueue's purgatory, which will retry these replicas every minute.

All this is intended to improve the speed of decommissioning towards its tail end, since previously, failures to rebalance these replicas meant that they were only retried after about 10 minutes.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
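The purgatory idea described above, reduced to a toy sketch (this is not the replicateQueue's actual purgatory code, and every name here is illustrative): items whose processing fails are parked and retried on a short timer instead of being forgotten until the next full scan.

```go
package main

import (
	"fmt"
	"time"
)

// toyQueue parks items whose processing failed and retries them on a fixed
// interval, loosely mirroring the purgatory behavior described above.
type toyQueue struct {
	process   func(item string) error
	purgatory map[string]error
}

func (q *toyQueue) maybeAdd(item string) {
	if err := q.process(item); err != nil {
		// Instead of dropping the item until the next scanner pass,
		// park it so the retry loop picks it up again soon.
		q.purgatory[item] = err
	}
}

func (q *toyQueue) retryLoop(interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		time.Sleep(interval)
		for item := range q.purgatory {
			if err := q.process(item); err == nil {
				delete(q.purgatory, item)
			}
		}
	}
}

func main() {
	attempts := 0
	q := &toyQueue{
		purgatory: map[string]error{},
		process: func(item string) error {
			attempts++
			if attempts < 3 {
				return fmt.Errorf("no rebalance target for %s yet", item)
			}
			fmt.Println("rebalanced", item)
			return nil
		},
	}
	q.maybeAdd("r42") // fails, lands in purgatory
	q.retryLoop(10*time.Millisecond, 5)
}
```

The point of the change is the retry cadence: a parked replica is reconsidered every minute rather than waiting on the ~10 minute scanner interval.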
Note: This patch implements a subset of #80836

Previously, when a node was marked `DECOMMISSIONING`, other nodes in the system would learn about it via gossip but wouldn't do much in the way of reacting to it. They'd rely on their `replicaScanner` to gradually run into the decommissioning node's ranges and rely on their `replicateQueue` to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst-case lower bound for marking that node fully decommissioned was one full scanner interval (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback that is invoked every time a node is detected to be `DECOMMISSIONING`. When it is run, the callback enqueues all the replicas on the local stores that are on ranges that also have replicas on the decommissioning node. Note that when nodes in the system restart, they'll re-invoke this callback for any already-`DECOMMISSIONING` node.

Resolves #79453

Release note (performance improvement): Decommissioning should now be substantially faster, particularly for small to moderately loaded nodes.
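For readers skimming the description, here is a heavily simplified sketch of the enqueue pass it describes. The types, field names, and the print-instead-of-enqueue body are illustrative stand-ins, not the kvserver store API this patch actually uses:

```go
package main

import "fmt"

type nodeID int

// replica is a stand-in for a local replica and the set of nodes that hold
// replicas of its range.
type replica struct {
	rangeID      int
	replicaNodes []nodeID
}

type store struct {
	name     string
	replicas []replica
}

// enqueueReplicasForDecommission mirrors the described behavior: for each
// local store, enqueue every replica whose range also has a replica on the
// decommissioning node, so the replicate queue rebalances it promptly
// instead of waiting for the next full scanner interval.
func enqueueReplicasForDecommission(stores []store, decommissioning nodeID) {
	for _, s := range stores {
		for _, r := range s.replicas {
			for _, n := range r.replicaNodes {
				if n == decommissioning {
					// In the real patch this is an async, high-priority
					// add to the replicate queue.
					fmt.Printf("%s: enqueue r%d\n", s.name, r.rangeID)
					break
				}
			}
		}
	}
}

func main() {
	stores := []store{
		{name: "s1", replicas: []replica{
			{rangeID: 1, replicaNodes: []nodeID{1, 2, 4}},
			{rangeID: 7, replicaNodes: []nodeID{1, 3, 5}},
		}},
	}
	// Node 4 was just marked DECOMMISSIONING: only r1 overlaps it.
	enqueueReplicasForDecommission(stores, 4)
}
```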