
server: react to decommissioning nodes by proactively enqueuing their replicas #80993

Conversation

@aayushshah15 (Contributor) commented May 4, 2022

Note: This patch implements a subset of #80836

Previously, when a node was marked DECOMMISSIONING, other nodes in the
system would learn about it via gossip but wouldn't do much in the way
of reacting to it. They'd rely on their replicaScanner to gradually
run into the decommissioning node's ranges and rely on their
replicateQueue to then rebalance them.

This meant that even when decommissioning a mostly empty node, our worst
case lower bound for marking that node fully decommissioned was one full
scanner interval (which is 10 minutes by default).

This patch improves this behavior by installing an idempotent callback
that is invoked every time a node is detected to be DECOMMISSIONING.
When it is run, the callback enqueues all the replicas on the local
stores that are on ranges that also have replicas on the decommissioning
node. Note that when nodes in the system restart, they'll re-invoke this callback
for any already DECOMMISSIONING node.
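
To make the mechanism concrete, here is a minimal, self-contained Go sketch of the shape such a callback could take. Every name in it (NodeID, RangeDesc, Store, OnDecommissioning, enqueueForRebalance) is a hypothetical stand-in, not CockroachDB's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// NodeID and RangeDesc are simplified, hypothetical stand-ins for the real types.
type NodeID int

type RangeDesc struct {
	RangeID  int
	Replicas []NodeID // node IDs holding a replica of this range
}

// Store is a toy local store that knows about the ranges it has replicas for.
type Store struct {
	Ranges []RangeDesc
}

// enqueueForRebalance stands in for handing a replica to the replicate queue.
func (s *Store) enqueueForRebalance(ctx context.Context, r RangeDesc) {
	fmt.Printf("enqueued r%d for rebalancing\n", r.RangeID)
}

// OnDecommissioning mirrors the callback described above: whenever a node is
// observed (via gossip) to be DECOMMISSIONING, enqueue every local replica
// whose range also has a replica on that node. It is written to be safe to
// re-invoke, matching the idempotence described in the PR.
func (s *Store) OnDecommissioning(ctx context.Context, decommissioning NodeID) {
	for _, r := range s.Ranges {
		for _, n := range r.Replicas {
			if n == decommissioning {
				s.enqueueForRebalance(ctx, r)
				break
			}
		}
	}
}

func main() {
	s := &Store{Ranges: []RangeDesc{
		{RangeID: 1, Replicas: []NodeID{1, 2, 3}},
		{RangeID: 2, Replicas: []NodeID{1, 4, 5}},
	}}
	s.OnDecommissioning(context.Background(), 3) // only r1 is enqueued
}
```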

Resolves #79453

Release note (performance improvement): Decommissioning should now be
substantially faster, particularly for small to moderately loaded nodes.

@cockroach-teamcity (Member)

This change is Reviewable

@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch 3 times, most recently from d9c9583 to a6237dc on May 4, 2022 19:15
@aayushshah15 requested a review from AlexTalks May 4, 2022 19:17
@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch from a6237dc to cb65cf0 on May 4, 2022 19:18
@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch 2 times, most recently from 99477cf to 9af9db5 on May 4, 2022 21:20
@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch 2 times, most recently from 68dfd78 to 6a5a854 on May 4, 2022 21:30
@aayushshah15 marked this pull request as ready for review May 4, 2022 21:50
@aayushshah15 requested review from a team as code owners May 4, 2022 21:50
@aayushshah15 requested a review from a team May 4, 2022 21:50
@aayushshah15 requested a review from a team as a code owner May 4, 2022 21:50
@aayushshah15 requested review from msbutler and kvoli and removed request for a team and msbutler May 4, 2022 21:50
@AlexTalks (Contributor) left a comment


Thanks for doing this! This mostly looks good but I would like to take another quick glance tomorrow - will stamp then.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/store.go line 3432 at r1 (raw file):

		// to ensure that these replicas are priority-ordered first.
		if skipShouldQueue {
			queue.AddAsync(ctx, repl, 1e6 /* prio */)

Nit: can we make this a constant if we don't have one already?
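
For illustration, the change being asked for might look roughly like the fragment below, mirroring the quoted diff; the constant name is hypothetical:

```go
// replicaDecommissioningPriority is a hypothetical name for the constant being
// asked for here; the value is the 1e6 from the quoted diff, chosen so that
// these replicas sort ahead of regularly scanned ones.
const replicaDecommissioningPriority = 1e6

// to ensure that these replicas are priority-ordered first.
if skipShouldQueue {
	queue.AddAsync(ctx, repl, replicaDecommissioningPriority)
}
```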

@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch from 6a5a854 to e2f8dc2 on May 5, 2022 19:32
@aayushshah15 (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @nvanbenschoten)


pkg/kv/kvserver/store.go line 3432 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Nit: can we make this a constant if we don't have one already?

Done.

@kvoli (Collaborator) left a comment


Nice, looks good to me! Just one comment on adding observability if it doesn't already exist; perhaps that belongs in a separate patch.

Reviewed 15 of 15 files at r1, 6 of 7 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/server/decommission.go line 71 at r3 (raw file):

						return true /* wantMore */
					}
					_, processErr, enqueueErr := store.Enqueue(

It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store, incremented when queued, and decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were remainders.

You could also tag it with the NodeID that is being decommissioned, in the case where there are multiple and it may be necessary to distinguish.

I think a gauge would work here, backed by an atomic counter?
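
For illustration, a rough sketch of the kind of gauge being suggested follows: an atomic counter incremented when a replica is enqueued and decremented when it is successfully processed. The type and method names are hypothetical, and this is not CockroachDB's metric package:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decommissioningReplicasGauge is a hypothetical per-store gauge: incremented
// when a replica is enqueued on behalf of a decommissioning node, decremented
// when it is successfully processed, and reset if the node finishes
// decommissioning with leftovers.
type decommissioningReplicasGauge struct {
	count atomic.Int64
}

func (g *decommissioningReplicasGauge) Inc()         { g.count.Add(1) }
func (g *decommissioningReplicasGauge) Dec()         { g.count.Add(-1) }
func (g *decommissioningReplicasGauge) Reset()       { g.count.Store(0) }
func (g *decommissioningReplicasGauge) Value() int64 { return g.count.Load() }

func main() {
	var g decommissioningReplicasGauge
	g.Inc()                // replica enqueued
	g.Inc()                // another replica enqueued
	g.Dec()                // one replica successfully rebalanced
	fmt.Println(g.Value()) // prints 1
}
```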

@kvoli (Collaborator) left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)

@aayushshah15 (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @AlexTalks, @kvoli, and @nvanbenschoten)


pkg/server/decommission.go line 71 at r3 (raw file):

Previously, kvoli (Austen) wrote…

It would be nice to also have a metric that tracks the count of replicas that are currently being decommissioned by leaseholders on this store. This could be reported per store, incremented when queued, and decremented on success (assuming this doesn't already exist). onNodeDecommissioned could hard-clear the counter if there were remainders.

You could also tag it with the NodeID that is being decommissioned, in the case where there are multiple and it may be necessary to distinguish.

I think a gauge would work here, backed by an atomic counter?

I do think it would be really nice to have store-level metrics for how many ranges they're supposed to rebalance for any given decommissioning node, but I don't think this callback should be the place where we do that. It would be easy for a gauge maintained here to become inaccurate, for at least the following 2 reasons:

  1. This callback doesn't "cover all cases" -- i.e. we're not guaranteed to enqueue all ranges that have a replica on the decommissioning node because any of the ranges enqueued here could get split by the time we get around to processing them. Similarly, any ranges enqueued here could get their lease transferred away to a different store by the time this store gets around to processing them.
  2. This callback also only enqueues these replicas async, and doesn't wait for them to be processed.

We should think about the sort of metrics collection you're referring to inside the replicateQueue itself, and I think it deserves its own patch.

What do you think?

@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch from 7a08906 to 66d9f96 on May 19, 2022 23:12
@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch 2 times, most recently from 8bddc39 to c848e3d on June 7, 2022 18:04
@aayushshah15 (Contributor, Author)

Merging this to close this out and not affect @AlexTalks' benchmarking + future work.

TFTRs!

bors r+

@aayushshah15 (Contributor, Author)

bors r-

@craig (bot) commented Jun 7, 2022

Canceled.

@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch from c848e3d to a1d4d6f on June 8, 2022 15:49
@aayushshah15 force-pushed the 20220503_proactivelyEnqueueReplicasWhenDecommissioning branch from a1d4d6f to eeb7236 on June 8, 2022 16:26
@aayushshah15 (Contributor, Author)

bors r+

@craig (bot) commented Jun 8, 2022

Build failed (retrying...):

@craig (bot) commented Jun 8, 2022

Build succeeded:

@craig (bot) merged commit e9456ba into cockroachdb:master on Jun 8, 2022
@blathers-crl (bot) commented Jun 8, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ca59db4 to blathers/backport-release-22.1-80993: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 9, 2022
This commit fixes a bug from cockroachdb#80993. Without this commit, nodes
might re-run the callback to enqueue a decommissioning node's ranges
into their replicate queues if they received a gossip update from that
decommissioning node that was perceived to be newer. Re-running this
callback on every newer gossip update from a decommissioning node will
be too expensive for nodes with a lot of replicas.

Release note: None
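
A guard with that effect might look roughly like the sketch below: remember which nodes the callback has already fired for and skip repeats. The names are hypothetical, and this is not the actual change from the follow-up PR:

```go
package main

import (
	"fmt"
	"sync"
)

type NodeID int

// decommissionNotifier invokes onDecommissioning at most once per node, no
// matter how many gossip updates report that node as DECOMMISSIONING.
type decommissionNotifier struct {
	mu                sync.Mutex
	notified          map[NodeID]bool
	onDecommissioning func(NodeID)
}

func (d *decommissionNotifier) maybeNotify(n NodeID) {
	d.mu.Lock()
	already := d.notified[n]
	if !already {
		d.notified[n] = true
	}
	d.mu.Unlock()
	if !already {
		d.onDecommissioning(n)
	}
}

func main() {
	d := &decommissionNotifier{
		notified:          map[NodeID]bool{},
		onDecommissioning: func(n NodeID) { fmt.Printf("enqueuing replicas for n%d\n", n) },
	}
	d.maybeNotify(3) // first time: runs the callback
	d.maybeNotify(3) // newer gossip update for the same node: no-op
}
```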
craig bot pushed a commit that referenced this pull request Jun 9, 2022
82555: sql: fix CREATE TABLE LIKE with implicit pk r=jasonmchan a=jasonmchan

Previously, `CREATE TABLE LIKE` copied implicitly created columns (e.g.
for the rowid default primary key and hash sharded index). Defaults for
some of these columns were not properly copied over in some cases,
causing unexpected constraint violations to surface.

This commit fixes this by skipping copying such columns; instead, they
will be freshly created. Followup work is needed for REGIONAL BY ROW.

Fixes #82401

Release note: None

82569: sql/schemachanger/rel,scplan/rules: add support for rules, _; adopt r=ajwerner a=ajwerner

The first commit extends the `rel` language with support for rules and `_` and adopts it for the dep rules. The second commit contains further cleanup and adopts it in the op rules.

Release note: None

82652: ccl/sqlproxyccl: fix inaccurate CurConnCount metric due to goroutine leak r=JeffSwenson a=jaylim-crl

Previously, there was a possibility that a processor could return from
resuming because the client's connection was closed _before_ waitResumed even
had the chance to wake up and check on the resumed field. When that happened,
the connection goroutine would be blocked forever, and the CurConnCount metric
would never be decremented, even if the connection had already been terminated.

When the client's connection was closed, the forwarder's context will be
cancelled as well. The ideal behavior would be to terminate all waiters when
that happens, but the current code does not do that. This commit fixes that
issue by adding a new closed state to the processors, and ensuring that the
processor is closed whenever resume returns with an error. waitResumed can
then check on this state before going back to wait.

Release note: None

82683: server: don't re-run node decommissioning callback r=aayushshah15 a=aayushshah15

This commit fixes a bug from #80993. Without this commit, nodes
might re-run the callback to enqueue a decommissioning node's ranges
into their replicate queues if they received a gossip update from that
decommissioning node that was perceived to be newer. Re-running this
callback on every newer gossip update from a decommissioning node will
be too expensive for nodes with a lot of replicas.

Release note: None

Co-authored-by: Jason Chan <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 10, 2022
This commit fixes a bug from cockroachdb#80993. Without this commit, nodes
might re-run the callback to enqueue a decommissioning node's ranges
into their replicate queues if they received a gossip update from that
decommissioning node that was perceived to be newer. Re-running this
callback on every newer gossip update from a decommissioning node will
be too expensive for nodes with a lot of replicas.

Release note: None
craig bot pushed a commit that referenced this pull request Jun 12, 2022
81005: kvserver: retry failures to rebalance decommissioning replicas r=aayushshah15 a=aayushshah15

Related to #80993
Relates to #79453


This commit makes it such that failures to rebalance replicas on
decommissioning nodes no longer move the replica out of the
replicateQueue as they previously used to. Instead, these failures now
put these replicas into the replicateQueue's purgatory, which will retry
these replicas every minute.

All this is intended to improve the speed of decommissioning towards
its tail end, since previously, failures to rebalance these replicas
meant that they were only retried after about 10 minutes.

Release note: None


Co-authored-by: Aayush Shah <[email protected]>
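
The retry behavior described in #81005 above can be illustrated with a small sketch, under the assumption of a simple map-backed purgatory retried on a timer; all names here are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// purgatory is a toy version of the retry mechanism described above: replicas
// whose rebalance attempt failed are parked here and retried on a timer
// instead of waiting for the next full scanner pass.
type purgatory struct {
	pending map[int]bool            // range IDs awaiting retry
	process func(rangeID int) error // attempts the rebalance again
}

func (p *purgatory) add(rangeID int) { p.pending[rangeID] = true }

// retryLoop re-processes everything in purgatory once per interval, dropping
// ranges that now succeed. rounds bounds the loop so the example terminates.
func (p *purgatory) retryLoop(interval time.Duration, rounds int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < rounds; i++ {
		<-ticker.C
		for id := range p.pending {
			if err := p.process(id); err == nil {
				delete(p.pending, id)
			}
		}
	}
}

func main() {
	attempts := 0
	p := &purgatory{
		pending: map[int]bool{},
		process: func(rangeID int) error {
			attempts++
			if attempts < 2 {
				return fmt.Errorf("r%d: no valid rebalance target yet", rangeID)
			}
			fmt.Printf("r%d rebalanced on retry\n", rangeID)
			return nil
		},
	}
	p.add(7)
	p.retryLoop(10*time.Millisecond, 3)
}
```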