
server: improve visibility of ranges that fail to move during decommissioning #76516

Merged 1 commit on Mar 28, 2022

Conversation

@cameronnunez (Contributor) commented Feb 14, 2022

Fixes #76249. Informs #74158.

This patch makes it so that when a decommission is slow or stalls, the
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality

@cameronnunez cameronnunez requested a review from a team as a code owner February 14, 2022 17:20
@cockroach-teamcity (Member) commented: This change is Reviewable

@cameronnunez cameronnunez marked this pull request as draft February 14, 2022 17:21
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from d2c9778 to 4def6e2 on February 14, 2022 17:24
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 6 times, most recently from 99580fb to 5aee774 on February 25, 2022 23:34
@cameronnunez cameronnunez changed the title from "[WIP] server: improve visibility of ranges that fail to move during decommissioning" to "server: improve visibility of ranges that fail to move during decommissioning" on Feb 25, 2022
@cameronnunez cameronnunez marked this pull request as ready for review February 25, 2022 23:50
@cameronnunez cameronnunez requested a review from a team as a code owner February 25, 2022 23:50
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from 5aee774 to fecc6a7 on February 27, 2022 23:02
@cameronnunez cameronnunez marked this pull request as draft February 28, 2022 02:16
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from a15a71f to b5522bf on February 28, 2022 03:30
@cameronnunez cameronnunez marked this pull request as ready for review February 28, 2022 03:31
@cameronnunez (Contributor, Author) commented Feb 28, 2022

In action, it looks like:

  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   4 |  true   |        5 |        true        | decommissioning |    false
(1 row)
....................
W220228 03:18:56.167425 1 1@cli/node.go:626  [-] 7  possible decommission stall detected; reporting decommissioning replicas
W220228 03:18:56.167454 1 1@cli/node.go:633  [-] 8  n4 decommissioning replica 2 for r7
W220228 03:18:56.167467 1 1@cli/node.go:633  [-] 9  n4 decommissioning replica 4 for r13
W220228 03:18:56.167474 1 1@cli/node.go:633  [-] 10  n4 decommissioning replica 3 for r33
W220228 03:18:56.167480 1 1@cli/node.go:633  [-] 11  n4 decommissioning replica 3 for r35
W220228 03:18:56.167487 1 1@cli/node.go:633  [-] 12  n4 decommissioning replica 4 for r42
  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   4 |  true   |        4 |        true        | decommissioning |    false
(1 row)

@cameronnunez cameronnunez requested a review from knz February 28, 2022 15:09
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 3 times, most recently from 3886837 to 192654a on February 28, 2022 17:26
@knz (Contributor) left a comment

This is nice! 💯

Where is the test code for it though?

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @cameronnunez, and @irfansharif)


pkg/cli/node.go, line 514 at r1 (raw file):

			// Set verbosity to true if there's been significant time of no progress.
			if sameStatusCount == sameStatusThreshold {

For ease of readability / maintenance / extensibility, I recommend >= here.
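
A minimal sketch of the suggested form, reusing the names from the quoted snippet (the verbose variable is a hypothetical stand-in for whatever the surrounding code actually toggles):

	// >= stays correct even if a future change ever increments the
	// counter past the exact threshold, and reads as "at or beyond".
	if sameStatusCount >= sameStatusThreshold {
		verbose = true // hypothetical; stands in for the real verbosity toggle
	}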


pkg/cli/node.go, line 623 at r1 (raw file):

func reportDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr)
	log.Ops.Warning(ctx, "possible decommission stall detected; reporting decommissioning replicas")

nit: I think this is a good case for fmt.Fprintf(stderr, ...) i.e. we want to see the warnings in the CLI regardless of the logging config.
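
For illustration, a sketch of that suggestion with the same message text (stderr is the CLI's stderr writer already used in the quoted snippet):

	// Print straight to stderr so the operator sees the warning in the
	// CLI regardless of how logging is configured.
	fmt.Fprintln(stderr, "possible decommission stall detected; reporting decommissioning replicas")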


pkg/cli/node.go, line 627 at r1 (raw file):

	for _, status := range resp.Status {
		for _, replica := range status.Replicas {
			log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)

ditto


pkg/cli/node.go, line 628 at r1 (raw file):

		for _, replica := range status.Replicas {
			log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)
		}

(Separately, there may be wisdom in adding log.Ops.XXX server-side, in addition to what the client does, so that monitoring tools that only look at the log output also notice the stalled replicas. Your call.)
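
As a hedged illustration of that idea, a server-side counterpart might look like the following (the exact placement and surrounding variables are assumptions, not the merged code):

	// Server-side twin of the CLI warning, so monitoring tools that only
	// scrape the node logs also notice the stalled replicas.
	log.Ops.Warningf(ctx, "decommissioning n%d still has replica %d for r%d",
		status.NodeID, replica.ReplicaID, replica.RangeID)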


pkg/server/admin.go, line 2310 at r1 (raw file):

	// operator. reportLimit is the number of replicas reported for each node.
	var replicasToReport map[roachpb.NodeID][]*serverpb.DecommissionStatusResponse_Replica
	const reportLimit = 5

Why did you choose a const here instead of a parameter in the request that can be customized by the client?
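
As the later revisions show, this ended up as a num_replica_report field on the request. A sketch of how the server might honor such a field while keeping a sensible default (the surrounding handler code is assumed):

	// Let the client pick how many replicas to report per node, falling
	// back to a server-side default when the field is unset.
	reportLimit := int(req.NumReplicaReport)
	if reportLimit <= 0 {
		reportLimit = 5
	}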

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 4 times, most recently from b4235d7 to 05adcce on February 28, 2022 22:36
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 5 times, most recently from 078e5a4 to da0c29c on March 8, 2022 20:31
@cameronnunez (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, @knz, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go, line 1117 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

for my curiosity - you can do the same with 4 nodes instead of 6, right?

Not when forcing a decommission stall, as far as I know.


pkg/cmd/roachtest/tests/decommission.go, line 1133 at r3 (raw file):

Previously, knz (kena) wrote…

why do you need a workload for the test?

Fixed.


pkg/cmd/roachtest/tests/decommission.go, line 1141 at r3 (raw file):

Previously, knz (kena) wrote…

is there a way to use the existing "wait for full replication" code from other tests?

Yes, just opened a PR to update the up-replication wait utility. #77499


pkg/cmd/roachtest/tests/decommission.go, line 1199 at r3 (raw file):

Previously, knz (kena) wrote…

if you remove the workload, you'll see that it's possible for the stall to happen much faster than 5 minutes.

Yup, brought it down to 3 minutes.


pkg/cmd/roachtest/tests/decommission.go, line 1207 at r3 (raw file):

Previously, knz (kena) wrote…

You can use the same logic as testutils.SucceedsSoon to wait until the decommissioning stall progress appears in logs, and then stop the test immediately. This prevents waiting for the full timeout delay in case the stall message occurs sooner than that.

Done.
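
For illustration, that polling pattern could look like the sketch below, built on retry.ForDuration from pkg/util/retry; the log-scanning helper is hypothetical, and the three-minute budget mirrors the timeout discussed above:

	// Poll for the stall warning instead of sleeping out the full budget,
	// so the test stops as soon as the message appears in the logs.
	if err := retry.ForDuration(3*time.Minute, func() error {
		if logsContainStallWarning() { // hypothetical helper that greps the test logs
			return nil
		}
		return errors.New("stall warning not seen in logs yet")
	}); err != nil {
		t.Fatal(err)
	}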


pkg/server/admin.go, line 2310 at r1 (raw file):

Previously, knz (kena) wrote…

it's easier to change the cli code after the fact (e.g. admins can use the CLI from a later version). So this is good.

Sounds good.


pkg/server/admin.go, line 2344 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

should probably be '<'?

Fixed.


pkg/server/serverpb/admin.proto, line 511 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

perhaps worth mentioning that this is tied to num_replica_report from the request? so that it will be clear this is not a complete list.. maybe even rename to reported_replicas? up to you.

Fixed.

@knz (Contributor) left a comment

:lgtm: nice

Reviewed 16 of 18 files at r5, 6 of 6 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, and @lidorcarmel)

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from e1d7112 to 835d8ca on March 11, 2022 17:13
@aayushshah15 (Contributor) left a comment

:lgtm: except for the comment about changing the wording. Apologies for taking so long to get to this.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @cameronnunez, @irfansharif, @knz, and @lidorcarmel)


pkg/cli/node.go, line 630 at r7 (raw file):

	for _, nodeStatus := range resp.Status {
		for _, replica := range nodeStatus.ReportedReplicas {
			fmt.Fprintf(stderr,

Just to confirm, these messages will show up in a debug zip, yes?


pkg/cli/node.go, line 631 at r7 (raw file):

		for _, replica := range nodeStatus.ReportedReplicas {
			fmt.Fprintf(stderr,
				"n%d decommissioning replica %d for r%d\n",

This message almost reads like the decommissioning node is responsible for moving its own replicas off. I'd suggest changing this to something simpler like "n%d still has replica id %d for range r%d".
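
A sketch of the suggested wording inside the quoted loop (argument names follow the snippet above):

	fmt.Fprintf(stderr,
		// Reworded so it does not imply the node moves its own replicas off.
		"n%d still has replica id %d for range r%d\n",
		nodeStatus.NodeID, replica.ReplicaID, replica.RangeID)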

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from 7152bbb to a48b62b on March 15, 2022 18:06
@cameronnunez (Contributor, Author) commented:

TFYRs!

bors=knz,aayushshah15

@cameronnunez (Contributor, Author) commented:

bors r=knz,aayushshah15

@craig (bot) commented Mar 20, 2022

Build failed.

@cameronnunez (Contributor, Author) commented:

bors r-

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from a48b62b to bd8e03d on March 21, 2022 13:24
server: improve visibility of ranges that fail to move during decommissioning

This patch makes it so that when a decommission is slow or stalls, the
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from bd8e03d to 920d94c on March 28, 2022 14:36
@cameronnunez (Contributor, Author) commented:

bors r=knz,aayushshah15

@craig (bot) commented Mar 28, 2022

Build succeeded.

@cameronnunez (Contributor, Author) commented Mar 29, 2022

@knz we'll want to backport this to 22.1, right?

@knz (Contributor) commented Mar 29, 2022

yes

@knz (Contributor) left a comment

Reviewed 1 of 18 files at r7, 3 of 4 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/cli/node.go, line 626 at r12 (raw file):

func printDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")

Before you can backport this, you will need to tweak the message here to remove "reporting decommissioning replicas" if there is no replica to report in the response payload.
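
A minimal sketch of that tweak, assuming the response shape from the quoted snippet:

	// Only promise a replica report when there is something to report;
	// otherwise the backported message would announce output that never comes.
	hasReplicas := false
	for _, nodeStatus := range resp.Status {
		if len(nodeStatus.ReportedReplicas) > 0 {
			hasReplicas = true
			break
		}
	}
	msg := "\npossible decommission stall detected"
	if hasReplicas {
		msg += "; reporting decommissioning replicas"
	}
	fmt.Fprintln(stderr, msg)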

@cameronnunez (Contributor, Author) commented:

understood 👍
