
server: improve visibility of ranges that fail to move during decommissioning #76516

Merged 1 commit on Mar 28, 2022

Conversation

@cameronnunez (Contributor) commented Feb 14, 2022

Fixes #76249. Informs #74158.

This patch makes it so that when a decommission is slow or stalls, the
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality

@cameronnunez cameronnunez requested a review from a team as a code owner February 14, 2022 17:20
@cockroach-teamcity (Member) commented: This change is Reviewable

@cameronnunez cameronnunez marked this pull request as draft February 14, 2022 17:21
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from d2c9778 to 4def6e2 on February 14, 2022 17:24
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 6 times, most recently from 99580fb to 5aee774 on February 25, 2022 23:34
@cameronnunez cameronnunez changed the title from "[WIP] server: improve visibility of ranges that fail to move during decommissioning" to "server: improve visibility of ranges that fail to move during decommissioning" on Feb 25, 2022
@cameronnunez cameronnunez marked this pull request as ready for review February 25, 2022 23:50
@cameronnunez cameronnunez requested a review from a team as a code owner February 25, 2022 23:50
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from 5aee774 to fecc6a7 on February 27, 2022 23:02
@cameronnunez cameronnunez marked this pull request as draft February 28, 2022 02:16
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from a15a71f to b5522bf on February 28, 2022 03:30
@cameronnunez cameronnunez marked this pull request as ready for review February 28, 2022 03:31
@cameronnunez (Contributor, Author) commented Feb 28, 2022

In action, it looks like:

  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   4 |  true   |        5 |        true        | decommissioning |    false
(1 row)
....................
W220228 03:18:56.167425 1 1@cli/node.go:626  [-] 7  possible decommission stall detected; reporting decommissioning replicas
W220228 03:18:56.167454 1 1@cli/node.go:633  [-] 8  n4 decommissioning replica 2 for r7
W220228 03:18:56.167467 1 1@cli/node.go:633  [-] 9  n4 decommissioning replica 4 for r13
W220228 03:18:56.167474 1 1@cli/node.go:633  [-] 10  n4 decommissioning replica 3 for r33
W220228 03:18:56.167480 1 1@cli/node.go:633  [-] 11  n4 decommissioning replica 3 for r35
W220228 03:18:56.167487 1 1@cli/node.go:633  [-] 12  n4 decommissioning replica 4 for r42
  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   4 |  true   |        4 |        true        | decommissioning |    false
(1 row)

@cameronnunez cameronnunez requested a review from knz February 28, 2022 15:09
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 3 times, most recently from 3886837 to 192654a on February 28, 2022 17:26
@knz (Contributor) left a comment

This is nice! 💯

Where is the test code for it though?

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @cameronnunez, and @irfansharif)


pkg/cli/node.go, line 514 at r1 (raw file):

			// Set verbosity to true if there's been significant time of no progress.
			if sameStatusCount == sameStatusThreshold {

For ease of readability / maintenance / extensibility, I recommend >= here.
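
A minimal sketch of the suggested form, reusing the names from the quoted snippet (the verbose variable is a hypothetical stand-in for whatever the surrounding code actually toggles):

	// >= stays correct even if a future change ever increments the
	// counter past the exact threshold, and reads as "at or beyond".
	if sameStatusCount >= sameStatusThreshold {
		verbose = true // hypothetical; stands in for the real verbosity toggle
	}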


pkg/cli/node.go, line 623 at r1 (raw file):

func reportDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr)
	log.Ops.Warning(ctx, "possible decommission stall detected; reporting decommissioning replicas")

nit: I think this is a good case for fmt.Fprintf(stderr, ...) i.e. we want to see the warnings in the CLI regardless of the logging config.
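
For illustration, a sketch of that suggestion with the same message text (stderr is the CLI's stderr writer already used in the quoted snippet):

	// Print straight to stderr so the operator sees the warning in the
	// CLI regardless of how logging is configured.
	fmt.Fprintln(stderr, "possible decommission stall detected; reporting decommissioning replicas")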


pkg/cli/node.go, line 627 at r1 (raw file):

	for _, status := range resp.Status {
		for _, replica := range status.Replicas {
			log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)

ditto


pkg/cli/node.go, line 628 at r1 (raw file):

		for _, replica := range status.Replicas {
			log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)
		}

(Separately, there may be wisdom in adding log.Ops.XXX server-side, in addition to what the client does, so that monitoring tools that only look at the log output also notice the stalled replicas. Your call.)
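
As a hedged illustration of that idea, a server-side counterpart might look like the following (the exact placement and surrounding variables are assumptions, not the merged code):

	// Server-side twin of the CLI warning, so monitoring tools that only
	// scrape the node logs also notice the stalled replicas.
	log.Ops.Warningf(ctx, "decommissioning n%d still has replica %d for r%d",
		status.NodeID, replica.ReplicaID, replica.RangeID)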


pkg/server/admin.go, line 2310 at r1 (raw file):

	// operator. reportLimit is the number of replicas reported for each node.
	var replicasToReport map[roachpb.NodeID][]*serverpb.DecommissionStatusResponse_Replica
	const reportLimit = 5

Why did you choose a const here instead of a parameter in the request that can be customized by the client?
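
As the later revisions show, this ended up as a num_replica_report field on the request. A sketch of how the server might honor such a field while keeping a sensible default (the surrounding handler code is assumed):

	// Let the client pick how many replicas to report per node, falling
	// back to a server-side default when the field is unset.
	reportLimit := int(req.NumReplicaReport)
	if reportLimit <= 0 {
		reportLimit = 5
	}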

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 4 times, most recently from b4235d7 to 05adcce on February 28, 2022 22:36
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 5 times, most recently from 078e5a4 to da0c29c on March 8, 2022 20:31
@cameronnunez (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, @knz, and @lidorcarmel)


pkg/cmd/roachtest/tests/decommission.go, line 1117 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

for my curiosity - you can do the same with 4 nodes instead of 6, right?

Not when forcing a decommission stall, as far as I know.


pkg/cmd/roachtest/tests/decommission.go, line 1133 at r3 (raw file):

Previously, knz (kena) wrote…

why do you need a workload for the test?

Fixed.


pkg/cmd/roachtest/tests/decommission.go, line 1141 at r3 (raw file):

Previously, knz (kena) wrote…

is there a way to use the existing "wait for full replication" code from other tests?

Yes, just opened a PR to update the up-replication wait utility. #77499


pkg/cmd/roachtest/tests/decommission.go, line 1199 at r3 (raw file):

Previously, knz (kena) wrote…

if you remove the workload, you'll see that it's possible for the stall to happen much faster than 5 minutes.

Yup, brought it down to 3 minutes.


pkg/cmd/roachtest/tests/decommission.go, line 1207 at r3 (raw file):

Previously, knz (kena) wrote…

You can use the same logic as testutils.SucceedsSoon to wait until the decommissioning stall progress appears in logs, and then stop the test immediately. This prevents waiting for the full timeout delay in case the stall message occurs sooner than that.

Done.
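
For illustration, that polling pattern could look like the sketch below, built on retry.ForDuration from pkg/util/retry; the log-scanning helper is hypothetical, and the three-minute budget mirrors the timeout discussed above:

	// Poll for the stall warning instead of sleeping out the full budget,
	// so the test stops as soon as the message appears in the logs.
	if err := retry.ForDuration(3*time.Minute, func() error {
		if logsContainStallWarning() { // hypothetical helper that greps the test logs
			return nil
		}
		return errors.New("stall warning not seen in logs yet")
	}); err != nil {
		t.Fatal(err)
	}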


pkg/server/admin.go, line 2310 at r1 (raw file):

Previously, knz (kena) wrote…

it's easier to change the cli code after the fact (e.g. admins can use the CLI from a later version). So this is good.

Sounds good.


pkg/server/admin.go, line 2344 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

should probably be '<'?

Fixed.


pkg/server/serverpb/admin.proto, line 511 at r3 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

perhaps worth mentioning that this is tied to num_replica_report from the request? so that it will be clear this is not a complete list.. maybe even rename to reported_replicas? up to you.

Fixed.

@knz (Contributor) left a comment

:lgtm: nice

Reviewed 16 of 18 files at r5, 6 of 6 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, and @lidorcarmel)

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from e1d7112 to 835d8ca on March 11, 2022 17:13
@aayushshah15 (Contributor) left a comment

:lgtm: except for the comment about changing the wording. Apologies for taking so long to get to this.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @cameronnunez, @irfansharif, @knz, and @lidorcarmel)


pkg/cli/node.go, line 630 at r7 (raw file):

	for _, nodeStatus := range resp.Status {
		for _, replica := range nodeStatus.ReportedReplicas {
			fmt.Fprintf(stderr,

Just to confirm, these messages will show up in a debug zip, yes?


pkg/cli/node.go, line 631 at r7 (raw file):

		for _, replica := range nodeStatus.ReportedReplicas {
			fmt.Fprintf(stderr,
				"n%d decommissioning replica %d for r%d\n",

This message almost reads like the decommissioning node is responsible for moving its own replicas off. I'd suggest changing this to something simpler like "n%d still has replica id %d for range r%d".
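
A sketch of the suggested wording inside the quoted loop (argument names follow the snippet above):

	fmt.Fprintf(stderr,
		// Reworded so it does not imply the node moves its own replicas off.
		"n%d still has replica id %d for range r%d\n",
		nodeStatus.NodeID, replica.ReplicaID, replica.RangeID)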

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch 2 times, most recently from 7152bbb to a48b62b on March 15, 2022 18:06
@cameronnunez (Contributor, Author) commented:

TFYRs!

bors=knz,aayushshah15

@cameronnunez (Contributor, Author) commented:

bors r=knz,aayushshah15

@craig (bot) commented Mar 20, 2022

Build failed.

@cameronnunez (Contributor, Author) commented:

bors r-

@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from a48b62b to bd8e03d on March 21, 2022 13:24
server: improve visibility of ranges that fail to move during decommissioning

This patch makes it so that when a decommission is slow or stalls, the
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality
@cameronnunez cameronnunez force-pushed the increase-verbosity-decomm branch from bd8e03d to 920d94c on March 28, 2022 14:36
@cameronnunez (Contributor, Author) commented:

bors r=knz,aayushshah15

@craig (bot) commented Mar 28, 2022

Build succeeded.

@cameronnunez (Contributor, Author) commented Mar 29, 2022

@knz we'll want to backport this to 22.1, right?

@knz (Contributor) commented Mar 29, 2022

yes

@knz (Contributor) left a comment

Reviewed 1 of 18 files at r7, 3 of 4 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/cli/node.go, line 626 at r12 (raw file):

func printDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")

Before you can backport this, you will need to tweak the message here to remove "reporting decommissioning replicas" if there is no replica to report in the response payload.
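
A minimal sketch of that tweak, assuming the response shape from the quoted snippet:

	// Only promise a replica report when there is something to report;
	// otherwise the backported message would announce output that never comes.
	hasReplicas := false
	for _, nodeStatus := range resp.Status {
		if len(nodeStatus.ReportedReplicas) > 0 {
			hasReplicas = true
			break
		}
	}
	msg := "\npossible decommission stall detected"
	if hasReplicas {
		msg += "; reporting decommissioning replicas"
	}
	fmt.Fprintln(stderr, msg)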

@cameronnunez (Contributor, Author) commented:

understood 👍
