server: improve visibility of ranges that fail to move during decommissioning #76516
Conversation
Force-pushed d2c9778 to 4def6e2
Force-pushed 99580fb to 5aee774
Force-pushed 5aee774 to fecc6a7
Force-pushed a15a71f to b5522bf
In action it looks like: [screenshot]
Force-pushed 3886837 to 192654a
This is nice! 💯
Where is the test code for it though?
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @cameronnunez, and @irfansharif)
pkg/cli/node.go, line 514 at r1 (raw file):
// Set verbosity to true if there's been significant time of no progress.
if sameStatusCount == sameStatusThreshold {
For ease of readability / maintenance / extensibility, I recommend >= here.
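A minimal sketch of the suggested change (the variable names come from the quoted snippet; the surrounding function and the `verbose` flag are assumptions):

```go
// Hypothetical sketch: switch to verbose reporting once progress has been
// stuck for at least the threshold number of polls. Using >= instead of ==
// keeps the check correct even if the counter ever jumps past the threshold.
if sameStatusCount >= sameStatusThreshold {
	verbose = true
}
```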
pkg/cli/node.go, line 623 at r1 (raw file):
func reportDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr)
	log.Ops.Warning(ctx, "possible decommission stall detected; reporting decommissioning replicas")
nit: I think this is a good case for fmt.Fprintf(stderr, ...), i.e. we want to see the warnings in the CLI regardless of the logging config.
pkg/cli/node.go, line 627 at r1 (raw file):
for _, status := range resp.Status {
	for _, replica := range status.Replicas {
		log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)
ditto
pkg/cli/node.go, line 628 at r1 (raw file):
	for _, replica := range status.Replicas {
		log.Ops.Warningf(ctx, "n%d decommissioning replica %d for r%d", status.NodeID, replica.ReplicaID, replica.RangeID)
	}
(Separately, there may be wisdom in adding log.Ops.XXX calls server-side, in addition to what the client does, so that monitoring tools that only look at the log output also notice the stalled replicas. Your call.)
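What the reviewer is suggesting would look roughly like the sketch below, assuming the pkg/cli context of the quoted code (where stderr and serverpb are already in scope); the exact wording is illustrative:

```go
// Sketch: write the stall report straight to stderr so it is visible in the
// CLI regardless of the logging configuration.
func reportDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")
	for _, status := range resp.Status {
		for _, replica := range status.Replicas {
			fmt.Fprintf(stderr, "n%d decommissioning replica %d for r%d\n",
				status.NodeID, replica.ReplicaID, replica.RangeID)
		}
	}
	// Optionally, as noted above, the server could additionally emit a
	// log.Ops warning so log-only monitoring tools notice the stall too.
}
```

The later revisions quoted further down (r7 and r12) show the change ended up in roughly this shape.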
pkg/server/admin.go, line 2310 at r1 (raw file):
// operator. reportLimit is the number of replicas reported for each node.
var replicasToReport map[roachpb.NodeID][]*serverpb.DecommissionStatusResponse_Replica
const reportLimit = 5
Why did you choose a const here instead of a parameter in the request that can be customized by the client?
Force-pushed b4235d7 to 05adcce
Force-pushed 078e5a4 to da0c29c
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, @knz, and @lidorcarmel)
pkg/cmd/roachtest/tests/decommission.go, line 1117 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
for my curiosity - you can do the same with 4 nodes instead of 6, right?
Not when forcing a decommission stall, as far as I know.
pkg/cmd/roachtest/tests/decommission.go, line 1133 at r3 (raw file):
Previously, knz (kena) wrote…
why do you need a workload for the test?
Fixed.
pkg/cmd/roachtest/tests/decommission.go, line 1141 at r3 (raw file):
Previously, knz (kena) wrote…
is there a way to use the existing "wait for full replication" code from other tests?
Yes, just opened a PR to update the up-replication wait utility. #77499
pkg/cmd/roachtest/tests/decommission.go, line 1199 at r3 (raw file):
Previously, knz (kena) wrote…
if you remove the workload, you'll see that it's possible for the stall to happen much faster than 5 minutes.
Yup, brought it down to 3 minutes.
pkg/cmd/roachtest/tests/decommission.go, line 1207 at r3 (raw file):
Previously, knz (kena) wrote…
You can use the same logic as testutils.SucceedsSoon to wait until the decommissioning stall progress appears in logs, and then stop the test immediately. This prevents waiting for the full timeout delay in case the stall message occurs sooner than that.
Done.
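A rough sketch of the suggested wait loop, in the spirit of testutils.SucceedsSoon; checkLogsForStallMessage, c, t, and the timings are stand-ins for the test's own helpers, not the actual roachtest code:

```go
// Hypothetical sketch: poll until the stall report shows up in the logs and
// stop immediately, rather than always waiting out the full timeout.
deadline := time.Now().Add(3 * time.Minute)
for {
	found, err := checkLogsForStallMessage(ctx, c)
	if err != nil {
		t.Fatal(err)
	}
	if found {
		break // stall report observed; the test can finish here
	}
	if time.Now().After(deadline) {
		t.Fatal("timed out waiting for decommission stall report")
	}
	time.Sleep(5 * time.Second)
}
```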
pkg/server/admin.go, line 2310 at r1 (raw file):
Previously, knz (kena) wrote…
it's easier to change the cli code after the fact (e.g. admins can use the CLI from a later version). So this is good.
Sounds good.
pkg/server/admin.go, line 2344 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
should probably be '<'?
Fixed.
pkg/server/serverpb/admin.proto, line 511 at r3 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
perhaps worth mentioning that this is tied to num_replica_report from the request, so that it will be clear this is not a complete list... maybe even rename it to reported_replicas? Up to you.
Fixed.
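A rough sketch of how the request-side knob could be wired up on the server; the Go field name and default are assumptions based on the num_replica_report discussion, not the actual implementation:

```go
// Hypothetical sketch: honor a client-provided num_replica_report (assumed Go
// field name NumReplicaReport) and fall back to a server-side default when it
// is left unset.
reportLimit := 5 // default, matching the const discussed above
if n := int(req.NumReplicaReport); n > 0 {
	reportLimit = n
}
```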
Reviewed 16 of 18 files at r5, 6 of 6 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @irfansharif, and @lidorcarmel)
Force-pushed e1d7112 to 835d8ca
Looks good except for the comment about changing the wording. Apologies for taking so long to get to this.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @cameronnunez, @irfansharif, @knz, and @lidorcarmel)
pkg/cli/node.go, line 630 at r7 (raw file):
for _, nodeStatus := range resp.Status {
	for _, replica := range nodeStatus.ReportedReplicas {
		fmt.Fprintf(stderr,
Just to confirm, these messages will show up in a debug zip, yes?
pkg/cli/node.go, line 631 at r7 (raw file):
	for _, replica := range nodeStatus.ReportedReplicas {
		fmt.Fprintf(stderr, "n%d decommissioning replica %d for r%d\n",
This message almost reads like the decommissioning node is responsible for moving its own replicas off. I'd suggest changing this to something simpler like "n%d still has replica id %d for range r%d".
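A minimal sketch of the suggested rewording (same loop as quoted above; only the format string changes):

```go
// Reworded so it no longer reads as if the decommissioning node is
// responsible for moving its own replicas off.
fmt.Fprintf(stderr, "n%d still has replica id %d for range r%d\n",
	nodeStatus.NodeID, replica.ReplicaID, replica.RangeID)
```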
Force-pushed 7152bbb to a48b62b
TFYRs! bors=knz,aayushshah15
bors r=knz,aayushshah15
Build failed:
bors r-
Force-pushed a48b62b to bd8e03d
…ssioning

This patch makes it so that when a decommission is slow or stalls, the descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality
Force-pushed bd8e03d to 920d94c
bors r=knz,aayushshah15
Build succeeded:
@knz we'll want to backport this to 22.1 right?
yes
Reviewed 1 of 18 files at r7, 3 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/cli/node.go, line 626 at r12 (raw file):
func printDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
	fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")
Before you can backport this, you will need to tweak the message here to remove "reporting decommissioning replicas" if there are no replicas to report in the response payload.
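A minimal sketch of the suggested tweak, assuming the function shape quoted above (the check for whether any replicas exist happens before printing the stall header):

```go
// Hypothetical sketch: only say "reporting decommissioning replicas" when the
// response payload actually contains replicas to report.
anyReplicas := false
for _, nodeStatus := range resp.Status {
	if len(nodeStatus.ReportedReplicas) > 0 {
		anyReplicas = true
		break
	}
}
if anyReplicas {
	fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")
} else {
	fmt.Fprintln(stderr, "\npossible decommission stall detected")
}
```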
understood 👍
Fixes #76249. Informs #74158.
This patch makes it so that when a decommission is slow or stalls, the
descriptions of some "stuck" replicas are printed to the operator.
Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.
Release justification: low risk, high benefit changes to existing functionality