-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
loqrecovery: make make-plan CLI print updates #127001
Conversation
f0abb9c
to
9e1d396
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I like the approach you landed on. A few comments but mostly nits.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)
-- commits
line 4 at r1:
nit: We usually wrap commit messages to 72 characters.
"is it stuck" -> "it is stuck"
-- commits
line 8 at r1:
nit: It's definitely great to include the verification steps in the PR but not necessarily in the commit message. It makes sense to include results in the commit message if they demonstrate some improvement that can be referenced in the release notes or other external docs. But for repros, I usually just include the steps as a comment in the PR, or in the description.
-- commits
line 51 at r1:
nit: There are a few key words that get parsed from commit messages, and are useful in specifying the GH issue. E.g. if this is a single-PR fix for the issue you're working on, you can add Fixes: #122640
, and that will auto-close the GH issue when this PR merges.
-- commits
line 53 at r1:
nit: The Jira issue and Epic get linked on the PR automatically, no need to include them here.
pkg/kv/kvserver/loqrecovery/collect.go
line 54 at r1 (raw file):
// and when a node needs to be revisited. func CollectRemoteReplicaInfo( ctx context.Context, c serverpb.AdminClient, maxConcurrency int, logOutput *os.File,
nit: I see stderr
passed in as a io.Writer
in other places. Might be a more general way to do it.
pkg/kv/kvserver/loqrecovery/collect.go
line 80 at r1 (raw file):
if _, ok := replInfoMap[r.NodeID]; !ok && logOutput != nil { _, _ = fmt.Fprintf(logOutput, "Started getting replica info for node_id:%d.\n", r.NodeID)
Out of curiosity, when you run the tool, does it run fairly fast and print all of these at once? Or does it go more slowly, giving a sense of progress?
pkg/cli/debug_recover_loss_of_quorum_test.go
line 727 at r1 (raw file):
require.NoError(t, err, "failed to run make-plan") require.Contains(t, out, "Started getting replica info for node_id:1",
Do you need to run an entire LoQ setup and recovery to test this? Is it possible to add these assertions to one of the TestCollectInfo*
tests?
Verifying the tool:
Output while the tool was running:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @miraradeva)
pkg/kv/kvserver/loqrecovery/collect.go
line 80 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Out of curiosity, when you run the tool, does it run fairly fast and print all of these at once? Or does it go more slowly, giving a sense of progress?
Yes, the new logs seem to indicate actual progress. Like there is a log for when the node is first visited. That made it print a log line every ~10 seconds in my test (when running the LoQ as single threaded).
pkg/cli/debug_recover_loss_of_quorum_test.go
line 727 at r1 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Do you need to run an entire LoQ setup and recovery to test this? Is it possible to add these assertions to one of the
TestCollectInfo*
tests?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)
-- commits
line 13 at r2:
nit: Just "Fixes: #122640". GH links it in the UI.
pkg/kv/kvserver/loqrecovery/collect.go
line 80 at r1 (raw file):
Previously, iskettaneh wrote…
Yes, the new logs seem to indicate actual progress. Like there is a log for when the node is first visited. That made it print a log line every ~10 seconds in my test (when running the LoQ as single threaded).
Nice!
pkg/kv/kvserver/loqrecovery/collect.go
line 90 at r2 (raw file):
if logOutput != nil { _, _ = fmt.Fprintf(logOutput, "Discarding replica info for node_id:%d."+ "The node will be revisted\n", s.NodeID)
nit: Full stop at the end of "The node will be revisited.".
Right now, the loss-of-quorum CLI tool only prints the final recovery plan after it finishes running. However, for large clusters, it might take a long time until it finishes. This made it unclear whether the tool is still making progress, or it is stuck. This PR changes that by making the tool print some of the server updates. In particular, the CLI tool now will print the node that the server is currently streaming replica info from. Fixes: cockroachdb#122640 Release note: None
bors r+ |
loqrecovery: make make-plan CLI print updates
Right now, the loss-of-quorum CLI tool only prints the final recovery
plan after it finishes running. However, for large clusters, it might
take a long time until it finishes. This made it unclear whether the
tool is still making progress, or it is stuck.
This PR changes that by making the tool print some of the server
updates. In particular, the CLI tool now will print the node that the
server is currently streaming replica info from.
Fixes: #122640
Release note: None