loqrecovery: make make-plan CLI print updates #127001

Merged (1 commit) on Jul 16, 2024

@iskettaneh iskettaneh commented Jul 11, 2024

loqrecovery: make make-plan CLI print updates

Right now, the loss-of-quorum CLI tool only prints the final recovery
plan after it finishes running. However, for large clusters, it might
take a long time to finish. This makes it unclear whether the tool is
still making progress or is stuck.

This PR changes that by making the tool print some of the server
updates. In particular, the CLI tool now prints the node that the
server is currently streaming replica info from.

Fixes: #122640

Release note: None

@iskettaneh iskettaneh self-assigned this Jul 11, 2024
@iskettaneh iskettaneh requested review from a team as code owners July 11, 2024 15:24

@iskettaneh iskettaneh force-pushed the loq branch 4 times, most recently from f0abb9c to 9e1d396 Compare July 11, 2024 17:38

@miraradeva miraradeva left a comment

Nice work! I like the approach you landed on. A few comments but mostly nits.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)


-- commits line 4 at r1:
nit: We usually wrap commit messages to 72 characters.

"is it stuck" -> "it is stuck"


-- commits line 8 at r1:
nit: It's definitely great to include the verification steps in the PR but not necessarily in the commit message. It makes sense to include results in the commit message if they demonstrate some improvement that can be referenced in the release notes or other external docs. But for repros, I usually just include the steps as a comment in the PR, or in the description.


-- commits line 51 at r1:
nit: A few keywords get parsed from commit messages and are useful for specifying the GH issue. E.g. if this is a single-PR fix for the issue you're working on, you can add Fixes: #122640, and that will auto-close the GH issue when this PR merges.


-- commits line 53 at r1:
nit: The Jira issue and Epic get linked on the PR automatically, no need to include them here.


pkg/kv/kvserver/loqrecovery/collect.go line 54 at r1 (raw file):

// and when a node needs to be revisited.
func CollectRemoteReplicaInfo(
	ctx context.Context, c serverpb.AdminClient, maxConcurrency int, logOutput *os.File,

nit: I see stderr passed in as an io.Writer in other places. That might be a more general way to do it.


pkg/kv/kvserver/loqrecovery/collect.go line 80 at r1 (raw file):

			if _, ok := replInfoMap[r.NodeID]; !ok && logOutput != nil {
				_, _ = fmt.Fprintf(logOutput, "Started getting replica info for node_id:%d.\n", r.NodeID)

Out of curiosity, when you run the tool, does it run fairly fast and print all of these at once? Or does it go more slowly, giving a sense of progress?


pkg/cli/debug_recover_loss_of_quorum_test.go line 727 at r1 (raw file):

	require.NoError(t, err, "failed to run make-plan")

	require.Contains(t, out, "Started getting replica info for node_id:1",

Do you need to run an entire LoQ setup and recovery to test this? Is it possible to add these assertions to one of the TestCollectInfo* tests?

@iskettaneh

Verifying the tool:

roachprod create ibrahimkettaneh-loq -n 16 --gce-machine-type=n2-standard-32 --local-ssd=true --gce-local-ssd-count=4 --gce-enable-multiple-stores
./dev build --cross; roachprod put ibrahimkettaneh-loq artifacts/cockroach cockroach
roachprod start  ibrahimkettaneh-loq --store-count=4
roachprod run    ibrahimkettaneh-loq:1 -- './cockroach workload init kv --splits=200000 {pgurl:1}'

roachprod sql ibrahimkettaneh-loq:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 9"
roachprod sql ibrahimkettaneh-loq:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 3"

roachprod run ibrahimkettaneh-loq:1 -- './cockroach workload init tpcc --warehouses=100000 {pgurl:1}'
roachprod run ibrahimkettaneh-loq:1 -- './cockroach workload run kv --min-block-bytes=40000 --max-block-bytes=40000 --concurrency=256 --read-percent=50 --duration=5m {pgurl:1-9}'

roachprod stop ibrahimkettaneh-loq:6,9

roachprod run ibrahimkettaneh-loq:1 -- './cockroach debug recover make-plan -c 1 --certs-dir=./certs --confirm y --port={pgport:1} > plan.json'
roachprod run ibrahimkettaneh-loq:1 -- './cockroach debug recover make-plan -c 4 --certs-dir=./certs --confirm y --port={pgport:1} > plan.json'
roachprod run ibrahimkettaneh-loq:1 -- './cockroach debug recover make-plan -c 32 --certs-dir=./certs --confirm y --port={pgport:1} > plan.json'

roachprod destroy ibrahimkettaneh-loq

Output while the tool was running:

Started getting replica info for node_id:15.
Started getting replica info for node_id:2.
Started getting replica info for node_id:1.
Started getting replica info for node_id:4.
Started getting replica info for node_id:13.
Started getting replica info for node_id:11.
Started getting replica info for node_id:12.
Started getting replica info for node_id:8.
Started getting replica info for node_id:6.
Started getting replica info for node_id:3.
Started getting replica info for node_id:5.
Started getting replica info for node_id:16.
Started getting replica info for node_id:14.
Started getting replica info for node_id:9.
Started getting replica info for node_id:7.
Started getting replica info for node_id:10.

@iskettaneh iskettaneh requested a review from miraradeva July 15, 2024 17:24

@iskettaneh iskettaneh left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @miraradeva)


pkg/kv/kvserver/loqrecovery/collect.go line 80 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Out of curiosity, when you run the tool, does it run fairly fast and print all of these at once? Or does it go more slowly, giving a sense of progress?

Yes, the new logs seem to indicate actual progress: a line is logged when each node is first visited. In my test, that printed a log line roughly every 10 seconds (when running the LoQ tool single-threaded).


pkg/cli/debug_recover_loss_of_quorum_test.go line 727 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Do you need to run an entire LoQ setup and recovery to test this? Is it possible to add these assertions to one of the TestCollectInfo* tests?

Done.


@miraradeva miraradeva left a comment

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)


-- commits line 13 at r2:
nit: Just "Fixes: #122640". GH links it in the UI.


pkg/kv/kvserver/loqrecovery/collect.go line 80 at r1 (raw file):

Previously, iskettaneh wrote…

Yes, the new logs seem to indicate actual progress: a line is logged when each node is first visited. In my test, that printed a log line roughly every 10 seconds (when running the LoQ tool single-threaded).

Nice!


pkg/kv/kvserver/loqrecovery/collect.go line 90 at r2 (raw file):

			if logOutput != nil {
				_, _ = fmt.Fprintf(logOutput, "Discarding replica info for node_id:%d."+
					"The node will be revisted\n", s.NodeID)

nit: Full stop at the end of "The node will be revisited.".

@iskettaneh

bors r+

@craig craig bot merged commit 8f2a27b into cockroachdb:master Jul 16, 2024
22 checks passed

Successfully merging this pull request may close these issues.

loqrecovery: output some indication of progress
3 participants