Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loqrecovery,cli: add debug recover verify command #96265

Merged

Conversation

aliher1911
Copy link
Contributor

@aliher1911 aliher1911 commented Jan 31, 2023

This commit adds debug recovery verify command which provides
the status of loss of quorum recovery plan application status.
The command is used after debug recover apply-plan was used to
stage a recovery plan on a cluster to check application progress.
It allows user to check which nodes still needs to be restarted,
outcome of recovery on restarted nodes and health of ranges on
the entire cluster.

Release note: None

Fixes #93043

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911 aliher1911 changed the title strutil: add helper to join slices of identifiers loqrecovery,cli: verify command to check progress of loss of quorum recovery application Jan 31, 2023
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 2 times, most recently from 2a51d42 to 1b2e12a Compare February 1, 2023 11:56
@aliher1911 aliher1911 changed the title loqrecovery,cli: verify command to check progress of loss of quorum recovery application loqrecovery,cli: add debug recover verify commend Feb 1, 2023
@aliher1911 aliher1911 changed the title loqrecovery,cli: add debug recover verify commend loqrecovery,cli: add debug recover verify command Feb 1, 2023
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 6 times, most recently from e891bd4 to ae7f52b Compare February 1, 2023 19:37
@aliher1911 aliher1911 self-assigned this Feb 2, 2023
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 7 times, most recently from 120d06c to 626408c Compare February 6, 2023 10:26
@aliher1911 aliher1911 marked this pull request as ready for review February 6, 2023 10:26
@aliher1911 aliher1911 requested review from a team as code owners February 6, 2023 10:26
@aliher1911 aliher1911 requested review from a team February 6, 2023 10:26
@aliher1911 aliher1911 requested a review from a team as a code owner February 6, 2023 10:26
@aliher1911
Copy link
Contributor Author

Might need to disable stress race on some integration tests that need to wait for meta queries to time out. Will do that once next round of CI runs is complete.

@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 2 times, most recently from 81a5a9a to acd47d2 Compare February 6, 2023 12:07
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 3 times, most recently from 8f734dd to dcbfaae Compare February 6, 2023 20:38
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch 4 times, most recently from 208a714 to 4b36ef5 Compare February 7, 2023 10:27
pkg/testutils/listener.go Outdated Show resolved Hide resolved
pkg/base/test_server_args.go Outdated Show resolved Hide resolved
pkg/testutils/listener.go Show resolved Hide resolved
pkg/testutils/listener.go Show resolved Hide resolved
pkg/kv/kvserver/loqrecovery/loqrecoverypb/recovery.proto Outdated Show resolved Hide resolved
pkg/kv/kvserver/loqrecovery/loqrecoverypb/recovery.proto Outdated Show resolved Hide resolved
pkg/server/loss_of_quorum.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/loqrecovery/server.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/loqrecovery/server.go Outdated Show resolved Hide resolved
pkg/cli/debug_recover_loss_of_quorum.go Outdated Show resolved Hide resolved
@dhartunian dhartunian removed request for a team February 7, 2023 14:46
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch from 4b36ef5 to a0dd7b9 Compare February 8, 2023 10:26
This commit adds debug recovery verify command which provides
the status of loss of quorum recovery plan application status.
The command is used after debug recover apply-plan was used to
stage a recovery plan on a cluster to check application progress.
It allows user to check which nodes still needs to be restarted,
outcome of recovery on restarted nodes and health of ranges on
the entire cluster.

Release note: None
@aliher1911 aliher1911 force-pushed the loq_05online_verify_application branch from a0dd7b9 to cbf0914 Compare February 8, 2023 10:32
@aliher1911
Copy link
Contributor Author

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Feb 8, 2023

Build succeeded:

@craig craig bot merged commit ec38fb4 into cockroachdb:master Feb 8, 2023
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 21 files at r1, 2 of 10 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/testutils/listener.go line 38 at r2 (raw file):

}

func NewListenerRegistry() ListenerRegistry {

Please add explanatory comments on all the exported methods in this file in a follow-up PR. I am surprised this wasn't caught by our linter.


pkg/testutils/listener.go line 155 at r2 (raw file):

}

// Addr implements net.Listener interface

nit: period at end of comment.


pkg/util/grpcutil/grpc_util.go line 69 at r2 (raw file):

func IsConnectionUnavailable(err error) bool {
	if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
		return s.Code() == codes.Unavailable || s.Code() == codes.FailedPrecondition

@aliher1911 what is this change about? I don't see why it is necessary in the rest of the PR. Also, the comment above the function has become stale. I would welcome a follow-up PR which clarifies both.

@aliher1911
Copy link
Contributor Author

pkg/util/grpcutil/grpc_util.go line 69 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@aliher1911 what is this change about? I don't see why it is necessary in the rest of the PR. Also, the comment above the function has become stale. I would welcome a follow-up PR which clarifies both.

This method is used to check if remote servers are unavailable, but it was not working correctly when node itself blacklists remote end regardless if it is available or not. For the sake of check those servers are unavailable as well because cluster would never be able to connect. I'll add clarification comment in the follow up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loqrecovery: verify replica update plan application
4 participants