loqrecovery,admin,cli: check staged plans, stage recovery plan on cluster #95405

Merged: 2 commits merged into cockroachdb:master from the loq_05online_stage_plans branch on Jan 24, 2023

Conversation

@aliher1911 (Contributor) commented Jan 17, 2023

loqrecovery,admin,cli: stage recovery plan on cluster

This commit adds loss of quorum recovery plan staging on nodes.
The RecoveryStagePlan admin call distributes a recovery plan to the
relevant nodes of the cluster. To do so, it first verifies that the
cluster state is unchanged from the state in which the plan was
created and that there are no previously staged plans. It then
distributes the plan to all cluster nodes using a fan-out mechanism.
Each node in turn marks dead nodes as decommissioned and, if the plan
contains changes for that node, saves the plan in its local store.
The admin call is used by the debug recover apply-plan command when
the --host flag is passed to operate in half-online mode.

Release note: None
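
To make the staging flow concrete, here is a minimal Go sketch of the two-phase behaviour the commit message describes: a verification fan-out followed by a distribution fan-out. Every name below (Plan, nodeClient, stagePlan) is a hypothetical stand-in rather than the real loqrecovery API, which lives in pkg/kv/kvserver/loqrecovery.

package loqrecoverysketch

import (
	"context"
	"fmt"
)

// Plan is a simplified stand-in for a recovery plan: its ID, the nodes that
// must stage replica changes, and the dead nodes to mark as decommissioned.
type Plan struct {
	ID                  string
	UpdatedNodes        []int32
	DecommissionedNodes []int32
}

// nodeClient is a stand-in for the per-node admin RPC surface used by the fan-out.
type nodeClient interface {
	NodeID() int32
	StagedPlanID(ctx context.Context) (id string, ok bool, err error)
	StagePlan(ctx context.Context, p Plan, decommission []int32) error
}

// stagePlan mirrors the RecoveryStagePlan flow: verify that no node already
// holds a conflicting staged plan (the real call also verifies that the
// cluster state matches the state the plan was created from), then fan the
// plan out so each node marks dead nodes as decommissioned and, if the plan
// contains changes for it, persists the plan in its local store.
func stagePlan(ctx context.Context, nodes []nodeClient, p Plan, force bool) error {
	// Phase 1: verification fan-out. Abort on a conflicting staged plan
	// unless the caller explicitly asked to overwrite it.
	for _, n := range nodes {
		id, staged, err := n.StagedPlanID(ctx)
		if err != nil {
			return err
		}
		if staged && id != p.ID && !force {
			return fmt.Errorf("plan %s is already staged on node n%d", id, n.NodeID())
		}
	}
	// Phase 2: distribution fan-out. Once started this is not reversible,
	// so keep going and report any nodes the plan failed to reach.
	var unreached []int32
	for _, n := range nodes {
		if err := n.StagePlan(ctx, p, p.DecommissionedNodes); err != nil {
			unreached = append(unreached, n.NodeID())
		}
	}
	if len(unreached) > 0 {
		return fmt.Errorf("plan %s did not reach nodes %v", p.ID, unreached)
	}
	return nil
}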


loqrecovery,admin: implement endpoint to check staged plans

This commit adds a loss of quorum recovery verify call to the admin
interface. The call allows querying loss of quorum recovery status
from all nodes of the cluster, providing information about the loss
of quorum recovery plans staged on each node.

Release note: None
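
As a companion, here is a similarly hedged sketch of what the verify call collects: the plan, if any, staged on each reachable node, which is what the CLI later prints as "plan ... is staged on node nX". Again, statusClient and collectStagedPlans are illustrative names, not the actual admin API.

package loqrecoverysketch

import "context"

// statusClient is a stand-in for the per-node status query used by the
// verify fan-out.
type statusClient interface {
	NodeID() int32
	StagedPlanID(ctx context.Context) (id string, ok bool, err error)
}

// NodePlanStatus pairs a node with the ID of the recovery plan staged on it.
type NodePlanStatus struct {
	NodeID int32
	PlanID string
}

// collectStagedPlans queries every reachable node for its staged recovery
// plan so callers can summarize recovery progress across the cluster.
func collectStagedPlans(ctx context.Context, nodes []statusClient) ([]NodePlanStatus, error) {
	var out []NodePlanStatus
	for _, n := range nodes {
		id, ok, err := n.StagedPlanID(ctx)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, NodePlanStatus{NodeID: n.NodeID(), PlanID: id})
		}
	}
	return out, nil
}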


The state checkpoint is included in the PR, as it only provides the partial functionality of the state needed for the stage phase to work.

Fixes #93044
Fixes #74135
Touches #93043

When staging, the CLI presents the following report in the happy case:

$ cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true recover-plan.json
Proposed changes in plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4:
  range r93:/Table/106/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\n" updating replica 2 to 16.
  range r92:/Table/106/1/"los angeles"/"\x99\x99\x99\x99\x99\x99H\x00\x80\x00\x00\x00\x00\x00\x00\x1e" updating replica 2 to 16.
  range r91:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica 2 to 16.
  range r115:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica 2 to 15.
  range r80:/Table/107 updating replica 1 to 15.
  range r96:/Table/107/1/"san francisco"/"\x88\x88\x88\x88\x88\x88H\x00\x80\x00\x00\x00\x00\x00\x00\b" updating replica 1 to 15.
  range r102:/Table/107/1/"seattle"/"UUUUUUD\x00\x80\x00\x00\x00\x00\x00\x00\x05" updating replica 4 to 16.
  range r89:/Table/107/2 updating replica 1 to 15.
  range r126:/Table/108/1/"amsterdam"/"\xc5\x1e\xb8Q\xeb\x85@\x00\x80\x00\x00\x00\x00\x00\x01\x81" updating replica 2 to 16.
  range r104:/Table/108/1/"los angeles"/"\xa8\xf5\u008f\\(H\x00\x80\x00\x00\x00\x00\x00\x01J" updating replica 3 to 16.
  range r119:/Table/108/1/"san francisco"/"\x8c\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x01\x13" updating replica 6 to 18.
  range r117:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica 4 to 17.
  range r155:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica 3 to 15.
  range r82:/Table/108/3 updating replica 1 to 15.

Nodes n4, n5 will be marked as decommissioned.


Proceed with staging plan [y/N] y

Plan staged. To complete recovery restart nodes n1, n2, n3.

To verify recovery status invoke

'cockroach debug recover verify  --host=127.0.0.1:26257 --insecure=true recover-plan.json'

And it allows overwriting currently staged plans if need be:

$ cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true recover-plan-2.json
Proposed changes in plan 576f3d2e-518c-4dbc-9af4-b416629bbf1a:
  range r93:/Table/106/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\n" updating replica 2 to 16.
  range r92:/Table/106/1/"los angeles"/"\x99\x99\x99\x99\x99\x99H\x00\x80\x00\x00\x00\x00\x00\x00\x1e" updating replica 2 to 16.
  range r91:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica 2 to 16.
  range r115:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica 2 to 15.
  range r80:/Table/107 updating replica 1 to 15.
  range r96:/Table/107/1/"san francisco"/"\x88\x88\x88\x88\x88\x88H\x00\x80\x00\x00\x00\x00\x00\x00\b" updating replica 1 to 15.
  range r102:/Table/107/1/"seattle"/"UUUUUUD\x00\x80\x00\x00\x00\x00\x00\x00\x05" updating replica 4 to 16.
  range r89:/Table/107/2 updating replica 1 to 15.
  range r126:/Table/108/1/"amsterdam"/"\xc5\x1e\xb8Q\xeb\x85@\x00\x80\x00\x00\x00\x00\x00\x01\x81" updating replica 2 to 16.
  range r104:/Table/108/1/"los angeles"/"\xa8\xf5\u008f\\(H\x00\x80\x00\x00\x00\x00\x00\x01J" updating replica 3 to 16.
  range r119:/Table/108/1/"san francisco"/"\x8c\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x01\x13" updating replica 6 to 18.
  range r117:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica 4 to 17.
  range r155:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica 3 to 15.
  range r82:/Table/108/3 updating replica 1 to 15.

Nodes n4, n5 will be marked as decommissioned.

Conflicting staged plans will be replaced:
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n1.
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n3.
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n2.


Proceed with staging plan [y/N] y

Plan staged. To complete recovery restart nodes n1, n2, n3.

To verify recovery status invoke

'cockroach debug recover verify  --host=127.0.0.1:26257 --insecure=true recover-plan-2.json'


@aliher1911 force-pushed the loq_05online_stage_plans branch 5 times, most recently from a630832 to d2fb2db on January 18, 2023 11:40
return nil, err
}

log.Ops.Info(ctx, "checking loss of quorum recovery node status")

A reviewer (Contributor) commented:
Do we need to log all of these? Can we bump some of them to VEvent to reduce the log spam?


@aliher1911 (Contributor, Author) replied:
Those are non-material, so we can probably drop them. For staging I think it would be helpful to keep the logging.
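
For illustration, the suggested change might look like the fragment below, assuming the repository's pkg/util/log helpers (log.Ops.Info for operator-facing messages, log.VEventf for verbose diagnostic events). This is a sketch of the idea, not the merged code, and the message text is invented.

// Keep the operator-relevant staging message at Info on the OPS channel.
log.Ops.Info(ctx, "staging loss of quorum recovery plan")
// Demote the purely diagnostic check to a verbose event so it only appears
// at elevated verbosity, reducing log spam.
log.VEventf(ctx, 2, "checking loss of quorum recovery node status")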

Resolved review thread on pkg/kv/kvserver/loqrecovery/server.go (outdated).
}
}

// Distribute plan - this should not use fan out to available, but use

A reviewer (Contributor) commented:
If we shouldn't use fanout, why don't we dial them directly? It doesn't seem like we check for unexpected nodes here either, i.e. nodes that are not in foundNodes but that are found via visitNodes(), so a node could come back online after the verification above.


@aliher1911 (Contributor, Author) replied:

That's a fortunate or unfortunate leak of a comment from the first iteration. I was thinking that we could use the same set of nodes to distribute the plan, but it looks like we don't need to do that. We can do fan-out again, since at this point it is not reversible anyway and once we've started we should keep trying. The best thing we can do is report if the plan didn't reach some nodes that we wanted it to reach.
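
The reviewer's earlier point about unexpected nodes could, for illustration, be handled with a check along these lines. foundNodes here corresponds to the node set the plan was created from, and the function name is hypothetical, not the actual implementation.

package loqrecoverysketch

import "fmt"

// checkUnexpectedNodes flags nodes reached by the fan-out that were not part
// of the node set the plan was built from (a presumed-dead node coming back
// online after verification would show up here).
func checkUnexpectedNodes(foundNodes map[int32]bool, visitedNodes []int32) error {
	var unexpected []int32
	for _, id := range visitedNodes {
		if !foundNodes[id] {
			unexpected = append(unexpected, id)
		}
	}
	if len(unexpected) > 0 {
		return fmt.Errorf("nodes %v are live but not covered by the plan; consider regenerating the plan", unexpected)
	}
	return nil
}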

Resolved review thread on pkg/cli/debug_recover_loss_of_quorum.go (outdated).
@aliher1911 force-pushed the loq_05online_stage_plans branch 3 times, most recently from 27b03ca to b05fbcc on January 20, 2023 20:38
@aliher1911 marked this pull request as ready for review January 23, 2023 17:44
@aliher1911 requested review from a team as code owners January 23, 2023 17:44
@aliher1911 requested review from a team January 23, 2023 17:44
@aliher1911 force-pushed the loq_05online_stage_plans branch from b05fbcc to f9957a2 on January 23, 2023 18:21
@aliher1911 self-assigned this Jan 23, 2023
@aliher1911 (Contributor, Author) commented:

bors r=erikgrinaker

craig bot commented Jan 24, 2023

Build failed (retrying...):

craig bot commented Jan 24, 2023

Build succeeded:

craig bot merged commit c8d92f3 into cockroachdb:master Jan 24, 2023