loqrecovery,admin,cli: check staged plans, stage recovery plan on cluster #95405

Merged: 2 commits merged into cockroachdb:master from the loq_05online_stage_plans branch on Jan 24, 2023

Conversation

@aliher1911 (Contributor) commented Jan 17, 2023

loqrecovery,admin,cli: stage recovery plan on cluster

This commit adds loss of quorum recovery plan staging on nodes.
The RecoveryStagePlan admin call distributes a recovery plan to the
relevant nodes of the cluster. To do so, it first verifies that the
cluster state is unchanged from the state in which the plan was
created and that there are no previously staged plans. It then
distributes the plan to all cluster nodes using a fan-out mechanism.
Each node in turn marks dead nodes as decommissioned and, if the plan
contains changes for that node, saves the plan in its local store.
The admin call is used by the debug recover apply-plan command when
the --host flag is passed to operate in half-online mode.

Release note: None
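
To make the staging flow concrete, here is a minimal Go sketch of the two-phase behaviour the commit message describes: a verification fan-out followed by a distribution fan-out. Every name below (Plan, nodeClient, stagePlan) is a hypothetical stand-in rather than the real loqrecovery API, which lives in pkg/kv/kvserver/loqrecovery.

package loqrecoverysketch

import (
	"context"
	"fmt"
)

// Plan is a simplified stand-in for a recovery plan: its ID, the nodes that
// must stage replica changes, and the dead nodes to mark as decommissioned.
type Plan struct {
	ID                  string
	UpdatedNodes        []int32
	DecommissionedNodes []int32
}

// nodeClient is a stand-in for the per-node admin RPC surface used by the fan-out.
type nodeClient interface {
	NodeID() int32
	StagedPlanID(ctx context.Context) (id string, ok bool, err error)
	StagePlan(ctx context.Context, p Plan, decommission []int32) error
}

// stagePlan mirrors the RecoveryStagePlan flow: verify that no node already
// holds a conflicting staged plan (the real call also verifies that the
// cluster state matches the state the plan was created from), then fan the
// plan out so each node marks dead nodes as decommissioned and, if the plan
// contains changes for it, persists the plan in its local store.
func stagePlan(ctx context.Context, nodes []nodeClient, p Plan, force bool) error {
	// Phase 1: verification fan-out. Abort on a conflicting staged plan
	// unless the caller explicitly asked to overwrite it.
	for _, n := range nodes {
		id, staged, err := n.StagedPlanID(ctx)
		if err != nil {
			return err
		}
		if staged && id != p.ID && !force {
			return fmt.Errorf("plan %s is already staged on node n%d", id, n.NodeID())
		}
	}
	// Phase 2: distribution fan-out. Once started this is not reversible,
	// so keep going and report any nodes the plan failed to reach.
	var unreached []int32
	for _, n := range nodes {
		if err := n.StagePlan(ctx, p, p.DecommissionedNodes); err != nil {
			unreached = append(unreached, n.NodeID())
		}
	}
	if len(unreached) > 0 {
		return fmt.Errorf("plan %s did not reach nodes %v", p.ID, unreached)
	}
	return nil
}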


loqrecovery,admin: implement endpoint to check staged plans

This commit adds a loss of quorum recovery verify call to the admin
interface. The call allows querying loss of quorum recovery status
from all nodes of the cluster, providing information about the loss
of quorum recovery plans staged on each node.

Release note: None
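
As a companion, here is a similarly hedged sketch of what the verify call collects: the plan, if any, staged on each reachable node, which is what the CLI later prints as "plan ... is staged on node nX". Again, statusClient and collectStagedPlans are illustrative names, not the actual admin API.

package loqrecoverysketch

import "context"

// statusClient is a stand-in for the per-node status query used by the
// verify fan-out.
type statusClient interface {
	NodeID() int32
	StagedPlanID(ctx context.Context) (id string, ok bool, err error)
}

// NodePlanStatus pairs a node with the ID of the recovery plan staged on it.
type NodePlanStatus struct {
	NodeID int32
	PlanID string
}

// collectStagedPlans queries every reachable node for its staged recovery
// plan so callers can summarize recovery progress across the cluster.
func collectStagedPlans(ctx context.Context, nodes []statusClient) ([]NodePlanStatus, error) {
	var out []NodePlanStatus
	for _, n := range nodes {
		id, ok, err := n.StagedPlanID(ctx)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, NodePlanStatus{NodeID: n.NodeID(), PlanID: id})
		}
	}
	return out, nil
}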


The state checkpoint is included in the PR, as it only provides the partial functionality of the state needed for the stage phase to work.

Fixes #93044
Fixes #74135
Touches #93043

When staging, the CLI presents the following report in the happy case:

$ cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true recover-plan.json
Proposed changes in plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4:
  range r93:/Table/106/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\n" updating replica 2 to 16.
  range r92:/Table/106/1/"los angeles"/"\x99\x99\x99\x99\x99\x99H\x00\x80\x00\x00\x00\x00\x00\x00\x1e" updating replica 2 to 16.
  range r91:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica 2 to 16.
  range r115:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica 2 to 15.
  range r80:/Table/107 updating replica 1 to 15.
  range r96:/Table/107/1/"san francisco"/"\x88\x88\x88\x88\x88\x88H\x00\x80\x00\x00\x00\x00\x00\x00\b" updating replica 1 to 15.
  range r102:/Table/107/1/"seattle"/"UUUUUUD\x00\x80\x00\x00\x00\x00\x00\x00\x05" updating replica 4 to 16.
  range r89:/Table/107/2 updating replica 1 to 15.
  range r126:/Table/108/1/"amsterdam"/"\xc5\x1e\xb8Q\xeb\x85@\x00\x80\x00\x00\x00\x00\x00\x01\x81" updating replica 2 to 16.
  range r104:/Table/108/1/"los angeles"/"\xa8\xf5\u008f\\(H\x00\x80\x00\x00\x00\x00\x00\x01J" updating replica 3 to 16.
  range r119:/Table/108/1/"san francisco"/"\x8c\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x01\x13" updating replica 6 to 18.
  range r117:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica 4 to 17.
  range r155:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica 3 to 15.
  range r82:/Table/108/3 updating replica 1 to 15.

Nodes n4, n5 will be marked as decommissioned.


Proceed with staging plan [y/N] y

Plan staged. To complete recovery restart nodes n1, n2, n3.

To verify recovery status invoke

'cockroach debug recover verify  --host=127.0.0.1:26257 --insecure=true recover-plan.json'

And it allows overwriting currently staged plans if need be:

$ cockroach debug recover apply-plan  --host=127.0.0.1:26257 --insecure=true recover-plan-2.json
Proposed changes in plan 576f3d2e-518c-4dbc-9af4-b416629bbf1a:
  range r93:/Table/106/1/"boston"/"333333D\x00\x80\x00\x00\x00\x00\x00\x00\n" updating replica 2 to 16.
  range r92:/Table/106/1/"los angeles"/"\x99\x99\x99\x99\x99\x99H\x00\x80\x00\x00\x00\x00\x00\x00\x1e" updating replica 2 to 16.
  range r91:/Table/106/1/"seattle"/"ffffffH\x00\x80\x00\x00\x00\x00\x00\x00\x14" updating replica 2 to 16.
  range r115:/Table/106/1/"washington dc"/"L\xcc\xcc\xcc\xcc\xccL\x00\x80\x00\x00\x00\x00\x00\x00\x0f" updating replica 2 to 15.
  range r80:/Table/107 updating replica 1 to 15.
  range r96:/Table/107/1/"san francisco"/"\x88\x88\x88\x88\x88\x88H\x00\x80\x00\x00\x00\x00\x00\x00\b" updating replica 1 to 15.
  range r102:/Table/107/1/"seattle"/"UUUUUUD\x00\x80\x00\x00\x00\x00\x00\x00\x05" updating replica 4 to 16.
  range r89:/Table/107/2 updating replica 1 to 15.
  range r126:/Table/108/1/"amsterdam"/"\xc5\x1e\xb8Q\xeb\x85@\x00\x80\x00\x00\x00\x00\x00\x01\x81" updating replica 2 to 16.
  range r104:/Table/108/1/"los angeles"/"\xa8\xf5\u008f\\(H\x00\x80\x00\x00\x00\x00\x00\x01J" updating replica 3 to 16.
  range r119:/Table/108/1/"san francisco"/"\x8c\xcc\xcc\xcc\xcc\xcc@\x00\x80\x00\x00\x00\x00\x00\x01\x13" updating replica 6 to 18.
  range r117:/Table/108/1/"seattle"/"p\xa3\xd7\n=pD\x00\x80\x00\x00\x00\x00\x00\x00\xdc" updating replica 4 to 17.
  range r155:/Table/108/1/"washington dc"/"Tz\xe1G\xae\x14L\x00\x80\x00\x00\x00\x00\x00\x00\xa5" updating replica 3 to 15.
  range r82:/Table/108/3 updating replica 1 to 15.

Nodes n4, n5 will be marked as decommissioned.

Conflicting staged plans will be replaced:
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n1.
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n3.
  plan 66200d2c-e0e1-4af4-b890-ef5bb6e9ccc4 is staged on node n2.


Proceed with staging plan [y/N] y

Plan staged. To complete recovery restart nodes n1, n2, n3.

To verify recovery status invoke

'cockroach debug recover verify  --host=127.0.0.1:26257 --insecure=true recover-plan-2.json'


@aliher1911 force-pushed the loq_05online_stage_plans branch 5 times, most recently from a630832 to d2fb2db on January 18, 2023 11:40
return nil, err
}

log.Ops.Info(ctx, "checking loss of quorum recovery node status")

A reviewer (Contributor) commented:
Do we need to log all of these? Can we bump some of them to VEvent to reduce the log spam?


@aliher1911 (Contributor, Author) replied:
Those are non-material, so we can probably drop them. For staging I think it would be helpful to keep the logging.
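
For illustration, the suggested change might look like the fragment below, assuming the repository's pkg/util/log helpers (log.Ops.Info for operator-facing messages, log.VEventf for verbose diagnostic events). This is a sketch of the idea, not the merged code, and the message text is invented.

// Keep the operator-relevant staging message at Info on the OPS channel.
log.Ops.Info(ctx, "staging loss of quorum recovery plan")
// Demote the purely diagnostic check to a verbose event so it only appears
// at elevated verbosity, reducing log spam.
log.VEventf(ctx, 2, "checking loss of quorum recovery node status")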

Resolved review thread on pkg/kv/kvserver/loqrecovery/server.go (outdated).
}
}

// Distribute plan - this should not use fan out to available, but use

A reviewer (Contributor) commented:
If we shouldn't use fanout, why don't we dial them directly? It doesn't seem like we check for unexpected nodes here either, i.e. nodes that are not in foundNodes but that are found via visitNodes(), so a node could come back online after the verification above.


@aliher1911 (Contributor, Author) replied:

That's a fortunate or unfortunate leak of a comment from the first iteration. I was thinking that we could use the same set of nodes to distribute the plan, but it looks like we don't need to do that. We can do fan-out again, since at this point it is not reversible anyway and once we've started we should keep trying. The best thing we can do is report if the plan didn't reach some nodes that we wanted it to reach.
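
The reviewer's earlier point about unexpected nodes could, for illustration, be handled with a check along these lines. foundNodes here corresponds to the node set the plan was created from, and the function name is hypothetical, not the actual implementation.

package loqrecoverysketch

import "fmt"

// checkUnexpectedNodes flags nodes reached by the fan-out that were not part
// of the node set the plan was built from (a presumed-dead node coming back
// online after verification would show up here).
func checkUnexpectedNodes(foundNodes map[int32]bool, visitedNodes []int32) error {
	var unexpected []int32
	for _, id := range visitedNodes {
		if !foundNodes[id] {
			unexpected = append(unexpected, id)
		}
	}
	if len(unexpected) > 0 {
		return fmt.Errorf("nodes %v are live but not covered by the plan; consider regenerating the plan", unexpected)
	}
	return nil
}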

Resolved review thread on pkg/cli/debug_recover_loss_of_quorum.go (outdated).
@aliher1911 force-pushed the loq_05online_stage_plans branch 3 times, most recently from 27b03ca to b05fbcc on January 20, 2023 20:38
@aliher1911 marked this pull request as ready for review January 23, 2023 17:44
@aliher1911 requested review from a team as code owners January 23, 2023 17:44
@aliher1911 requested review from a team January 23, 2023 17:44
@aliher1911 force-pushed the loq_05online_stage_plans branch from b05fbcc to f9957a2 on January 23, 2023 18:21
@aliher1911 self-assigned this Jan 23, 2023
@aliher1911 (Contributor, Author) commented:

bors r=erikgrinaker

craig bot commented Jan 24, 2023

Build failed (retrying...):

craig bot commented Jan 24, 2023

Build succeeded:

craig bot merged commit c8d92f3 into cockroachdb:master Jan 24, 2023