loqrecovery,admin,cli: check staged plans, stage recovery plan on cluster #95405
a630832 to d2fb2db
pkg/server/admin.go
Outdated
		return nil, err
	}

	log.Ops.Info(ctx, "checking loss of quorum recovery node status")
Do we need to log all of these? Can we bump some of them to VEvent to reduce the log spam?
Those are non-material, so we can probably drop them. For staging, I think they would be helpful to keep.
		}
	}

	// Distribute plan - this should not use fan out to available, but use
If we shouldn't use fanout, why don't we dial them directly? It doesn't seem like we check for unexpected nodes here either, i.e. nodes that are not in foundNodes but that are found via visitNodes(), so a node could come back online after the verification above.
That's a fortunate or unfortunate leak of a comment from the first iteration. I was thinking that we could use the same set of nodes to distribute the plan, but it looks like we don't need to do that. We can do the fan-out again, since at this point it is not reversible anyway, and once we have started we should keep trying. The best thing we could do is report if the plan didn't reach some of the nodes that we wanted it to reach.
27b03ca to b05fbcc
b05fbcc to f9957a2
bors r=erikgrinaker
Build failed (retrying...):
Build succeeded:
loqrecovery,admin,cli: stage recovery plan on cluster
This commit adds loss of quorum recovery plan staging on nodes.
The RecoveryStagePlan admin call distributes the recovery plan to
the relevant nodes of the cluster. To do so, it first verifies that
the cluster state is unchanged from the state in which the plan was
created and that there are no previously staged plans.
It then distributes the plan to all cluster nodes using the fan-out
mechanism. Each node in turn marks dead nodes as decommissioned and,
if there are planned changes for the node, saves the plan in its
local store.
The admin call is backed by the debug recover apply-plan command when
the --host flag is used to work in half-online mode.
Release note: None
loqrecovery,admin: implement endpoint to check staged plans
This commit adds a loss of quorum recovery verify call to the admin
interface. The call allows querying loss of quorum recovery status
from all nodes of the cluster. It provides info about loss of
quorum recovery plans staged on each node.
Release note: None
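
One way to picture what a caller might do with the per-node results of the verify call is to group nodes by the plan staged on them. The sketch below assumes a hypothetical `NodeRecoveryStatus` shape and a hypothetical `summarizeStaged` helper; the actual response type is not shown in this PR text.

```go
package main

import "sort"

// NodeID is a hypothetical stand-in for a cluster node identifier.
type NodeID int

// NodeRecoveryStatus is a hypothetical per-node result of the verify call.
type NodeRecoveryStatus struct {
	NodeID        NodeID
	PendingPlanID string // empty when no plan is staged on the node
}

// summarizeStaged groups nodes by the ID of the plan staged on them.
// Nodes without a staged plan are omitted. Node lists are sorted.
func summarizeStaged(statuses []NodeRecoveryStatus) map[string][]NodeID {
	out := make(map[string][]NodeID)
	for _, s := range statuses {
		if s.PendingPlanID == "" {
			continue
		}
		out[s.PendingPlanID] = append(out[s.PendingPlanID], s.NodeID)
	}
	for _, ids := range out {
		sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	}
	return out
}
```

An empty result means no node has a staged plan, and more than one key would indicate that different plans are staged on different nodes.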
The state checkpoint is included in this PR as it only provides part of the state functionality needed for the stage phase to work.
Fixes #93044
Fixes #74135
Touches #93043
When staging, the CLI presents the following reports in the happy case:
It also allows overwriting currently staged plans if need be: