loqrecovery,server: apply staged loqrecovery plans on start #95582
Conversation
Force-pushed from 0ef1ed8 to b4d9c47
Force-pushed from 2d4ed66 to 174daf5
Just dumping what I got through before some meetings. Will pick this back up later.
(gogoproto.nullable) = false,
(gogoproto.stdtime) = true];
// If most recent recovery plan application failed, Error will contain
// aggregated error messages containing all encountered errors.
Will there ever be more than one error here?
It is possible to have more than one commit error in case we have multiple stores and the batch commit fails. At this point we collect all commit errors since we can't roll back anymore.
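For illustration, here is a minimal Go sketch of the "collect all commit errors" behaviour described above. The names (storeBatch, commitAll) are hypothetical and not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// storeBatch is a hypothetical stand-in for one store's staged recovery batch.
type storeBatch struct {
	storeID int
	commit  func() error // commits the staged recovery changes for this store
}

// commitAll attempts every store's commit and, since a failed batch commit
// cannot be rolled back, keeps going and aggregates all failures into one error.
func commitAll(batches []storeBatch) error {
	var errs []error
	for _, b := range batches {
		if err := b.commit(); err != nil {
			errs = append(errs, fmt.Errorf("store %d: %w", b.storeID, err))
		}
	}
	return errors.Join(errs...) // nil when every commit succeeded
}

func main() {
	err := commitAll([]storeBatch{
		{storeID: 1, commit: func() error { return nil }},
		{storeID: 2, commit: func() error { return errors.New("disk full") }},
	})
	fmt.Println(err) // store 2: disk full
}
```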
pkg/server/loss_of_quorum.go (outdated)
) {
var cleanup loqrecoverypb.DeferredRecoveryActions
err := stores.VisitStores(func(s *kvserver.Store) error {
c, found, err := loqrecovery.ConsumeCleanupActionsInfo(ctx, s.Engine())
Should we wait to remove the action until we've successfully decommissioned the nodes?
Currently it just tries once and relies on the fact that all nodes will try to do this, so one will succeed. Not a strong guarantee. Alternatively, we could keep trying for some time, maybe longer than the theoretical liveness range recovery time: e.g. the replication queue should pick up liveness within 10 min, then it would upreplicate (which could potentially be throttled by an indeterminate amount), then we'd pick it up within seconds and update.
I don't like the idea of keeping this entry if we can't update liveness. It would stay there until the next restart, so you'd have this action silently "scheduled". It is idempotent now, so technically fine.
Discussed offline. We'll need to get these nodes decommissioned one way or another (otherwise they could rejoin the cluster via newly added nodes or prevent later cluster upgrades), so we should keep retrying with exponential backoff and make sure the actions are idempotent. We only remove the action when it's successfully applied. In the worst case, the operator can manually perform e.g. the decommission action to complete the job.
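As a rough sketch of the retry approach agreed on above (exponential backoff around an idempotent cleanup action), assuming a hypothetical retryCleanup helper and illustrative backoff bounds rather than the PR's actual implementation:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryCleanup keeps invoking an idempotent cleanup action (e.g. decommission
// the recovered-away nodes) with exponential backoff, capped at maxBackoff,
// until it succeeds or the context is canceled. The staged action would only
// be removed after a nil return.
func retryCleanup(ctx context.Context, cleanup func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = time.Minute
	for {
		if err := cleanup(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	err := retryCleanup(context.Background(), func(context.Context) error {
		if attempts++; attempts < 3 {
			return errors.New("liveness not reachable yet")
		}
		return nil
	})
	fmt.Println(err, "after", attempts, "attempts") // <nil> after 3 attempts
}
```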
This commit adds a helper to join multiple identifiers into a string, which is helpful in CLI commands. Release note: None
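A small Go sketch of the kind of identifier-joining helper this commit describes, here rendering node IDs for CLI output; the actual helper in the PR may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// joinIDs renders a list of numeric identifiers as "n1, n2, n3".
func joinIDs(ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = "n" + strconv.Itoa(id)
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(joinIDs([]int{1, 4, 7})) // n1, n4, n7
}
```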
Reviewed the rest. All nits, except for the cleanup retries discussed above.
Force-pushed from 80a1cfd to 734f7a9
Force-pushed from efb2efd to 88f9746
This commit adds loss of quorum recovery plan application functionality to server startup. Plans are staged during the application phase and then applied when a rolling restart of the affected nodes is performed. Plan application performs the same actions that are performed by debug recovery apply-plan on offline storage. It also updates the loss of quorum recovery status stored in a store-local key. Release note: None
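As a rough outline of the startup flow this commit describes, here is a hedged Go sketch; the types and functions (stagedPlan, loadStagedPlan, applyToStore, writeRecoveryStatus) are hypothetical stand-ins, not the PR's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the real plan and store types.
type stagedPlan struct{ id string }
type store struct{ id int }

func loadStagedPlan() (*stagedPlan, bool) { return &stagedPlan{id: "plan-1"}, true }

func applyToStore(ctx context.Context, s store, p *stagedPlan) error { return nil }

// writeRecoveryStatus persists the outcome (success or failure) under the
// store's local recovery-status key so it can be inspected later.
func writeRecoveryStatus(ctx context.Context, s store, applyErr error) {}

// maybeApplyStagedPlan runs at node startup: if a recovery plan was staged
// before the rolling restart, apply it to every local store, record the
// outcome per store, and return the first error encountered.
func maybeApplyStagedPlan(ctx context.Context, stores []store) error {
	plan, ok := loadStagedPlan()
	if !ok {
		return nil // nothing staged, normal startup
	}
	var firstErr error
	for _, s := range stores {
		err := applyToStore(ctx, s, plan)
		writeRecoveryStatus(ctx, s, err)
		if err != nil && firstErr == nil {
			firstErr = fmt.Errorf("store %d: %w", s.id, err)
		}
	}
	return firstErr
}

func main() {
	fmt.Println(maybeApplyStagedPlan(context.Background(), []store{{id: 1}, {id: 2}}))
}
```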
Force-pushed from 88f9746 to 9c6b9e9
return err
}

if err := planStore.RemovePlan(); err != nil {
Should we remove this before it's successfully applied? Is it better to retry on the next restart?
We already reported it as failed though. If we have a replica that was updated after we created the plan, then we'll end up with a plan that will fail forever, never able to apply. I think it is safer to drop it and run another recovery if needed.
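A minimal sketch of the "report the failure, then drop the plan" behaviour defended here; planStore.RemovePlan mirrors the call in the diff above, while the other names (reportStatus, finishPlanApplication) are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the on-disk plan store.
type planStore struct{}

func (planStore) RemovePlan() error { return nil }

// reportStatus stands in for persisting the success/failure report that
// operators see; the failure has already been surfaced before the plan is dropped.
func reportStatus(applyErr error) {}

// finishPlanApplication records the outcome and then removes the staged plan
// even if application failed: a plan created before a replica advanced would
// otherwise fail again on every restart, so it is safer to drop it and stage
// a fresh recovery if needed.
func finishPlanApplication(ps planStore, applyErr error) error {
	reportStatus(applyErr)
	if err := ps.RemovePlan(); err != nil {
		return fmt.Errorf("failed to remove staged plan: %w", err)
	}
	return applyErr
}

func main() {
	fmt.Println(finishPlanApplication(planStore{}, errors.New("replica too new")))
}
```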
bors r=erikgrinaker
Build succeeded:
This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and
then applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that are performed by debug
recovery apply-plan on offline storage. It also updates the loss of
quorum recovery status stored in a store-local key.
Release note: None
Fixes #93121