
loqrecovery,server: apply staged loqrecovery plans on start #95582

Merged 2 commits on Feb 1, 2023

Conversation

@aliher1911 (Contributor) commented Jan 20, 2023

This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and then
applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that debug recovery apply-plan
performs on offline storage. It also updates the loss of quorum recovery
status stored in a store-local key.

Release note: None

Fixes #93121
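
At a high level, the startup hook described above could look roughly like the sketch below. All names here (recoveryPlan, store, maybeApplyStagedRecoveryPlan, and the apply callback) are illustrative stand-ins, not the PR's actual API; the real code works against the store engines and loqrecovery plan storage.

package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real plan and store types.
type recoveryPlan struct{ PlanID string }

type store struct {
	id         int
	stagedPlan *recoveryPlan // nil if no plan was staged on this store
	status     string        // stand-in for the store-local status key
}

// maybeApplyStagedRecoveryPlan mirrors, very loosely, what the server does on
// startup: for every local store that has a staged plan, apply it and record
// the outcome in a store-local status record.
func maybeApplyStagedRecoveryPlan(
	ctx context.Context, stores []*store, apply func(context.Context, *store, *recoveryPlan) error,
) {
	for _, s := range stores {
		if s.stagedPlan == nil {
			continue // nothing staged, normal startup
		}
		if err := apply(ctx, s, s.stagedPlan); err != nil {
			s.status = fmt.Sprintf("plan %s failed: %v", s.stagedPlan.PlanID, err)
			continue
		}
		s.status = fmt.Sprintf("plan %s applied", s.stagedPlan.PlanID)
	}
}

func main() {
	stores := []*store{
		{id: 1, stagedPlan: &recoveryPlan{PlanID: "plan-1"}},
		{id: 2}, // nothing staged here
	}
	maybeApplyStagedRecoveryPlan(context.Background(), stores,
		func(context.Context, *store, *recoveryPlan) error { return errors.New("example failure") })
	for _, s := range stores {
		fmt.Printf("s%d: %q\n", s.id, s.status)
	}
}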

@cockroach-teamcity (Member) commented

This change is Reviewable

@aliher1911 changed the title from "loqrecovery: add storage for staged plans" to "loqrecovery,server: apply staged loqrecovery plans on start" on Jan 20, 2023
@aliher1911 force-pushed the loq_05online_apply_plan branch 6 times, most recently from 0ef1ed8 to b4d9c47, on January 23, 2023 18:34
@aliher1911 self-assigned this on Jan 23, 2023
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from 2d4ed66 to 174daf5, on January 24, 2023 21:09
@aliher1911 marked this pull request as ready for review on January 24, 2023 21:10
@aliher1911 requested a review from a team as a code owner on January 24, 2023 21:10
@aliher1911 requested a review from a team on January 24, 2023 21:10
@aliher1911 requested a review from a team as a code owner on January 24, 2023 21:10
@erikgrinaker (Contributor) left a comment
Just dumping what I got through before some meetings. Will pick this back up later.

pkg/util/strutil/util.go (review thread resolved, outdated)
pkg/keys/constants.go (review thread resolved, outdated)
(gogoproto.nullable) = false,
(gogoproto.stdtime) = true];
// If most recent recovery plan application failed, Error will contain
// aggregated error messages containing all encountered errors.
Contributor:
Will there ever be more than one error here?

@aliher1911 (Contributor Author):
It is possible to have more than one commit error when we have multiple stores and a batch commit fails. At that point we collect all commit errors, since we can't roll back anymore.
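
For illustration, the aggregation described here could be done with something like the sketch below; applyToStores and the store IDs are hypothetical stand-ins, and errors.Join is just one way to combine the collected errors into a single message.

package main

import (
	"errors"
	"fmt"
)

// applyToStores commits the recovery batch on each store and keeps going
// after a failure, since earlier stores may already have committed and there
// is nothing left to roll back. All errors are combined into one.
func applyToStores(storeIDs []int, commit func(storeID int) error) error {
	var errs []error
	for _, id := range storeIDs {
		if err := commit(id); err != nil {
			errs = append(errs, fmt.Errorf("store s%d: %w", id, err))
		}
	}
	return errors.Join(errs...) // nil if nothing failed
}

func main() {
	err := applyToStores([]int{1, 2, 3}, func(id int) error {
		if id != 1 {
			return errors.New("batch commit failed")
		}
		return nil
	})
	fmt.Println(err)
}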

pkg/server/server.go (review thread resolved, outdated)
pkg/server/server.go (review thread resolved, outdated)
) {
var cleanup loqrecoverypb.DeferredRecoveryActions
err := stores.VisitStores(func(s *kvserver.Store) error {
c, found, err := loqrecovery.ConsumeCleanupActionsInfo(ctx, s.Engine())
Contributor:
Should we wait to remove the action until we've successfully decommissioned the nodes?

@aliher1911 (Contributor Author):
Currently it just tries once, relying on the fact that all nodes will attempt this, so one of them should succeed. That's not a strong guarantee. Alternatively, we could keep trying for some time, perhaps longer than the theoretical liveness range recovery time: e.g. the replication queue should pick up liveness within 10 minutes, then it would upreplicate (which could potentially be throttled by an indeterminate amount), and then we'd pick it up within seconds and update.
I don't like the idea of keeping this entry if we can't update liveness. It would stay there until the next restart, so you'd have this action silently "scheduled". It is idempotent now, so technically that's fine.

Contributor:
Discussed offline. We'll need to get these nodes decommissioned one way or another (otherwise they could rejoin the cluster via newly added nodes or prevent later cluster upgrades), so we should keep retrying with exponential backoff and make sure the actions are idempotent. We only remove the action when it's successfully applied. In the worst case, the operator can manually perform e.g. the decommission action to complete the job.
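
A minimal sketch of the retry loop agreed on here, assuming hypothetical decommission and consumeAction callbacks (the real implementation may differ): back off exponentially, and only remove the staged action once it has actually been applied.

package cleanup

import (
	"context"
	"time"
)

// retryCleanup keeps attempting the decommission action with exponential
// backoff and only consumes (removes) the staged action after it succeeds,
// so an unfinished decommission is never silently dropped.
func retryCleanup(ctx context.Context, decommission, consumeAction func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = time.Minute
	for {
		if err := decommission(ctx); err == nil {
			return consumeAction(ctx)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}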

pkg/server/loss_of_quorum.go (review thread resolved, outdated)
pkg/kv/kvserver/loqrecovery/record.go (review thread resolved, outdated)
This commit adds a helper to join multiple identifiers into a string,
which is useful in CLI commands.

Release note: None
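
For illustration, a helper of the kind this commit describes might look like the sketch below; the name JoinIDs and its signature are assumptions, not necessarily the exact API added in pkg/util/strutil.

package strutil

import (
	"strconv"
	"strings"
)

// JoinIDs joins numeric identifiers into a human-readable, comma-separated
// string with a common prefix, e.g. JoinIDs("n", []int{1, 2, 3}) == "n1, n2, n3".
func JoinIDs(prefix string, ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = prefix + strconv.Itoa(id)
	}
	return strings.Join(parts, ", ")
}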
@erikgrinaker (Contributor) left a comment

Reviewed the rest. All nits, except for the cleanup retries discussed above.

pkg/kv/kvserver/loqrecovery/apply.go (review thread resolved, outdated)
pkg/kv/kvserver/loqrecovery/record.go (review thread resolved, outdated)
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from 80a1cfd to 734f7a9, on January 30, 2023 20:35
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from efb2efd to 88f9746, on January 30, 2023 23:39
This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and then
applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that debug recovery apply-plan
performs on offline storage. It also updates the loss of quorum recovery
status stored in a store-local key.

Release note: None
@aliher1911 force-pushed the loq_05online_apply_plan branch from 88f9746 to 9c6b9e9, on January 31, 2023 10:43
return err
}

if err := planStore.RemovePlan(); err != nil {
Contributor:
Should we remove this before it's successfully applied? Is it better to retry on the next restart?

@aliher1911 (Contributor Author):
We already reported it as failed, though. If a replica was updated after we created the plan, we'd end up with a plan that can never apply and fails forever. I think it is safer to drop it and run another recovery if needed.
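
A minimal sketch of the ordering argued for here, with hypothetical names: the outcome is recorded first, and the staged plan is then removed whether or not application succeeded, so a plan that can no longer apply does not keep failing on every restart.

package loqrecovery

import "context"

// planStorage is a hypothetical stand-in for the staged-plan store.
type planStorage interface {
	WriteStatus(ctx context.Context, applyErr error) error
	RemovePlan() error
}

// applyAndClearPlan records the application result and then drops the staged
// plan regardless of success; if recovery is still needed, the operator
// stages a fresh plan and restarts again.
func applyAndClearPlan(ctx context.Context, ps planStorage, apply func(context.Context) error) error {
	applyErr := apply(ctx)
	if err := ps.WriteStatus(ctx, applyErr); err != nil {
		return err
	}
	if err := ps.RemovePlan(); err != nil {
		return err
	}
	return applyErr
}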

@aliher1911 (Contributor Author):
bors r=erikgrinaker

@craig (bot) commented Feb 1, 2023

Build succeeded:

Successfully merging this pull request may close these issues.

loqrecovery: apply staged plans on node restart