
loqrecovery,server: apply staged loqrecovery plans on start #95582

Merged 2 commits on Feb 1, 2023

Conversation

@aliher1911 (Contributor) commented Jan 20, 2023

This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and then
applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that debug recovery apply-plan
performs on offline storage. It also updates the loss of quorum recovery
status stored in a store-local key.

Release note: None

Fixes #93121
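
At a high level, the startup hook described above could look roughly like the sketch below. All names here (recoveryPlan, store, maybeApplyStagedRecoveryPlan, and the apply callback) are illustrative stand-ins, not the PR's actual API; the real code works against the store engines and loqrecovery plan storage.

package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real plan and store types.
type recoveryPlan struct{ PlanID string }

type store struct {
	id         int
	stagedPlan *recoveryPlan // nil if no plan was staged on this store
	status     string        // stand-in for the store-local status key
}

// maybeApplyStagedRecoveryPlan mirrors, very loosely, what the server does on
// startup: for every local store that has a staged plan, apply it and record
// the outcome in a store-local status record.
func maybeApplyStagedRecoveryPlan(
	ctx context.Context, stores []*store, apply func(context.Context, *store, *recoveryPlan) error,
) {
	for _, s := range stores {
		if s.stagedPlan == nil {
			continue // nothing staged, normal startup
		}
		if err := apply(ctx, s, s.stagedPlan); err != nil {
			s.status = fmt.Sprintf("plan %s failed: %v", s.stagedPlan.PlanID, err)
			continue
		}
		s.status = fmt.Sprintf("plan %s applied", s.stagedPlan.PlanID)
	}
}

func main() {
	stores := []*store{
		{id: 1, stagedPlan: &recoveryPlan{PlanID: "plan-1"}},
		{id: 2}, // nothing staged here
	}
	maybeApplyStagedRecoveryPlan(context.Background(), stores,
		func(context.Context, *store, *recoveryPlan) error { return errors.New("example failure") })
	for _, s := range stores {
		fmt.Printf("s%d: %q\n", s.id, s.status)
	}
}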

@cockroach-teamcity (Member) commented

This change is Reviewable

@aliher1911 changed the title from "loqrecovery: add storage for staged plans" to "loqrecovery,server: apply staged loqrecovery plans on start" on Jan 20, 2023
@aliher1911 force-pushed the loq_05online_apply_plan branch 6 times, most recently from 0ef1ed8 to b4d9c47, on January 23, 2023 18:34
@aliher1911 self-assigned this on Jan 23, 2023
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from 2d4ed66 to 174daf5, on January 24, 2023 21:09
@aliher1911 marked this pull request as ready for review on January 24, 2023 21:10
@aliher1911 requested a review from a team as a code owner on January 24, 2023 21:10
@aliher1911 requested a review from a team on January 24, 2023 21:10
@aliher1911 requested a review from a team as a code owner on January 24, 2023 21:10
@erikgrinaker (Contributor) left a comment
Just dumping what I got through before some meetings. Will pick this back up later.

pkg/util/strutil/util.go (review thread resolved, outdated)
pkg/keys/constants.go (review thread resolved, outdated)
(gogoproto.nullable) = false,
(gogoproto.stdtime) = true];
// If most recent recovery plan application failed, Error will contain
// aggregated error messages containing all encountered errors.
Contributor:
Will there ever be more than one error here?

@aliher1911 (Contributor Author):
It is possible to have more than one commit error when we have multiple stores and a batch commit fails. At that point we collect all commit errors, since we can't roll back anymore.
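
For illustration, the aggregation described here could be done with something like the sketch below; applyToStores and the store IDs are hypothetical stand-ins, and errors.Join is just one way to combine the collected errors into a single message.

package main

import (
	"errors"
	"fmt"
)

// applyToStores commits the recovery batch on each store and keeps going
// after a failure, since earlier stores may already have committed and there
// is nothing left to roll back. All errors are combined into one.
func applyToStores(storeIDs []int, commit func(storeID int) error) error {
	var errs []error
	for _, id := range storeIDs {
		if err := commit(id); err != nil {
			errs = append(errs, fmt.Errorf("store s%d: %w", id, err))
		}
	}
	return errors.Join(errs...) // nil if nothing failed
}

func main() {
	err := applyToStores([]int{1, 2, 3}, func(id int) error {
		if id != 1 {
			return errors.New("batch commit failed")
		}
		return nil
	})
	fmt.Println(err)
}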

pkg/server/server.go (review thread resolved, outdated)
pkg/server/server.go (review thread resolved, outdated)
) {
var cleanup loqrecoverypb.DeferredRecoveryActions
err := stores.VisitStores(func(s *kvserver.Store) error {
c, found, err := loqrecovery.ConsumeCleanupActionsInfo(ctx, s.Engine())
Contributor:
Should we wait to remove the action until we've successfully decommissioned the nodes?

@aliher1911 (Contributor Author):
Currently it just tries once, relying on the fact that all nodes will attempt this, so one of them should succeed. That's not a strong guarantee. Alternatively, we could keep trying for some time, perhaps longer than the theoretical liveness range recovery time: e.g. the replication queue should pick up liveness within 10 minutes, then it would upreplicate (which could potentially be throttled by an indeterminate amount), and then we'd pick it up within seconds and update.
I don't like the idea of keeping this entry if we can't update liveness. It would stay there until the next restart, so you'd have this action silently "scheduled". It is idempotent now, so technically that's fine.

Contributor:
Discussed offline. We'll need to get these nodes decommissioned one way or another (otherwise they could rejoin the cluster via newly added nodes or prevent later cluster upgrades), so we should keep retrying with exponential backoff and make sure the actions are idempotent. We only remove the action when it's successfully applied. In the worst case, the operator can manually perform e.g. the decommission action to complete the job.
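
A minimal sketch of the retry loop agreed on here, assuming hypothetical decommission and consumeAction callbacks (the real implementation may differ): back off exponentially, and only remove the staged action once it has actually been applied.

package cleanup

import (
	"context"
	"time"
)

// retryCleanup keeps attempting the decommission action with exponential
// backoff and only consumes (removes) the staged action after it succeeds,
// so an unfinished decommission is never silently dropped.
func retryCleanup(ctx context.Context, decommission, consumeAction func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = time.Minute
	for {
		if err := decommission(ctx); err == nil {
			return consumeAction(ctx)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}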

pkg/server/loss_of_quorum.go (review thread resolved, outdated)
pkg/kv/kvserver/loqrecovery/record.go (review thread resolved, outdated)
This commit adds a helper to join multiple identifiers into a string,
which is useful in CLI commands.

Release note: None
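
For illustration, a helper of the kind this commit describes might look like the sketch below; the name JoinIDs and its signature are assumptions, not necessarily the exact API added in pkg/util/strutil.

package strutil

import (
	"strconv"
	"strings"
)

// JoinIDs joins numeric identifiers into a human-readable, comma-separated
// string with a common prefix, e.g. JoinIDs("n", []int{1, 2, 3}) == "n1, n2, n3".
func JoinIDs(prefix string, ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = prefix + strconv.Itoa(id)
	}
	return strings.Join(parts, ", ")
}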
@erikgrinaker (Contributor) left a comment

Reviewed the rest. All nits, except for the cleanup retries discussed above.

pkg/kv/kvserver/loqrecovery/apply.go (review thread resolved, outdated)
pkg/kv/kvserver/loqrecovery/record.go (review thread resolved, outdated)
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from 80a1cfd to 734f7a9, on January 30, 2023 20:35
@aliher1911 force-pushed the loq_05online_apply_plan branch 3 times, most recently from efb2efd to 88f9746, on January 30, 2023 23:39
This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and then
applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that debug recovery apply-plan
performs on offline storage. It also updates the loss of quorum recovery
status stored in a store-local key.

Release note: None
@aliher1911 force-pushed the loq_05online_apply_plan branch from 88f9746 to 9c6b9e9, on January 31, 2023 10:43
return err
}

if err := planStore.RemovePlan(); err != nil {
Contributor:
Should we remove this before it's successfully applied? Is it better to retry on the next restart?

@aliher1911 (Contributor Author):
We already reported it as failed, though. If a replica was updated after we created the plan, we'd end up with a plan that can never apply and fails forever. I think it is safer to drop it and run another recovery if needed.
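
A minimal sketch of the ordering argued for here, with hypothetical names: the outcome is recorded first, and the staged plan is then removed whether or not application succeeded, so a plan that can no longer apply does not keep failing on every restart.

package loqrecovery

import "context"

// planStorage is a hypothetical stand-in for the staged-plan store.
type planStorage interface {
	WriteStatus(ctx context.Context, applyErr error) error
	RemovePlan() error
}

// applyAndClearPlan records the application result and then drops the staged
// plan regardless of success; if recovery is still needed, the operator
// stages a fresh plan and restarts again.
func applyAndClearPlan(ctx context.Context, ps planStorage, apply func(context.Context) error) error {
	applyErr := apply(ctx)
	if err := ps.WriteStatus(ctx, applyErr); err != nil {
		return err
	}
	if err := ps.RemovePlan(); err != nil {
		return err
	}
	return applyErr
}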

@aliher1911 (Contributor Author):
bors r=erikgrinaker

@craig (bot) commented Feb 1, 2023

Build succeeded:

Successfully merging this pull request may close these issues.

loqrecovery: apply staged plans on node restart