
clusterctl backup/restore #3441

Closed
moensch opened this issue Aug 3, 2020 · 13 comments · Fixed by #4808
Assignees
Labels
area/clusterctl Issues or PRs related to clusterctl kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@moensch

moensch commented Aug 3, 2020

Related slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1596471116438700

User Story

As an operator I would like to take backups of a workload cluster's CAPx resources on the management cluster in order to be able to restore this backup to a different management cluster in a disaster recovery scenario (total loss of management cluster).

Detailed Description

There is a lot of code in clusterctl move that ensures clusters are paused, objects are created in the correct order, and controller and owner references are set correctly.
All this exact same logic also applies to taking and restoring backups.

The idea would be to take a lot of code from /cmd/clusterctl/client/cluster/mover.go and /cmd/clusterctl/client/cluster/objectgraph.go, move some of it into a new library, and build backup and restore commands.

At the top level, I see the backup performing the following steps:

  1. Pause the Cluster
  2. Retrieve the UnstructuredList from a given namespace (same as mover.go)
  3. Dump this list to a JSON file on disk

The restore would:

  1. Read the UnstructuredList from a file on disk (the namespace can then be inferred from the objects in that list)
  2. Build the objectgraph
  3. Use the new equivalent of getMoveSequence to figure out in which order to restore.
  4. Restore the objects
  5. Un-pause the Cluster
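A minimal sketch of the restore ordering (step 3) follows. The hard-coded kind ranking is a toy stand-in for what an equivalent of getMoveSequence would compute by walking the object graph's owner references; actually applying the objects (step 4) and un-pausing the Cluster (step 5) would come after.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// kindRank is a toy stand-in for the dependency order the object graph would
// produce: owners must be restored before their dependents.
var kindRank = map[string]int{"Cluster": 0, "MachineDeployment": 1, "Machine": 2}

// restoreOrder returns the objects sorted into creation order.
func restoreOrder(objs []map[string]interface{}) []map[string]interface{} {
	sorted := append([]map[string]interface{}(nil), objs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ki, _ := sorted[i]["kind"].(string)
		kj, _ := sorted[j]["kind"].(string)
		return kindRank[ki] < kindRank[kj]
	})
	return sorted
}

func main() {
	// Step 1: read the dumped list (inlined here instead of a file on disk).
	dump := `[
	  {"kind": "Machine", "metadata": {"name": "m-0", "namespace": "default"}},
	  {"kind": "Cluster", "metadata": {"name": "my-cluster", "namespace": "default"}}
	]`
	var objs []map[string]interface{}
	if err := json.Unmarshal([]byte(dump), &objs); err != nil {
		panic(err)
	}
	for _, o := range restoreOrder(objs) {
		fmt.Println("would create:", o["kind"])
	}
	// Prints "would create: Cluster" before "would create: Machine".
}
```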

Anything else you would like to add:

Depending on how this code ends up structured, this could become a new public package which could be imported by something like a Velero plugin. This would make Velero inherently aware of CAPx without duplicating too much code.

/kind feature
/area clusterctl

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/clusterctl Issues or PRs related to clusterctl labels Aug 3, 2020
@ncdc ncdc added this to the Next milestone Aug 3, 2020
@ncdc
Contributor

ncdc commented Aug 3, 2020

cc @nrb @carlisia @ashish-amarnath

@jichenjc
Contributor

/assign

I can take a look at this

@jichenjc
Contributor

So basically, we would create a JSON file containing all the information needed to perform the move action. But the move action only happens from the bootstrap cluster to the workload cluster, while the desired use case for backup/restore applies both to the bootstrap cluster (before move) and to the workload cluster (after move), correct?

@fabriziopandini
Member

It is not clear to me whether we are going to implement two new top-level commands, or expose backup and restore as move options, e.g.

clusterctl move --to-file (backup)
clusterctl move --from-file (restore)

However, I would break the implementation down into two logical parts.

  • The easiest part to implement is backup, which is similar to a dry-run except that it dumps all the resources to a file.
  • Restore instead is more complex, because you have to rebuild the object graph from a file before triggering the move logic.

Also, given that the target scenario is recovery from a disaster, I think the pause/unpause logic should not be triggered.
Definitely +1 to getting this exposed as a library func.

@vincepri
Member

We should probably first figure out the plan for move #3354

@jichenjc
Contributor

jichenjc commented Sep 25, 2020

ok, I will wait for #3354 before working on this, thanks for the reminder @vincepri @fabriziopandini
or do you think clusterctl move --to-file (backup) can be implemented anyway?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2021
@ashish-amarnath
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2021
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 30, 2021
@jpmcb
Contributor

jpmcb commented May 14, 2021

Hi all - @dvonthenen and I have started looking into this and will hopefully have something to contribute back soon.

Our approach would be similar to what's been discussed before.

  • In the Save method, we dump resources to file from the cluster.
  • In the Restore method, we would accept a glob to get the files with saved objects, create unstructured.Unstructured objects from the files using the clusterctl yaml package, and then apply those to the cluster.

@fabriziopandini
Member

fabriziopandini commented May 16, 2021

@jpmcb thanks for the update.
While working on this, let's make sure we define exactly which use cases we aim to support and what the boundaries of this feature are, because personally I don't see clusterctl ever adding fancy backup features like scheduled backups, backup to cloud storage, etc.

Last but not least, we should consider that move is now going to include global resources (#3042 (comment)); this could make backup/restore trickier given the namespaced nature of move.
