
clusterctl backup/restore #3441

Closed
moensch opened this issue Aug 3, 2020 · 13 comments · Fixed by #4808
Assignees
Labels
area/clusterctl Issues or PRs related to clusterctl kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@moensch

moensch commented Aug 3, 2020

Related slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1596471116438700

User Story

As an operator I would like to take backups of a workload cluster's CAPx resources on the management cluster in order to be able to restore this backup to a different management cluster in a disaster recovery scenario (total loss of management cluster).

Detailed Description

There is a lot of code in clusterctl move that ensures clusters are paused, objects are created in the correct order, and controller and owner references are set correctly.
All this exact same logic also applies to taking and restoring backups.

The idea would be to take a lot of code from /cmd/clusterctl/client/cluster/mover.go and /cmd/clusterctl/client/cluster/objectgraph.go, move some of it into a new library, and build backup and restore commands.

At the top level, I see the backup performing the following steps:

  1. Pause the Cluster
  2. Retrieve the UnstructuredList from a given namespace (same as mover.go)
  3. Dump this list to a JSON file on disk

The restore would:

  1. Read the UnstructuredList from a file on disk (the namespace can then be inferred from the objects in that list)
  2. Build the objectgraph
  3. Use the new equivalent of getMoveSequence to figure out in which order to restore.
  4. Restore the objects
  5. Un-pause the Cluster
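A minimal sketch of the restore ordering (step 3) follows. The hard-coded kind ranking is a toy stand-in for what an equivalent of getMoveSequence would compute by walking the object graph's owner references; actually applying the objects (step 4) and un-pausing the Cluster (step 5) would come after.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// kindRank is a toy stand-in for the dependency order the object graph would
// produce: owners must be restored before their dependents.
var kindRank = map[string]int{"Cluster": 0, "MachineDeployment": 1, "Machine": 2}

// restoreOrder returns the objects sorted into creation order.
func restoreOrder(objs []map[string]interface{}) []map[string]interface{} {
	sorted := append([]map[string]interface{}(nil), objs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ki, _ := sorted[i]["kind"].(string)
		kj, _ := sorted[j]["kind"].(string)
		return kindRank[ki] < kindRank[kj]
	})
	return sorted
}

func main() {
	// Step 1: read the dumped list (inlined here instead of a file on disk).
	dump := `[
	  {"kind": "Machine", "metadata": {"name": "m-0", "namespace": "default"}},
	  {"kind": "Cluster", "metadata": {"name": "my-cluster", "namespace": "default"}}
	]`
	var objs []map[string]interface{}
	if err := json.Unmarshal([]byte(dump), &objs); err != nil {
		panic(err)
	}
	for _, o := range restoreOrder(objs) {
		fmt.Println("would create:", o["kind"])
	}
	// Prints "would create: Cluster" before "would create: Machine".
}
```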

Anything else you would like to add:

Depending on how this code ends up structured, this could become a new public package which could be imported by something like a Velero plugin. This would make Velero inherently aware of CAPx without duplicating too much code.

/kind feature
/area clusterctl

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/clusterctl Issues or PRs related to clusterctl labels Aug 3, 2020
@ncdc ncdc added this to the Next milestone Aug 3, 2020
@ncdc
Contributor

ncdc commented Aug 3, 2020

cc @nrb @carlisia @ashish-amarnath

@jichenjc
Contributor

/assign

I can take a look at this

@jichenjc
Contributor

So basically, we would create a JSON file containing all the information needed to perform the move action. But the move action only happens from the bootstrap cluster to the workload cluster, while the desired use case for backup/restore applies both to the bootstrap cluster (before move) and to the workload cluster (after move), correct?

@fabriziopandini
Member

It is not clear to me whether we are going to implement two new top-level commands, or expose backup and restore as move options, e.g.

clusterctl move --to-file (backup)
clusterctl move --from-file (restore)

However, I would break the implementation down into two logical parts.

  • The easiest part to implement is backup, which is similar to a dry-run except that it dumps all the resources to a file.
  • Restore instead is more complex, because you have to rebuild the object graph from a file before triggering the move logic.

Also, given that the target scenario is recovery from a disaster, I think the pause/unpause logic should not be triggered.
Definitely +1 to getting this exposed as a library func.

@vincepri
Member

We should probably first figure out the plan for move #3354

@jichenjc
Contributor

jichenjc commented Sep 25, 2020

ok, I will wait for #3354 before working on this, thanks for the reminder @vincepri @fabriziopandini
or do you think clusterctl move --to-file (backup) can be implemented anyway?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2021
@ashish-amarnath
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2021
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 30, 2021
@jpmcb
Contributor

jpmcb commented May 14, 2021

Hi all - @dvonthenen and I have started looking into this and will hopefully have something to contribute back soon.

Our approach would be similar to what's been discussed before.

  • In the Save method, we dump resources to file from the cluster.
  • In the Restore method, we would accept a glob to get the files with saved objects, create unstructured.Unstructured objects from the files using the clusterctl yaml package, and then apply those to the cluster.

@fabriziopandini
Member

fabriziopandini commented May 16, 2021

@jpmcb thanks for the update.
While working on this, let's make sure we define exactly which use cases we aim to support and what the boundaries of this feature are, because personally I don't see clusterctl ever adding fancy backup features like scheduled backups, backup to cloud storage, etc.

Last but not least, we should consider that move is now going to include global resources (#3042 (comment)); this could make backup/restore trickier given the namespaced nature of move.
