
Backup and Restore Implementation #743

Closed
2 of 6 tasks
costrouc opened this issue Jul 28, 2021 · 18 comments
Labels
needs: discussion 💬 Needs discussion with the rest of the team project: JATIC Work item needed for the JATIC project type: enhancement 💅🏼 New feature or request

Comments

@costrouc
Member

costrouc commented Jul 28, 2021

Summary

QHub is currently lacking a backup and restore solution. Initially this problem was relatively simple, since all state was stored on a single NFS filestore; we talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and state are now starting to be stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using velero, which looks to be a well-adopted open source solution for backup and restore.

Proposed implementation

We realize this is a large issue, and it will most likely be easiest to approach this problem in steps.

The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs, the most similar one being https://github.com//pull/733. This will only deploy the velero agent on the kubernetes cluster, and it should be configured via a qhub-config.yaml configuration setting; the PR above gives an example of adding such a setting. There will additionally be a credentials key that takes an arbitrary dict of credentials to pass on to the helm chart (see https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands). These credentials will be used to set up file backups and block storage backups, and schedule will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

velero:
  enabled: true/false
  schedule: "0 0 * * *"
  credentials:
     ...
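
As a rough illustration, the section above could be modelled in qhub's configuration schema roughly as follows, assuming a pydantic-based validator; the class name and field defaults here are a sketch, not a final design.

# Hypothetical schema fragment for the proposed velero section of qhub-config.yaml.
from typing import Dict, Optional

from pydantic import BaseModel


class Velero(BaseModel):
    enabled: bool = False                         # backups are disabled by default
    schedule: str = "0 0 * * *"                   # cron expression for regular backups
    credentials: Optional[Dict[str, str]] = None  # passed through to the velero helm chart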

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually, similar to how we handle terraform (https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23). Since velero is a Go binary, it should be possible to transparently download the velero binary (https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1) and expose it in the CLI behind qhub backup and qhub restore commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.
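
A minimal sketch of what such a provider could look like, assuming a download helper analogous to the terraform one; the module path, function names, and default version are hypothetical, not a final design.

# qhub/provider/velero.py, a hypothetical module mirroring qhub/provider/terraform.py
import subprocess

VELERO_VERSION = "v1.6.1"  # assumed default, matching the release linked above


def download_velero_binary(version=VELERO_VERSION):
    """Download and cache the velero release binary, returning its local path.

    A real implementation would mirror terraform.py: fetch the release tarball
    from GitHub, unpack it into a cache directory, and mark it executable.
    """
    raise NotImplementedError


def run_velero(args):
    """Run the velero CLI with the given arguments."""
    velero_path = download_velero_binary()
    subprocess.run([str(velero_path)] + list(args), check=True)


def backup(name):
    """Trigger a manual backup of all volumes via restic."""
    run_velero(["backup", "create", name, "--default-volumes-to-restic=true"])


def restore(restore_name, backup_name):
    """Restore cluster storage from a previously created backup."""
    run_velero(["restore", "create", restore_name, "--from-backup", backup_name])

The qhub backup and qhub restore CLI commands would then be thin wrappers around these two functions.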

Initially we would like simple qhub backup and qhub restore commands. Eventually we could imagine these commands growing into more complicated backups, but we realize this problem is complicated enough as currently scoped.

Additionally there should be documentation added for the admin and dev guide.

Acceptance Criteria

  • Upon initial deployment of a QHub cluster with the backup configuration setting enabled, the cluster should be backed up every 24 hours to an S3 bucket
  • qhub backup should trigger a manual backup of the cluster, with files being backed up to an S3 bucket
  • qhub restore should trigger a restore action that will refresh the contents of PVCs within the cluster (this is less well understood at the moment and may not be possible)
  • Velero is installed via a helm chart instead of the velero binary

Tasks to complete

Related to

@costrouc costrouc added the type: enhancement 💅🏼 New feature or request label Jul 28, 2021
@costrouc
Member Author

Some good points were raised during our meeting. The velero backup will only apply to block storage PVs/PVCs deployed within kubernetes on the specific cloud providers. We do use EFS on AWS, and this would be out of scope.

Additionally, I foresee a conversation around storing credentials within the qhub-config.yaml for S3 bucket access. For now, assume that storing the bucket credentials in the configuration is okay, since other future PRs will solve storing secrets in the configuration.

@costrouc
Member Author

costrouc commented Aug 26, 2021

Assigning:

@toonarmycaptain
Contributor

Some issues raised:

  • Whether the VMWare/Tanzu Helm chart will fit our needs, and how much modification it will need.
  • Dependency on VMWare to maintain (and keep open source) Velero and the Helm chart
  • DigitalOcean is not supported by Velero, and while there is a community supported plugin (in DO's github) it does not appear to be under regular development/maintenance.
  • Amount of work/maintenance necessary to make and keep a Velero solution in QHub cloud-agnostic

@toonarmycaptain
Contributor

toonarmycaptain commented Sep 9, 2021

Looking at current alternatives to Velero:

| | Cost (e.g. paid / limited free / free) | Source (e.g. OSS / source available / closed source) | Features (e.g. full backup / etcd only) | Tool/platform | Presently maintained |
| --- | --- | --- | --- | --- | --- |
| Portworx PX-Backup | Limited: 5 TB / 5 nodes / 30 volumes | No, relies on OSS libs | Full | Tool | Yes |
| Kasten | Limited to 10 nodes | No | Full | Tool | Yes |
| Kubedr -> CloudCasa | Free | OSS (Apache) | etcd only | Tool | Alpha / unmaintained since 03/2020 |
| Rancher/Longhorn | Free | OSS | Full | Platform/storage tool | Yes |
| Stash by AppsCode | Free Community Edition | OSS, must apply for 1-year free license | Limited: no local/auto/batch backup in Free Edition | Tool | Yes |

tl;dr the market is very limited vis-à-vis FOSS alternatives to Velero.

@costrouc
Member Author

costrouc commented Sep 16, 2021

Wanted to document a solution I got working on-prem via minikube and on DigitalOcean. This seems to be cloud-agnostic for backups, which is promising. In addition, I didn't realize how complete the velero backups are: they include all of the resources as well and give strong controls over the backup.

minikube start --driver=docker --kubernetes-version=v1.21.3

This starts the minikube cluster. Then we need to create the MinIO S3 backend (we could also use a cloud-based bucket instead).

apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  type: NodePort
  ports:
  - name: "9000"
    nodePort: 30900
    port: 9000
    targetPort: 9000
  - name: "9001"
    nodePort: 30901
    port: 9001
    targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:RELEASE.2021-08-25T00-41-18Z
          args:
            - "-c"
            - "mkdir -p /data/velero && /usr/bin/minio server /data --console-address 0.0.0.0:9001"
          command:
            - "sh"
          env:
            - name: MINIO_ACCESS_KEY
              value: admin
            - name: MINIO_SECRET_KEY
              value: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - mountPath: /data
              name: minio-claim
      restartPolicy: Always
      volumes:
        - name: minio-claim
          persistentVolumeClaim:
            claimName: minio-claim

and then an example application to test the backup with

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hellopod
spec:
  containers:
    - name: hello
      image: busybox
      imagePullPolicy: IfNotPresent
      command:
      - /bin/sh
      - -c
      - "date >> /data/example.txt; sleep 100000"
      volumeMounts:
        - mountPath: /data
          name: pod-claim
  restartPolicy: OnFailure
  volumes:
    - name: pod-claim
      persistentVolumeClaim:
        claimName: pod-claim

Then kubectl apply both of these manifests. Next we download the velero CLI and install velero on the cluster. We need to create a file with the credentials for our S3 bucket and how to access it.

[default]
aws_access_key_id = admin
aws_secret_access_key = password

And then we download velero

wget https://github.com/vmware-tanzu/velero/releases/download/v1.6.3/velero-v1.6.3-linux-amd64.tar.gz
tar -xf *.tar.gz
cd velero-*

./velero install --provider=aws --plugins velero/velero-plugin-for-aws:v1.0.0 --use-restic --use-volume-snapshots=false --bucket=velero --secret-file /tmp/velero/credentials.txt --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.default.svc:9000

Finally, let's demonstrate a backup:

./velero backup create anexample --default-volumes-to-restic=true

You can check that a backup was performed successfully by visiting the MinIO web UI. The minikube IP address is available via minikube ip, and the MinIO service is exposed on node ports 30900 (API) and 30901 (console); you can also access the UI via port forwarding (https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/). I also briefly tested deleting the pod resource and then restoring the volume. This seemed to work, though I didn't test it as thoroughly. However, the backup is clearly happening on both DO and minikube. On-prem, velero has issues with hostPath PVC volumes, but outside of testing I would consider this a rare circumstance, since hostPaths cannot work for any true multi-node kubernetes deployment.

This also looks like it will be able to back up EFS and cloud-specific PVCs 😄. So good news @brl0! Still very much a POC, but I believe this tool will work great for our use case and then some.

@iameskild
Member

At a high level, it appears that a Velero + restic backup will most likely work for our purposes. I started my testing on a minikube cluster but kept running into errors (they might still be user errors), so I decided to repurpose an existing AWS deployment I was using; I had much better success backing up and restoring on the AWS QHub cluster (steps outlined below). There are still a handful of things to test and consider:

  • Test with main
    • ensure keycloak postgresql db is also properly restored
  • Test on other cloud providers
  • Explore using Helm chart to backup / restore
  • Explore how end-user would go about restoring system
    • CI/CD workflow might be most convenient and aligns with the infrastructure-as-code paradigm

Steps

qhub --version
0.3.13

# brew install velero 
velero version 
Client:
	Version: v1.7.0
	Git commit: -
Server:
	Version: v1.7.0

velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=true \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION

velero backup create test --include-namespaces=dev --wait

# Tear down QHub 
python -m qhub destroy -c qhub-config.yaml

# Redeploy with same config file
python -m qhub deploy -c qhub-config.yaml

# Prepare for NFS restore
# - delete nfs-mount-dev-share, conda-store-dev-share PVCs
# - delete jupyterhub-sftp Deployment

# Update PV reclaim-status from "Released" to "Available"
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'
kubectl patch pv conda-store-dev-share -p '{"spec":{"claimRef": null}}'

velero create restore test-restore --from-backup=test

Observations

  • If the resource is already online and available, then the restore will log a warning, skip it and move on
  • Upon restore, three dask-schedulers, each with a handful of workers, were also restored
  • This restore completed with a Partial-Fail status due to its inability to restore the jupyterhub-sftp volume home
    • The error message states that the error is related to there already being a shared folder (see below)
      • Although I haven't tested it yet, I suspect that if we delete the shared folder (even if it's empty) prior to the restore, we can get the restic restore to complete successfully
Errors:
  Velero:  pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://s3.eu-west-2.amazonaws.com/eaeqhubbu/restic/dev --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic bd6d4b69 --target=., stdout=restoring <Snapshot bd6d4b69 of [/host_pods/f3eaeaa7-7755-4c99-b79c-4fbb5029ee55/volumes/kubernetes.io~nfs/nfs-mount-dev-share] at 2021-11-11 23:39:55.972809323 +0000 UTC by root@velero> to .
, stderr=ignoring error for /home/iameskild/shared: Symlink: symlink /home/shared /host_pods/003f3c3d-d922-4784-a26c-ea60dd639775/volumes/kubernetes.io~nfs/nfs-mount-dev-share/home/iameskild/shared: file exists
Fatal: There were 1 errors

@tonyfast tonyfast added this to the Release v0.4.0 milestone Nov 16, 2021
@tylerpotts
Contributor

tylerpotts commented Nov 17, 2021

This is verified working on AWS and GCP:

Backup

In order to specify a volume for restic restoration, we need to annotate a pod with backup.velero.io/backup-volumes: <pods_name_for_persistentvolume>. I decided to do this by creating a pod specifically for this purpose. With the following saved as custom_pod.yaml, I added it to the cluster with kubectl apply -f custom_pod.yaml:

kind: Pod
apiVersion: v1
metadata:
  name: restic-placeholder
  namespace: dev
  annotations:
    backup.velero.io/backup-volumes: home
spec:
  volumes:
    - name: home
      persistentVolumeClaim:
        claimName: "nfs-mount-dev-share"
  containers:
    - name: placeholder
      image: ubuntu
      command: ["sleep", "36000000000000"]
      volumeMounts:
        - mountPath: "/data"
          name: home

To avoid errors on mounts that don't need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:

kubectl label pvc conda-store-dev-share velero.io/exclude-from-backup=true -n dev
kubectl label pvc hub-db-dir velero.io/exclude-from-backup=true -n dev
kubectl label pvc qhub-conda-store-storage velero.io/exclude-from-backup=true -n dev

With this setup, velero can be installed with --default-volumes-to-restic=false:

velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=false \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION

The backup is created with:

velero backup create qhub-backup --include-namespaces=dev --wait

Restore

Note that all user notebooks need to be shut down as well; existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using nfs-mount-dev-share with the commands below:

kubectl delete deployments qhub-jupyterhub-sftp -n dev
kubectl delete pod restic-placeholder -n dev
kubectl delete pvc nfs-mount-dev-share -n dev
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'

With these gone, the restore can be initiated with:

velero restore create qhub-restore --from-backup qhub-backup

Note that the restore will say that it partially failed. This is because there is already a symlink for /home/shared. However, data in the user directories as well as the shared directories gets restored as expected.

@iameskild iameskild removed this from the Release v0.4.0 milestone Nov 19, 2021
@tylerpotts
Contributor

To copy the backed up data for home and shared to the current working directory, run the following command:

restic -r s3:s3.amazonaws.com/<backup_bucket>/restic/dev --verbose=2 restore latest --target .

This will prompt for a password, which will always be static-passw0rd

@iameskild iameskild added the needs: discussion 💬 Needs discussion with the rest of the team label Apr 14, 2022
@iameskild
Member

Now that we have Argo-Workflows enabled, we can run backup and restore workflows much more easily; with a few small updates, we can schedule backups as cron-workflows.

This backup/restore solution also relies on restic to perform the actual backup and restore.

Requirements and implementation

Here is an example of how we might want to run backups and restores.

We will need:

  1. an image that contains restic and the cloud-specific CLI (gcloud, awscli, etc.)
  2. a cloud provider service account with the ability to read/write/create storage buckets/blobs
  3. secrets for each of the following:
    • RESTIC_REPOSITORY
    • RESTIC_PASSWORD
    • cloud specific credentials such as
      • GOOGLE_APPLICATION_CREDENTIALS - specific to the service account mentioned in 2.
      • GOOGLE_PROJECT_ID
  4. one backup workflow and one restore workflow
    • as mentioned above, it would make sense to have the backup workflow run on a schedule (e.g. every day at midnight); a scheduling sketch follows this list
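
For the scheduling piece, one option is to wrap the backup workflow spec (defined in the Backup workflow section below) in an Argo CronWorkflow and submit it with the Kubernetes Python client. A minimal sketch, assuming the spec from that section; the names, namespace, and schedule here are placeholders.

# Sketch: run the backup workflow every day at midnight via an Argo CronWorkflow.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

backup_workflow_spec = {
    "entrypoint": "backup",
    "templates": [],  # placeholder: the "backup" template from the workflow below
}

cron_workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "CronWorkflow",
    "metadata": {"name": "backup-cron-workflow", "namespace": "dev"},
    "spec": {
        "schedule": "0 0 * * *",        # every day at midnight
        "concurrencyPolicy": "Forbid",  # never run overlapping backups
        "workflowSpec": backup_workflow_spec,
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="dev",
    plural="cronworkflows",
    body=cron_workflow,
)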

Resource details

I created this image so I could test this proposed solution (more on the results below). However, in the long term, we would likely need an image built and pushed to our open registries for each of the cloud providers that we support (AWS, Azure, DO, GCP).

Again, to test the feasibility of this solution, I created the following secrets:

apiVersion: v1
kind: Secret
metadata:
  name: google-application-credentials
  namespace: dev
type: Opaque
data:
  GOOGLE_APPLICATION_CREDENTIALS: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: google-project-id
  namespace: dev
type: Opaque
data:
  GOOGLE_PROJECT_ID: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: restic-repo
  namespace: dev
type: Opaque
data:
  RESTIC_REPOSITORY: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: restic-password
  namespace: dev
type: Opaque
data:
  RESTIC_PASSWORD: --- 

And then the actual workflows themselves.

Backup workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: backup-workflow
  namespace: dev
spec:
  entrypoint: backup
  volumes:
  - name: google-application-credentials
    secret:
      secretName: google-application-credentials
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: "jupyterhub-dev-share"  
  templates:
  - name: backup
    container:
      # image I created above
      image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
      env:
      - name: GOOGLE_PROJECT_ID  
        valueFrom:
          secretKeyRef:
            name: google-project-id
            key: GOOGLE_PROJECT_ID
      - name: RESTIC_REPOSITORY
        valueFrom:
          secretKeyRef:
            name: restic-repo
            key: RESTIC_REPOSITORY
      - name: RESTIC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: restic-password
            key: RESTIC_PASSWORD
      volumeMounts:
        - mountPath: "/var/secrets/google"
          name: google-application-credentials
        # mount the NFS drive
        - mountPath: "/exports"
          name: nfs-volume
      command: [sh, -c]
      args: ['
        gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
        restic init;
        restic backup /exports 
      ']

Restore workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: restore-workflow
  namespace: dev
spec:
  entrypoint: restore
  volumes:
  - name: google-application-credentials
    secret:
      secretName: google-application-credentials
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: "jupyterhub-dev-share"  
  templates:
  - name: restore
    container:
     # again, same image I created above
      image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
      env:
      - name: GOOGLE_PROJECT_ID  
        valueFrom:
          secretKeyRef:
            name: google-project-id
            key: GOOGLE_PROJECT_ID
      - name: RESTIC_REPOSITORY
        valueFrom:
          secretKeyRef:
            name: restic-repo
            key: RESTIC_REPOSITORY
      - name: RESTIC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: restic-password
            key: RESTIC_PASSWORD
      volumeMounts:
        - mountPath: "/var/secrets/google"
          name: google-application-credentials
       # mount the NFS drive
        - mountPath: "/exports"
          name: nfs-volume
      command: [sh, -c]
      args: ['
        gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
        restic restore latest --target=/
        ']

Results

I have successfully tested this solution - including both the backup and restore steps - on a live cluster running on GCP with a backup residing on GCS 🎉

Notes and open items

A few things to note: given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the SOPS RFD and Vault RFD).

Obviously this is just a POC and will need to be converted to a Terraform script. That said, this solution looks very promising, and I am curious what the rest of the team thinks.

@dharhas
Member

dharhas commented Feb 16, 2023

Just so we don't forget. A full backup and restore needs to also take into account keycloak and conda-store databases.

@costrouc
Member Author

With this in the works, there is now pressure on me to develop a backup/restore solution for conda-store 🙂.

@costrouc
Member Author

> A few things to note, given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the nebari-dev/governance#29 and nebari-dev/governance#32).

Absolutely, yes, we need proper secret storage before adopting something like this, to avoid leaking secrets in the Argo workflows.

@iameskild
Member

@trallard @dharhas @costrouc Is the backup/restore feature something worth including in a future roadmap? Several of our recent releases have required users to back up and restore their data, and this would make that process a lot smoother.

@costrouc
Member Author

@iameskild yes this is something that should be included in the future roadmap. I'm going to suggest that we use the extension PR work to add a subcommand for backup and restore in a separate nebari repository.

I think backup/restore is something that we will need to incrementally improve independently of nebari; this would also allow us to release more frequently. I see several iterations that we should aim for:

What to backup:

  • shared directory, group/user data
  • conda-store state
  • keycloak state

Where to backup to:

  • a local directory when nebari backup is run, returning a large tarball
  • external s3 bucket

restore should have similar requirements.

Priority in my mind:

  • backup/restore command which can back up the shared directory to a local directory and then restore that state (a rough sketch of such a command follows this list)
  • backup/restore additionally keycloak
  • backup/restore to external s3 bucket
  • backup/restore conda-store as well
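
To make the first priority item concrete, here is a minimal sketch of what a standalone backup/restore CLI could look like, assuming a typer-based command; the option names, paths, and tarball format are illustrative only, not a decided design.

# Hypothetical nebari backup / nebari restore subcommands for the first
# iteration (shared directory -> local tarball and back); names are placeholders.
import tarfile
from pathlib import Path

import typer

app = typer.Typer(help="Backup and restore Nebari state")


@app.command()
def backup(
    shared_dir: Path = typer.Option(..., help="Mounted NFS shared directory"),
    output: Path = typer.Option(Path("nebari-backup.tar.gz"), help="Tarball to write"),
):
    """Archive the shared directory (group/user data) into a local tarball."""
    with tarfile.open(output, "w:gz") as tar:
        tar.add(shared_dir, arcname="shared")
    typer.echo(f"Backup written to {output}")


@app.command()
def restore(
    archive: Path = typer.Option(..., help="Tarball produced by nebari backup"),
    target: Path = typer.Option(..., help="Directory to restore into"),
):
    """Unpack a previously created backup into the target directory."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target)
    typer.echo(f"Restored {archive} into {target}")


if __name__ == "__main__":
    app()

Keycloak, conda-store, and external S3 targets would then layer on as additional options or subcommands in later iterations.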

@iameskild
Member

iameskild commented Aug 17, 2023

@costrouc creating these as subcommands makes a lot of sense!

And to confirm, we will be relying on the kubernetes (and keycloak) python client directly and won't be using terraform?

And would the NFS backup be a single tar.gz file? Perhaps we could look into restic again; the benefits include only backing up the diff, so it would be very quick after the initial backup. As I've done elsewhere, we can perform backups of individual (user) directories so that, in the event of a failure during the restore, we can pick back up more reliably.
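
A minimal sketch of that per-directory idea, assuming restic is on PATH and RESTIC_REPOSITORY / RESTIC_PASSWORD are exported; the /exports/home mount point and the tagging scheme are illustrative only.

# Sketch: back up each user's home directory as its own restic snapshot so a
# failed restore can resume per user.
import subprocess
from pathlib import Path

HOME_ROOT = Path("/exports/home")  # illustrative mount point for user home directories


def backup_per_user():
    for user_dir in sorted(p for p in HOME_ROOT.iterdir() if p.is_dir()):
        subprocess.run(
            ["restic", "backup", str(user_dir), "--tag", f"user={user_dir.name}"],
            check=True,
        )


def restore_user(username, target="/"):
    # Restore the most recent snapshot tagged for this user.
    subprocess.run(
        ["restic", "restore", "latest", "--tag", f"user={username}", "--target", target],
        check=True,
    )


if __name__ == "__main__":
    backup_per_user()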

@iameskild
Member

We could also include a backup GitOps workflow that runs on a daily schedule.

@pavithraes pavithraes moved this from New 📬 to Planned 💾 in 🪴 Nebari Project Management Aug 25, 2023
@pavithraes pavithraes moved this from TODO 📬 to Ready 🔔 in 🪴 Nebari Project Management Sep 8, 2023
@pavithraes pavithraes added the project: JATIC Work item needed for the JATIC project label Sep 13, 2023
@kcpevey kcpevey added this to the Next Release milestone Jan 23, 2024
@kcpevey
Contributor

kcpevey commented Jan 31, 2024

Will this include extension data such as mlflow?

This will need to be tested on both AWS and AWS GovCloud for JATIC

@pavithraes pavithraes modified the milestones: 2024.2.1, Release Q2 2024 Feb 16, 2024
@viniciusdc
Contributor

superseded by #2648

@github-project-automation github-project-automation bot moved this from Ready 🔔 to Done 💪🏾 in 🪴 Nebari Project Management Aug 28, 2024
@github-project-automation github-project-automation bot moved this from In Progress 🏃🏽‍♀️ to Done 💪🏾 in QHub Project Mangement 🚀 Aug 28, 2024