Backup and Restore Implementation #743
Comments
Some good points were raised during our meeting. The Velero backup will only apply to block-storage PVs/PVCs deployed within Kubernetes on the specific cloud providers. We do use EFS on AWS, and this would be out of scope. Additionally, I foresee a conversation around storing credentials within the qhub-config.yaml for S3 bucket access. For now, assume that storing credentials within the configuration is okay, because other future PRs will solve storing secrets in the configuration.
Assigning:
Some issues raised:
Looking at current alternatives to Velero:
tl;dr the market is very limited vis-à-vis FOSS alternatives to Velero.
Wanted to document a solution I got working on-prem via minikube. To start the minikube cluster:

```shell
minikube start --driver=docker --kubernetes-version=v1.21.3
```

Then we need to create the MinIO S3 backend (sure, we could use a cloud-based bucket instead):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  type: NodePort
  ports:
    - name: "9000"
      nodePort: 30900
      port: 9000
      targetPort: 9000
    - name: "9001"
      nodePort: 30901
      port: 9001
      targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:RELEASE.2021-08-25T00-41-18Z
          command:
            - "sh"
          args:
            - "-c"
            - "mkdir -p /data/velero && /usr/bin/minio server /data --console-address 0.0.0.0:9001"
          env:
            - name: MINIO_ACCESS_KEY
              value: admin
            - name: MINIO_SECRET_KEY
              value: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - mountPath: /data
              name: minio-claim
      restartPolicy: Always
      volumes:
        - name: minio-claim
          persistentVolumeClaim:
            claimName: minio-claim
```

and then an example application to test the backup with:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hellopod
spec:
  containers:
    - name: hello
      image: busybox
      imagePullPolicy: IfNotPresent
      command:
        - /bin/sh
        - -c
        - "date >> /data/example.txt; sleep 100000"
      volumeMounts:
        - mountPath: /data
          name: pod-claim
  restartPolicy: OnFailure
  volumes:
    - name: pod-claim
      persistentVolumeClaim:
        claimName: pod-claim
```

Then `kubectl apply` both of these manifests. Next we install the velero CLI and also install velero on the cluster. We need to create a file with the credentials for our S3 bucket and how to access it:

```ini
[default]
aws_access_key_id = admin
aws_secret_access_key = password
```

And then we download and install velero:

```shell
wget https://github.com/vmware-tanzu/velero/releases/download/v1.6.3/velero-v1.6.3-linux-amd64.tar.gz
tar -xf *.tar.gz
cd velero-*
./velero install --provider=aws --plugins velero/velero-plugin-for-aws:v1.0.0 --use-restic --use-volume-snapshots=false --bucket=velero --secret-file /tmp/velero/credentials.txt --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.default.svc:9000
```

Finally, let's demonstrate a backup:

```shell
./velero backup create anexample --default-volumes-to-restic=true
```

You can check that a backup was performed successfully by visiting the MinIO web UI; the minikube IP address is available via `minikube ip`. This also looks like it will be able to back up EFS and cloud-specific PVCs 😄. So good news @brl0! Still very much a POC, but I believe this tool will work great for our use case and then some.
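If preferred, the backup's status can also be checked from the velero CLI rather than the MinIO console; for example, using the binary downloaded above:

```shell
# list all backups and their phase (Completed, PartiallyFailed, ...)
./velero backup get
# inspect a single backup, including the restic-backed volumes
./velero backup describe anexample --details
```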
At a high level it appears that a Velero + Restic backup will most likely work for our purposes. I started my testing on a minikube cluster but kept running into errors (they might still be user errors), so I decided to repurpose an existing AWS deployment I was using; I had much better success backing up and restoring on the AWS QHub cluster (steps outlined below). There are still a handful of things to test and consider:
Steps
Observations
This is verified working on AWS and GCP.
Backup
In order to specify a volume for restic restoration, we need to annotate the pod with:
To avoid errors on mounts that don't need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:
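The code blocks from the original comment did not survive the copy here. As a rough sketch of the Velero/restic mechanics being described (pod, volume, and PVC names are placeholders, not the ones actually used):

```shell
# opt a pod's volume into restic backups
kubectl -n dev annotate pod <pod-name> backup.velero.io/backup-volumes=<volume-name>
# exclude a PersistentVolumeClaim that does not need to be backed up
kubectl -n dev label persistentvolumeclaim <pvc-name> velero.io/exclude-from-backup=true
```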
With this setup, velero can be installed with the
The backup is created with:
Restore
Note that all user notebooks need to be shut down as well. Existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using the
With these gone, the restore can be initiated with:
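The exact command wasn't captured above; with the Velero CLI, a restore from an existing backup is normally started along these lines (using the backup name created earlier):

```shell
./velero restore create --from-backup anexample
# follow progress with:
./velero restore describe <restore-name>
```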
Note that the restore will say that it partially failed. This is because there is already a symlink for
To copy the backed up data for home and shared to the current working directory, run the following command:
This will prompt for a password, which will always be
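The command itself did not survive the copy above. As a rough sketch only, pulling the home and shared folders out of a restic repository into the current directory could look like this (repository URL and paths are assumptions, not what was originally posted):

```shell
restic -r s3:s3.amazonaws.com/<bucket>/<restic-repo-path> restore latest \
  --target . --include /home --include /shared
```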
Now that we have Argo-Workflows enabled, we can run backup and restore workflows much more easily; with a few small updates, we can schedule backups as cron-workflows. This backup/restore solution also relies on restic.
Requirements and implementation
Here is an example of how we might want to run backups and restores. We will need:
Resource details
I created this image so I could test this proposed solution (more on the results below). However, in the long term we would likely require an image to be built and pushed to our open registries for each of the cloud providers that we support (AWS, Azure, DO, GCP). Again, to test the feasibility of this solution, I created the following secrets:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: google-application-credentials
  namespace: dev
type: Opaque
data:
  GOOGLE_APPLICATION_CREDENTIALS: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: google-project-id
  namespace: dev
type: Opaque
data:
  GOOGLE_PROJECT_ID: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: restic-repo
  namespace: dev
type: Opaque
data:
  RESTIC_REPOSITORY: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: restic-password
  namespace: dev
type: Opaque
data:
  RESTIC_PASSWORD: ---
```
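For a quick test, these secrets can also be created straight from the command line instead of hand-writing base64 values (namespace and secret names mirror the manifests above; the values are placeholders):

```shell
kubectl -n dev create secret generic google-application-credentials \
  --from-file=GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account.json>
kubectl -n dev create secret generic google-project-id \
  --from-literal=GOOGLE_PROJECT_ID=<project-id>
kubectl -n dev create secret generic restic-repo \
  --from-literal=RESTIC_REPOSITORY=<repository-url>
kubectl -n dev create secret generic restic-password \
  --from-literal=RESTIC_PASSWORD=<password>
```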
And then the actual workflows themselves.

Backup workflow

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: backup-workflow
  namespace: dev
spec:
  entrypoint: backup
  volumes:
    - name: google-application-credentials
      secret:
        secretName: google-application-credentials
    - name: nfs-volume
      persistentVolumeClaim:
        claimName: "jupyterhub-dev-share"
  templates:
    - name: backup
      container:
        # image I created above
        image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
        env:
          - name: GOOGLE_PROJECT_ID
            valueFrom:
              secretKeyRef:
                name: google-project-id
                key: GOOGLE_PROJECT_ID
          - name: RESTIC_REPOSITORY
            valueFrom:
              secretKeyRef:
                name: restic-repo
                key: RESTIC_REPOSITORY
          - name: RESTIC_PASSWORD
            valueFrom:
              secretKeyRef:
                name: restic-password
                key: RESTIC_PASSWORD
        volumeMounts:
          - mountPath: "/var/secrets/google"
            name: google-application-credentials
          # mount the NFS drive
          - mountPath: "/exports"
            name: nfs-volume
        command: [sh, -c]
        args: ['
          gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
          restic init;
          restic backup /exports
        ']
```

Restore workflow

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: restore-workflow
  namespace: dev
spec:
  entrypoint: restore
  volumes:
    - name: google-application-credentials
      secret:
        secretName: google-application-credentials
    - name: nfs-volume
      persistentVolumeClaim:
        claimName: "jupyterhub-dev-share"
  templates:
    - name: restore
      container:
        # again, same image I created above
        image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
        env:
          - name: GOOGLE_PROJECT_ID
            valueFrom:
              secretKeyRef:
                name: google-project-id
                key: GOOGLE_PROJECT_ID
          - name: RESTIC_REPOSITORY
            valueFrom:
              secretKeyRef:
                name: restic-repo
                key: RESTIC_REPOSITORY
          - name: RESTIC_PASSWORD
            valueFrom:
              secretKeyRef:
                name: restic-password
                key: RESTIC_PASSWORD
        volumeMounts:
          - mountPath: "/var/secrets/google"
            name: google-application-credentials
          # mount the NFS drive
          - mountPath: "/exports"
            name: nfs-volume
        command: [sh, -c]
        args: ['
          gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
          restic restore latest --target=/
        ']
```

Results

I have successfully tested this solution - including both the backup and restore steps - on a live cluster running on GCP, with the backup residing on GCS 🎉

Notes and open items

A few things to note: given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the SOPS RFD and Vault RFD). Obviously this is just a POC and will need to be converted to a Terraform script; that said, this solution looks very promising and I am curious what the rest of the team thinks.
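To follow up on the cron-workflow idea mentioned at the top of this comment, here is a minimal sketch of wrapping the same backup template in an Argo CronWorkflow; the name and schedule are assumptions, and the env/volume definitions (identical to the backup Workflow above) are omitted for brevity:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: backup-cron-workflow   # assumed name
  namespace: dev
spec:
  schedule: "0 2 * * *"        # assumed: run daily at 02:00
  concurrencyPolicy: Forbid    # don't start a new backup while one is still running
  workflowSpec:
    entrypoint: backup
    # volumes, env, and volumeMounts as in the backup Workflow above
    templates:
      - name: backup
        container:
          image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
          command: [sh, -c]
          args: ['restic backup /exports']
```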
Just so we don't forget: a full backup and restore also needs to take the keycloak and conda-store databases into account.
With this in the works, there is now pressure for me to develop a backup/restore solution for conda-store 🙂.
Absolutely. Yes, we need proper secret storage before adopting something like this, to avoid leaking secrets in the Argo workflows.
@iameskild yes, this is something that should be included in the future roadmap. I'm going to suggest that we use the extension PR work to add a subcommand for it. I think that backup/restore is something that we will need to incrementally improve independently of nebari; it would also allow us to release more frequently. I see several iterations that we should aim for.
What to back up:
Where to back up to:
Priority in my mind:
@costrouc creating these as subcommands makes a lot of sense! And to confirm, we will be relying on the kubernetes (and keycloak) Python clients directly and won't be using terraform? And would the NFS backup be a single tar.gz file? Perhaps we could look into restic again: the benefit is that it only backs up the diff, so it would be very quick after the initial backup. As I've done elsewhere, we can perform backups for individual directories (users), so in the event of a failure during the restore we can pick back up more reliably.
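As a hedged sketch of that per-user idea, each user's directory could be backed up as its own tagged restic snapshot (paths assume the /exports layout from the workflows above, with RESTIC_REPOSITORY and RESTIC_PASSWORD already exported):

```shell
# one snapshot per user home directory, tagged for easier partial restores
for dir in /exports/home/*/; do
  restic backup "$dir" --tag "user-$(basename "$dir")"
done
# the shared directory as its own snapshot
restic backup /exports/shared --tag shared
```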
We could also include a backup GitOps workflow that runs on a daily schedule.
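A rough sketch of what such a scheduled GitOps backup could look like as a GitHub Actions workflow; it assumes the `qhub backup` subcommand proposed in this issue exists, so treat it as illustrative only:

```yaml
# .github/workflows/backup.yaml (illustrative; `qhub backup` is the proposed subcommand)
name: nightly-backup
on:
  schedule:
    - cron: "0 2 * * *"   # assumed: daily at 02:00 UTC
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install qhub
      - run: qhub backup -c qhub-config.yaml   # assumed flag, mirroring the deploy command
```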
Will this include extension data such as MLflow? This will need to be tested on both AWS and AWS GovCloud for JATIC.
Superseded by #2648.
Summary
QHub is currently lacking a backup and restore solution. Initially this issue was not especially complex, since all state was stored on a single NFS filestore; we talked about having a Kubernetes cron job that runs restic daily to back up the filesystem to a single S3 bucket. However, databases and state are now starting to be stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using Velero, which looks to be a well-adopted open source solution for backup and restore.
Proposed implementation
We realize this is a large issue and it will most likely be easiest to approach this problem in steps.
The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs; the most similar one is #733. This will only deploy the velero agent on the kubernetes cluster. This should be configured via a qhub-config.yaml configuration setting; the PR above gives an example of adding such a setting. There will additionally be a key `credentials` that takes an arbitrary dict of credentials to pass on to the helm chart (see https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands). These credentials will be used to set up file backups and block storage backups. A `schedule` key will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually, similar to how we handle terraform (https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23). Since velero is a Go binary, it should be possible to transparently download the velero binary (https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1) and expose it in the CLI behind `qhub backup` and `qhub restore` commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.

Initially we would like simple `qhub backup` and `qhub restore` commands. Eventually we could imagine these commands growing into more complicated backups, but we realize this problem is complicated enough as currently scoped (a rough sketch of what such a provider module could look like follows below).

Additionally, there should be documentation added for the admin and dev guides.
Acceptance Criteria
- `qhub backup` should trigger a manual backup of the cluster, with files being backed up to an S3 bucket
- `qhub restore` should trigger a restore action that will refresh the contents of PVCs within the cluster (this is less well understood at the moment and may not be possible)

Tasks to complete
Related to