
Data mover might not work with local PVs/PVCs CSIs #7044

Closed
trunet opened this issue Nov 1, 2023 · 8 comments
Comments

@trunet

trunet commented Nov 1, 2023

What steps did you take and what happened:

Because the data mover pod is scheduled to a different node than the original PV/PVC, the pod is stuck in Pending and the upload never happens.

I'm trying to use Velero CSI snapshot data movement to upload snapshots to a MinIO cluster. All my persistent volumes are backed by openebs/lvm-localpv.

At first, I was getting the error from #6964, and using velero/velero:main, which contains #6976, fixed it.

I created my backup with the following:
velero backup create trunettest --include-namespaces redis --snapshot-move-data

Now, the problem is that the data mover pod was scheduled to a node that doesn't have the snapshot (remember, it's a local PV). It has therefore been stuck in Pending for ~30 minutes now.

What did you expect to happen:

The snapshot should upload successfully to the backup location.

The following information will help us better understand what's going on:

bundle-2023-11-01-02-22-29.tar.gz

Anything else you would like to add:

The original PVC contains a volume.kubernetes.io/selected-node annotation, which could help the data mover pod set affinity to the correct node:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: local.csi.openebs.io
    volume.kubernetes.io/selected-node: talos01
    volume.kubernetes.io/storage-provisioner: local.csi.openebs.io
  creationTimestamp: "2023-10-27T09:42:10Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/component: master
    app.kubernetes.io/instance: redis
    app.kubernetes.io/name: redis
  name: redis-data-redis-master-0
  namespace: redis
  resourceVersion: "142514752"
  uid: 11171069-19d9-4448-9a94-1ca0d28d5feb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: openebs-lvm
  volumeMode: Filesystem
  volumeName: pvc-11171069-19d9-4448-9a94-1ca0d28d5feb
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  phase: Bound
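
For illustration, this is roughly the kind of scheduling constraint that annotation could drive on the data mover pod. This is a hypothetical sketch, not something Velero currently does per this issue; the pod name, image, and hostname label key are assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: trunettest-datamover        # hypothetical name
  namespace: velero
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - talos01               # value taken from the selected-node annotation
  containers:
  - name: datamover
    image: velero/velero:main       # image is illustrative
    # volume mounts for the cloned backup PVC omitted for brevity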

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.12.1
        Git commit: 5c4fdfe147357ec7b908339f4516cd96d6b97c61
Server:
        Version: main
# WARNING: the client version does not match the server version. Please update server
  • Velero features (use velero client config get features): features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3
  • Kubernetes installer & version: Talos v1.5.4
  • Cloud provider or hardware configuration: bare-metal
  • OS (e.g. from /etc/os-release): talos

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@trunet trunet changed the title Data mover might not work with local PVs/PVCs CSIs due to data mover being scheduled to a different node than the original PV/PVC Data mover might not work with local PVs/PVCs CSIs Nov 1, 2023
@Lyndon-Li Lyndon-Li self-assigned this Nov 1, 2023
@Lyndon-Li
Contributor

I think this is another case that shows why we need the blacklist mentioned in issue #7036.

@trunet Could you confirm whether a blacklist for the data mover scheduler would solve your current problem? Or would you prefer a whitelist?

@trunet
Author

trunet commented Nov 1, 2023

Neither would work in my case, because the data mover needs to run on the same node as the PVC being snapshotted. If I have 3 PVCs on 3 different nodes, I would need to allow-list all of them, but I'd still have the problem mentioned in this ticket.

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 1, 2023

I double-checked the code for openebs/lvm-localpv, and I don't think it supports restoring/cloning a volume from a CSI snapshot. See the code here: if the volume source is a snapshot, it returns Unimplemented.

Also see the features in its README: Clone is not supported.

So maybe you need to use openebs/zfs-localpv.
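
For context, the backup PVC that the data movement path asks the CSI driver to provision looks roughly like the sketch below (names are illustrative, not taken from the bundle). Provisioning it requires exactly the clone/restore-from-snapshot capability that lvm-localpv reports as Unimplemented, so it never binds:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trunettest-backup-pvc                      # illustrative name
  namespace: velero
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: openebs-lvm
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: velero-redis-data-redis-master-0-xxxxx   # illustrative name
  resources:
    requests:
      storage: 1Gi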

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 1, 2023

Finally, I figured out that the current problem is not that the data mover is running on a node where the snapshot doesn't exist or isn't accessible, but that the snapshot cannot be cloned into a volume at all, since openebs/lvm-localpv doesn't support it.
As a result, the data mover waits for the backup PVC to be bound, which never happens, until the 30 min timeout expires.

The snapshot's location is not a problem in itself for local volumes:

  • The VS/VSC for a snapshot are available across all cluster nodes
  • When a snapshot is cloned into a local volume, the provisioned volume is available from one node only, so the CSI driver provisions the volume on that node and records it in the PV's node affinity (see the sketch below)
  • When Kubernetes schedules a pod that uses the volume, the scheduler checks the PV's node affinity and finds the preferred node for the pod
  • The Velero data mover leverages the same pod scheduling mechanism to schedule data mover tasks to nodes
  • So data mover tasks are always scheduled to a node where the volume cloned from the snapshot is accessible
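
For illustration (not taken from this cluster), a PV that a local-volume CSI driver provisions from a snapshot is typically pinned to its node roughly like this; the names are made up and the exact topology key depends on the driver:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example-cloned-from-snapshot           # illustrative name
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: openebs-lvm
  csi:
    driver: local.csi.openebs.io
    volumeHandle: pvc-example-cloned-from-snapshot # illustrative handle
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname              # topology key varies by driver
          operator: In
          values:
          - talos01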

@Lyndon-Li
Contributor

@trunet Since openebs/lvm-localpv doesn't support snapshot clone, I don't think you can use either CSI snapshot backup or CSI snapshot data movement backup to back up volumes provisioned by it. As mentioned above, openebs/zfs-localpv may be an alternative for you.

@trunet
Author

trunet commented Nov 1, 2023

ok, I’ll give it a shot, thanks.

@DommDe

DommDe commented Aug 7, 2024

@Lyndon-Li If I understand you correctly, the problem with openebs/lvm-localpv is that the data mover clones the snapshot into a new PV, which is then uploaded to the remote backup location. This is not possible with openebs/lvm-localpv, as it does not implement the clone feature. Since I want to use openebs/lvm-localpv but need backups, I have an idea and want to know whether it could work:

A snapshot created by openebs/lvm-localpv can be mounted on the host it is stored on. This way the snapshots can easily be copied to a remote location. It's easy to do this manually for a single snapshot, or to write a script that performs the steps needed (mount, copy, unmount).

Now my idea is to tell Velero to use my own data mover, called "openebs-lvm-mover" for example. This data mover could be implemented as a script running on the worker host, looking for DataUpload resources that have "openebs-lvm-mover" set as the data mover and are related to a snapshot stored on that host. In that case it would start the script that mounts and copies the snapshot to a remote location. After the script finishes, the DataUpload needs to be marked as finished, and the snapshot can be removed. Is this correct or did I miss something?
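
For illustration, a DataUpload targeted at such a custom mover might look roughly like this sketch; the resource name and snapshot name are made up, and the exact field names (for example datamover and the csiSnapshot sub-fields) should be verified against the DataUpload CRD and the data movement design doc for your Velero version:

apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  name: trunettest-abcde                            # generated by Velero, illustrative here
  namespace: velero
spec:
  datamover: openebs-lvm-mover                      # assumed field name; selects the custom mover
  snapshotType: CSI
  sourceNamespace: redis
  sourcePVC: redis-data-redis-master-0
  backupStorageLocation: default
  operationTimeout: 10m
  csiSnapshot:
    volumeSnapshot: velero-redis-data-redis-master-0-xxxxx   # created by the Velero CSI plugin
    storageClass: openebs-lvm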

@Lyndon-Li
Contributor

@DommDe
It is basically correct. For more details, you can refer to the Velero Snapshot Data Movement Design, which explains how to write a 3rd-party data mover.
