
Add possibility to change or map StorageClass during backup using CSI Snapshots and DataMover #7700

Closed
fgleixner opened this issue Apr 17, 2024 · 8 comments

Comments

@fgleixner

Describe the problem/challenge you have
We use Longhorn and have several Longhorn StorageClasses defined for different workloads: some for SSD/NVMe disks, some for rotating rust, and some with 1, 2, or even 3 replicas.
When we run backups, we noticed that the PVC generated from the CSI snapshot uses the same StorageClass as the original volume.
So a PVC and a PV are created only for backup purposes, and they inherit the settings of the original PV, which may be expensive NVMe storage with 3 replicas. This can cause the backup to fail, because enough of this specific storage may not be available.

Describe the solution you'd like
I'd like a way to map StorageClasses during snapshot data movement, the same way it is already possible for restores:
https://velero.io/docs/v1.13/restore-reference/#changing-pvpvc-storage-classes
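For reference, the restore-time mapping linked above is driven by a ConfigMap in the velero namespace; a minimal example per that doc (the storage class names in `data` are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # any name works; Velero's plugin discovers the ConfigMap via the labels
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  # <source storage class>: <target storage class>
  old-storage-class: new-storage-class
```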

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.12.1
  • Kubernetes version (use kubectl version): 1.28
  • Kubernetes installer & version: kubespray 2.24.1
  • Cloud provider or hardware configuration: on premises, Longhorn for storage
  • OS (e.g. from /etc/os-release): SLES 15

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@sseago
Collaborator

sseago commented Apr 17, 2024

To make sure I understand this: you're saying that the storage class mapping on backup would only be used for creating the temporary cloned-via-snapshot PVC that Velero uses for copying data to object storage, while the storage class stored in the backup for the PVC to be restored would be unchanged? Of course, if the storage class you want for the temporary data-movement PVC is the same as the one you want to restore to, you'd just use the same mapping for both backup and restore.

Or did you have something different in mind?

@fgleixner
Author

Exactly. The mapping structure in the ConfigMap could be the same and re-used, just for a different purpose. This should change nothing except the StorageClass of the temporary PVC used for copying data to object storage.

@Rohmilchkaese

Thanks for opening this, @fgleixner; we discovered the same thing, also with Longhorn as the storage backend!

@gh-tek

gh-tek commented Jun 18, 2024

I have this problem with Longhorn too. The temporary PVC for the data upload creates a huge disk I/O load, because it creates a replicated PV with a data-locality requirement. I created another storage class for the data upload, but then realized that there is no place to configure it. Velero's data upload seems to always use whatever storage class the original PV has.

@ehemmerlin

ehemmerlin commented Jun 21, 2024

We are facing the same issue: the backup uses a lot of disk space, because Longhorn replicates the data of each snapshot the same way it does for all of our Kubernetes volumes. Moreover, it retains these volumes, since Retain is the reclaim policy of our storage class, so volumes created during backups are never deleted, even after the backup has expired and no longer exists.

We are looking for a way to remove the volumes created by the snapshots once the backup has expired. Specifying a different storage class for Velero's backups (we could create one with a Delete reclaim policy instead of Retain) would solve this: volumes from expired backups would be deleted.

Link to the original issue: #6192

Being able to change the StorageClass during backup using CSI snapshots and the DataMover would allow us to set reclaimPolicy to Delete and numberOfReplicas to 1, which would fix the entire issue we face.
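A minimal sketch of the kind of backup-oriented Longhorn StorageClass described above, assuming Longhorn's documented `numberOfReplicas` and `dataLocality` parameters (the class name is made up):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-backup   # hypothetical name for a class used only by backup PVCs
provisioner: driver.longhorn.io
reclaimPolicy: Delete            # so PVs left over from expired backups get cleaned up
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"          # no replication needed for a short-lived backup copy
  dataLocality: "disabled"       # avoid the data-locality I/O load mentioned above
  staleReplicaTimeout: "30"
```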

@larssb

larssb commented Jul 9, 2024

> I have this problem with Longhorn too. The temporary PVC for the data upload creates a huge disk I/O load, because it creates a replicated PV with a data-locality requirement. I created another storage class for the data upload, but then realized that there is no place to configure it. Velero's data upload seems to always use whatever storage class the original PV has.

I think exactly this may be the cause of us seeing a worker node go down / get into a zombie state over the weekend.

  • I upgraded Velero to v1.14
  • CSI Data Mover is enabled

Not long after, we saw worker nodes go NotReady because of high CPU usage: first one worker, which recovered, and then another.

Then, from Friday to Saturday, another worker went haywire with a huge CPU spike and finally ended up in a total zombie state: Pods stuck in Terminating for a prolonged time, with the node not recovering even after its CPU and memory usage had eased down.


I think this issue is pretty important and thank you for looking at it.

Have a great day.

@Borrelhapje

I think this can be closed, as this feature will be released with the next minor release?

@Lyndon-Li Lyndon-Li reopened this Aug 27, 2024
@Lyndon-Li Lyndon-Li added this to the v1.15 milestone Aug 27, 2024
@Lyndon-Li
Contributor

Fixed by #7982 and #8109
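For anyone landing here later: as far as I can tell from the v1.15 data-movement documentation, the fix exposes this through the node-agent configuration ConfigMap, whose backupPVC section maps a source storage class to the class used for the intermediate backup PVC. A sketch only; the exact flag, key names, and class names below are assumptions to be checked against the released docs:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # assumed name; referenced when starting the node agent via its
  # node-agent ConfigMap flag (see the v1.15 docs for the exact flag)
  name: node-agent-config
  namespace: velero
data:
  # maps the source PV's storage class ("longhorn-nvme-3r", hypothetical)
  # to the class used for the temporary backup PVC ("longhorn-backup",
  # matching the StorageClass sketch earlier in this thread)
  node-agent-config.json: |
    {
      "backupPVC": {
        "longhorn-nvme-3r": {
          "storageClass": "longhorn-backup",
          "readOnly": true
        }
      }
    }
```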
