Timeout Values Not Adhered To? #6879
The data in the volume is quite dynamic, I believe, according to the service owners, if that is an issue at all.
Since the error wasn't a "timed out waiting for..." error, I don't think this is a case of the timeout not being honored. It failed for different reasons before hitting the timeout. Did the node agent processing the podvolumebackup restart?
There is already a thread in the Slack channel discussing this topic: https://kubernetes.slack.com/archives/C6VCGP4MT/p1695800059893369
It looks like the Restic backup command was killed. Usually this is due to OOM.
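For reference, one quick way to check whether the backup process was OOM-killed is to look at the last terminated state of the node-agent containers. This is a minimal sketch; it assumes the default "velero" namespace and the default `name=node-agent` pod label, which may differ in a customized install:

```sh
# Check the node-agent pods for restarts and for an OOMKilled last-terminated state.
# Namespace and label selector are the upstream defaults and may need adjusting.
kubectl -n velero get pods -l name=node-agent
kubectl -n velero get pods -l name=node-agent -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```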
There are two PVC backups that failed with:
One way to avoid this error is to enlarge the resources allocated to the node-agent.
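For illustration, one way such a change might be applied is with a strategic merge patch on the node-agent daemonset. The namespace, daemonset, and container names below are the upstream defaults ("velero" / "node-agent"), and the memory values are placeholders only:

```sh
# Raise the memory request/limit on the node-agent daemonset (values are examples only).
# A strategic merge patch keyed on the container name leaves the rest of the spec untouched.
kubectl -n velero patch daemonset node-agent --patch '
spec:
  template:
    spec:
      containers:
      - name: node-agent
        resources:
          requests:
            memory: 1Gi
          limits:
            memory: 4Gi'
```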
Do you mean the data in the volume is constantly changing? If possible, it's better to use a snapshot to back the volume up, for example via the CSI plugin or the Velero native volume snapshotter.
Thank you. The pods didn't appear to be OOM-killed, as none registered restarts. Due to the current limitations of the Nutanix CSI driver and snapshots, I can create snapshots on the same Nutanix cluster, which is a workaround but still has risks. So I am doing that, and I am still trying to get a FS backup to an object store at a lower frequency, so still working on it.
Hi @pseymournutanix, since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing the Restic backup? You might have done so, but I wanted to confirm.
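For context, on clusters where the kubelet root directory is not /var/lib/kubelet, the node-agent daemonset's host-pods volume has to point at the real path. A hedged sketch of what that adjustment might look like, assuming the stock daemonset layout and the default "host-pods" volume name:

```sh
# Point the node-agent's host-pods hostPath volume at the Nutanix kubelet root.
# Volume name and default path are assumptions based on the stock Velero daemonset.
kubectl -n velero patch daemonset node-agent --patch '
spec:
  template:
    spec:
      volumes:
      - name: host-pods
        hostPath:
          path: /var/nutanix/var/lib/kubelet/pods'
```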
If you are using Kopia in the node-agent and the node-agent pods got killed by OOM, you may consider enlarging the memory setting of the node-agent. Another option is the snapshot data mover introduced in Velero v1.12, which may address your need: https://velero.io/docs/v1.12/csi-snapshot-data-movement/
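As a rough sketch of that second option, per the linked documentation the data movement is requested per backup with a flag on `velero backup create`; the backup name and namespace below are placeholders:

```sh
# Ask Velero 1.12 to move the CSI snapshot data to object storage instead of
# leaving it only on the cluster (requires the CSI feature and a configured
# backup storage location).
velero backup create ns-backup-dm \
  --include-namespaces my-namespace \
  --snapshot-move-data
```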
Yes I have, thank you :)
At present on our test system I have changed to "kopia" and set the limits to 4GB for the node-agent. I have created a snapshot backup, and an FS backup to another Nutanix cluster's object store. The snapshot data movement looks very useful; I will check into that and see if I can utilise it. Thanks for all your help. We'll find out more over the weekend.
Thanks for confirming. I opened #6902 to update the docs.
What is concerning is that a snapshot backup completes successfully.
But even though the k8s cluster has … I see these messages on the Nutanix console, yet there is no protection domain and I can see no record of the snapshots existing anywhere?
@pseymournutanix This is the expected behavior. The reason is that the VolumeSnapshot is a namespace-scoped resource: when the namespace is deleted, it's possible that the snapshot related to the VolumeSnapshot will be deleted too. To avoid this, the VolumeSnapshot is cleaned up during the backup creation process. You can find the snapshot created during the backup in the following two ways:
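As a general illustration (not necessarily the two ways referred to above), the snapshot information recorded for a backup can usually be inspected like this; "my-backup" is a placeholder name:

```sh
# Show the backup details, which typically include the CSI snapshot information.
velero backup describe my-backup --details

# Download the backup contents from object storage; the archive also contains the
# JSON for the snapshot-related resources captured at backup time.
velero backup download my-backup
```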
OK, thanks, that explains it. With the … I see a change/fix has been merged; hopefully 1.12.1 will address this. Thank you so much for the help. Really appreciated.
Have a backup of a single namespace with a large volume using v1.12.
Have the values set in the deployment as
But after just over 2 hours the backup failed with what looks like a timeout
bundle-2023-09-27-08-41-43.tar.gz
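For anyone checking their own timeout configuration: the file-system backup timeout is controlled by a flag on the Velero server deployment (documented as --fs-backup-timeout in recent releases; older releases called it --restic-timeout). A hedged sketch of inspecting and raising it, assuming the default "velero" namespace and deployment name:

```sh
# Inspect the current server arguments to see which timeout flags are set.
kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Then add or adjust the flag on the server container, e.g. --fs-backup-timeout=6h,
# and let the deployment roll out the change.
kubectl -n velero edit deployment velero
```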