Timeout Values Not Adhered To ? #6879

Closed
pseymournutanix opened this issue Sep 27, 2023 · 14 comments
Assignees: blackpiglet
Labels: area/fs-backup, Candidate for close

Comments

@pseymournutanix

Have a backup of a single namespace with a large volume using v1.12.

Have the values set in the deployment as

        - server
        - --uploader-type=restic
        - --fs-backup-timeout=8h
        - --client-burst=100
        - --client-qps=75
        - --default-volumes-to-fs-backup

But after just over 2 hours the backup failed with what looks like a timeout:

error: signal: killed stderr: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:306" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=eulabeia-99c45f965-d4wqj

bundle-2023-09-27-08-41-43.tar.gz

@pseymournutanix
Author

The data in the volume is quite dynamic, I believe, from what the service owners tell me, if that is an issue at all.

@sseago
Collaborator

sseago commented Sep 27, 2023

Since the error wasn't a "timed out waiting for..." error, I don't think this is a case of the timeout not being honored. It failed for different reasons before hitting timeout. Did the node agent processing the podvolumebackup restart?

@blackpiglet
Contributor

There is already a thread in the Slack channel to discuss this topic: https://kubernetes.slack.com/archives/C6VCGP4MT/p1695800059893369

Looks like the Restic backup command was killed. Usually, this is due to OOM.
Please check the node-agent pod status.
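
For example, something like the following can show whether a node-agent pod was OOM killed (assuming Velero runs in the velero namespace and the pods carry the usual name=node-agent label):

    # list node-agent pods and their restart counts
    kubectl -n velero get pods -l name=node-agent

    # an OOM kill shows up as reason OOMKilled in the last container state
    kubectl -n velero describe pod <node-agent-pod> | grep -A 5 "Last State"

    # current memory usage (requires metrics-server)
    kubectl -n velero top pod -l name=node-agent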

@blackpiglet
Contributor

Two PVC backups failed with signal: killed. The following is the PVC and Pod information for both. Please check whether there is any difference from the other volumes.

                "tags": {
                    "backup": "eulabeia-corp-daily-20230927040036",
                    "backup-uid": "88924f8c-c3a9-483b-8821-e1b3d37a7439",
                    "ns": "eulabeia",
                    "pod": "eulabeia-99c45f965-d4wqj",
                    "pod-uid": "60d40e71-eff0-4918-a89c-6a09bf5a46be",
                    "pvc-uid": "e4aa2083-1ed3-4517-977c-c33b3335988c",
                    "volume": "persistent"
                },
                "tags": {
                    "backup": "dre-services-corp-daily-20230926065521",
                    "backup-uid": "85a0d152-4470-4c06-8985-1a4fb553ea95",
                    "ns": "eulabeia",
                    "pod": "eulabeia-99c45f965-d4wqj",
                    "pod-uid": "60d40e71-eff0-4918-a89c-6a09bf5a46be",
                    "pvc-uid": "e4aa2083-1ed3-4517-977c-c33b3335988c",
                    "volume": "persistent"
                },

One way to avoid this error is to enlarge the resources allocated to the node-agent (a patch sketch follows below).
Another way is to use Kopia as the uploader instead of Restic; Kopia's resource utilization was better in our performance tests.
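
A minimal sketch of raising those limits with a strategic merge patch, assuming the DaemonSet and its container are both named node-agent and Velero runs in the velero namespace (the values are only examples):

    kubectl -n velero patch daemonset node-agent --type strategic -p '{
      "spec": {"template": {"spec": {"containers": [{
        "name": "node-agent",
        "resources": {
          "requests": {"cpu": "500m", "memory": "1Gi"},
          "limits": {"cpu": "2", "memory": "4Gi"}
        }
      }]}}}'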

@blackpiglet
Contributor

The data in the volume is quite dynamic, I believe, from what the service owners tell me, if that is an issue at all.

Do you mean the data in the volume is constantly changing?
If so, filesystem backup may not be a good choice for this scenario, although this failure is not triggered by that problem. In a filesystem backup, the uploader first walks all the data in the volume and builds overall metadata for it, then uploads the data according to that metadata. If files or directories recorded in the metadata can no longer be found in the filesystem, that will also fail the backup.

If possible, it's better to back the volume up with a snapshot, for example via the CSI plugin or a Velero native volume snapshotter.
If the environment doesn't support that, using a backup hook to freeze the filesystem during the backup can also solve the problem (see the sketch below): https://velero.io/docs/v1.12/backup-hooks/
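
A rough sketch of such hooks using the documented pre.hook.backup.velero.io / post.hook.backup.velero.io pod annotations; the container name (app) and mount path (/persistent) are placeholders, and fsfreeze must be available inside that container:

    kubectl -n eulabeia annotate pod eulabeia-99c45f965-d4wqj \
      pre.hook.backup.velero.io/container=app \
      pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/persistent"]' \
      post.hook.backup.velero.io/container=app \
      post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/persistent"]'

In practice the annotations are usually added to the Deployment's pod template so they survive pod restarts.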

@pseymournutanix
Author

Thank you.

The pods didn't appear to be OOM killed with restic, as none registered restarts. I tried switching to kopia and the backups failed, and the node-agent pod was certainly OOM killed this time.

Due to the current limitations of the Nutanix CSI driver and snapshots, I can only create snapshots on the same Nutanix cluster, which is a workaround but still has risks. So I am doing that, and I am still trying to get an FS backup to an object store working at a lower frequency, so still working on it.

@draghuram
Contributor

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.
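
For reference, a sketch of that change as a strategic merge patch, assuming the stock node-agent DaemonSet mounts the kubelet pods directory through a volume named host-pods:

    kubectl -n velero patch daemonset node-agent --type strategic -p '{
      "spec": {"template": {"spec": {"volumes": [{
        "name": "host-pods",
        "hostPath": {"path": "/var/nutanix/var/lib/kubelet/pods"}
      }]}}}'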

@blackpiglet
Contributor

blackpiglet commented Sep 29, 2023

If you are using Kopia as the uploader, and the node-agent pods got killed by OOM, you may consider enlarging the memory settings of the node-agent.

Another thing: snapshot data movement was introduced in Velero v1.12, which may address your need (a usage sketch follows): https://velero.io/docs/v1.12/csi-snapshot-data-movement/
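
A minimal usage sketch, assuming the CSI plugin is installed and the v1.12 --snapshot-move-data flag and DataUpload CRDs are available:

    # move the CSI snapshot data to the object store instead of leaving it on the cluster
    velero backup create eulabeia-datamover \
      --include-namespaces eulabeia \
      --snapshot-move-data

    # watch the data upload progress
    kubectl -n velero get datauploads -w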

@blackpiglet blackpiglet self-assigned this Sep 29, 2023
@pseymournutanix
Author

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.

Yes I have thank you :)

@pseymournutanix
Author

At present on our test system I have changed to "kopia" and set the memory limit for the node-agent pods to 4GB.

I have created a snapshot backup, and a FS backup to another Nutanix cluster object store.

The snapshot data movement looks very useful; I will check into that and see if I can utilise it.

Thanks for all your help. We'll find out more over the weekend.

@draghuram
Contributor

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.

Yes I have thank you :)

Thanks for confirming. I opened #6902 to update docs.

@pseymournutanix
Author

What is concerning is that a snapshot backup completes successfully:

CSI Volume Snapshots:
Snapshot Content Name: snapcontent-0dcb80e2-8191-44b0-978d-8fd7e4d8b742
  Storage Snapshot ID: NutanixVolumes-ed029a2f-25ec-4a6b-a552-815a78f5f1a9
  Snapshot Size (bytes): 8589934592
  Ready to use: true
...

But even though the k8s cluster has VolumeSnapshot objects during the backup, after the backup they disappear.

I see these messages on the Nutanix console

[Screenshot 2023-09-30 at 09:56:52: Nutanix console messages]

But there is no protection domain, and I can see no record of the snapshots existing anywhere?

@blackpiglet
Contributor

@pseymournutanix
Do you mean the VolumeSnapshot created during the Velero CSI backup is deleted after the backup is completed?

This is the expected behavior. The reason is that the VolumeSnapshot is a namespace-scoped resource: when the namespace is deleted, it's possible the snapshot related to the VolumeSnapshot will be deleted too. To avoid this, the VolumeSnapshot is cleaned up during the backup creation process.

You can find the snapshots created during the backup in the following two ways (a command sketch follows the list):

  • Run the velero backup describe <backup-name> --details command. It will show the CSI snapshot information.
  • Go to the backup's directory in the object storage bucket. There is a metadata file called <backup-name>-csi-volumesnapshots.json.gz.
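
For example (the bucket name and S3 endpoint are placeholders; add your configured prefix to the path if the backup storage location uses one):

    # show the CSI snapshot details recorded in the backup
    velero backup describe eulabeia-corp-daily-20230927040036 --details

    # or pull the snapshot metadata straight from the object store
    aws --endpoint-url <s3-endpoint> s3 cp \
      s3://<bucket>/backups/eulabeia-corp-daily-20230927040036/eulabeia-corp-daily-20230927040036-csi-volumesnapshots.json.gz - \
      | gunzip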

@blackpiglet blackpiglet added the Candidate for close Issues that should be closed and need a team review before closing label Oct 1, 2023
@pseymournutanix
Author

OK thanks that explains it.

With the kopia-based filesystem backups I am seeing the pods go to 2.8GB memory, which is now covered, but it looks like I just ran into #6880:

  Velero:   name: /eulabeia-5d459bb656-d5qcb error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing mount/main.tgz: ConcatenateObjects is not supported
Error when processing mount/main/.git/objects/pack/pack-83e392568146df3ee655253fa7ecd6dc7aa2ebb9.pack: ConcatenateObjects is not supported
Error when processing mount/random/main/.git/objects/pack/pack-37c750534ce73718ae1ffdc51aa0148c1d168128.pack: ConcatenateObjects is not supported
     name: /mongodb-dre-0 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2-4465263979934429276.wt: ConcatenateObjects is not supported
Error when processing index-3-4465263979934429276.wt: ConcatenateObjects is not supported
Error when processing index-4-4465263979934429276.wt: ConcatenateObjects is not supported
     name: /mongodb-dre-1 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2--6520432799943930152.wt: ConcatenateObjects is not supported
Error when processing index-3--6520432799943930152.wt: ConcatenateObjects is not supported
Error when processing index-4--6520432799943930152.wt: ConcatenateObjects is not supported
     name: /mongodb-dre-2 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2--373633045728996489.wt: ConcatenateObjects is not supported
Error when processing index-3--373633045728996489.wt: ConcatenateObjects is not supported
Error when processing index-4--373633045728996489.wt: ConcatenateObjects is not supported

I see a change/fix has been merged. Hopefully 1.12.1 will address this.

Thank you so much for the help. Really appreciated.
