Timeout Values Not Adhered To ? #6879

Closed
pseymournutanix opened this issue Sep 27, 2023 · 14 comments
Assignees: blackpiglet
Labels: area/fs-backup, Candidate for close

Comments

@pseymournutanix

Have a backup of a single namespace with a large volume using v1.12.

Have the values set in the deployment as

        - server
        - --uploader-type=restic
        - --fs-backup-timeout=8h
        - --client-burst=100
        - --client-qps=75
        - --default-volumes-to-fs-backup

But after just over 2 hours the backup failed with what looks like a timeout:

error: signal: killed stderr: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:306" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=eulabeia-99c45f965-d4wqj

bundle-2023-09-27-08-41-43.tar.gz

@pseymournutanix
Author

The data in the volume is quite dynamic, I believe, from what the service owners tell me, if that is an issue at all.

@sseago
Collaborator

sseago commented Sep 27, 2023

Since the error wasn't a "timed out waiting for..." error, I don't think this is a case of the timeout not being honored. It failed for different reasons before hitting timeout. Did the node agent processing the podvolumebackup restart?

@blackpiglet
Contributor

There is already a thread in the Slack channel to discuss this topic: https://kubernetes.slack.com/archives/C6VCGP4MT/p1695800059893369

Looks like the Restic backup command was killed. Usually, this is due to OOM.
Please check the node-agent pod status.
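
For example, something like the following can show whether a node-agent pod was OOM killed (assuming Velero runs in the velero namespace and the pods carry the usual name=node-agent label):

    # list node-agent pods and their restart counts
    kubectl -n velero get pods -l name=node-agent

    # an OOM kill shows up as reason OOMKilled in the last container state
    kubectl -n velero describe pod <node-agent-pod> | grep -A 5 "Last State"

    # current memory usage (requires metrics-server)
    kubectl -n velero top pod -l name=node-agent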

@blackpiglet
Contributor

Two PVC backups failed with signal: killed. The following is the PVC and Pod information for both. Please check whether there is any difference from the other volumes.

                "tags": {
                    "backup": "eulabeia-corp-daily-20230927040036",
                    "backup-uid": "88924f8c-c3a9-483b-8821-e1b3d37a7439",
                    "ns": "eulabeia",
                    "pod": "eulabeia-99c45f965-d4wqj",
                    "pod-uid": "60d40e71-eff0-4918-a89c-6a09bf5a46be",
                    "pvc-uid": "e4aa2083-1ed3-4517-977c-c33b3335988c",
                    "volume": "persistent"
                },
                "tags": {
                    "backup": "dre-services-corp-daily-20230926065521",
                    "backup-uid": "85a0d152-4470-4c06-8985-1a4fb553ea95",
                    "ns": "eulabeia",
                    "pod": "eulabeia-99c45f965-d4wqj",
                    "pod-uid": "60d40e71-eff0-4918-a89c-6a09bf5a46be",
                    "pvc-uid": "e4aa2083-1ed3-4517-977c-c33b3335988c",
                    "volume": "persistent"
                },

One way to avoid this error is to enlarge the resources allocated to the node-agent (a patch sketch follows below).
Another way is to use Kopia as the uploader instead of Restic; Kopia's resource utilization was better in our performance tests.
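
A minimal sketch of raising those limits with a strategic merge patch, assuming the DaemonSet and its container are both named node-agent and Velero runs in the velero namespace (the values are only examples):

    kubectl -n velero patch daemonset node-agent --type strategic -p '{
      "spec": {"template": {"spec": {"containers": [{
        "name": "node-agent",
        "resources": {
          "requests": {"cpu": "500m", "memory": "1Gi"},
          "limits": {"cpu": "2", "memory": "4Gi"}
        }
      }]}}}'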

@blackpiglet
Contributor

The data in the volume is quite dynamic, I believe, from what the service owners tell me, if that is an issue at all.

Do you mean the data in the volume is constantly changing?
If so, filesystem backup may not be a good choice for this scenario, although this failure is not triggered by that problem. In a filesystem backup, the uploader first walks all the data in the volume and builds overall metadata for it, then uploads the data according to that metadata. If files or directories recorded in the metadata can no longer be found in the filesystem, that will also fail the backup.

If possible, it's better to back the volume up with a snapshot, for example via the CSI plugin or a Velero native volume snapshotter.
If the environment doesn't support that, using a backup hook to freeze the filesystem during the backup can also solve the problem (see the sketch below): https://velero.io/docs/v1.12/backup-hooks/
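
A rough sketch of such hooks using the documented pre.hook.backup.velero.io / post.hook.backup.velero.io pod annotations; the container name (app) and mount path (/persistent) are placeholders, and fsfreeze must be available inside that container:

    kubectl -n eulabeia annotate pod eulabeia-99c45f965-d4wqj \
      pre.hook.backup.velero.io/container=app \
      pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/persistent"]' \
      post.hook.backup.velero.io/container=app \
      post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/persistent"]'

In practice the annotations are usually added to the Deployment's pod template so they survive pod restarts.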

@pseymournutanix
Author

Thank you.

The pods didn't appear to be OOM killed with restic, as none registered restarts. I tried switching to kopia and the backups failed, and the node-agent pod was certainly OOM killed this time.

Due to the current limitations of the Nutanix CSI driver and snapshots, I can only create snapshots on the same Nutanix cluster, which is a workaround but still has risks. So I am doing that, and I am still trying to get an FS backup to an object store working at a lower frequency, so still working on it.

@draghuram
Contributor

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.
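
For reference, a sketch of that change as a strategic merge patch, assuming the stock node-agent DaemonSet mounts the kubelet pods directory through a volume named host-pods:

    kubectl -n velero patch daemonset node-agent --type strategic -p '{
      "spec": {"template": {"spec": {"volumes": [{
        "name": "host-pods",
        "hostPath": {"path": "/var/nutanix/var/lib/kubelet/pods"}
      }]}}}'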

@blackpiglet
Contributor

blackpiglet commented Sep 29, 2023

If you are using Kopia as the uploader, and the node-agent pods got killed by OOM, you may consider enlarging the memory settings of the node-agent.

Another thing: snapshot data movement was introduced in Velero v1.12, which may address your need (a usage sketch follows): https://velero.io/docs/v1.12/csi-snapshot-data-movement/
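
A minimal usage sketch, assuming the CSI plugin is installed and the v1.12 --snapshot-move-data flag and DataUpload CRDs are available:

    # move the CSI snapshot data to the object store instead of leaving it on the cluster
    velero backup create eulabeia-datamover \
      --include-namespaces eulabeia \
      --snapshot-move-data

    # watch the data upload progress
    kubectl -n velero get datauploads -w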

@blackpiglet blackpiglet self-assigned this Sep 29, 2023
@pseymournutanix
Author

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.

Yes I have thank you :)

@pseymournutanix
Author

At present on our test system I have changed to "kopia" and set the memory limit for the node-agent pods to 4GB.

I have created a snapshot backup, and a FS backup to another Nutanix cluster object store.

The snapshot data movement looks very useful; I will check into that and see if I can utilise it.

Thanks for all your help. We'll find out more over the weekend.

@draghuram
Contributor

Hi @pseymournutanix, Since you mentioned Nutanix, have you updated the kubelet hostPath to "/var/nutanix/var/lib/kubelet" before doing Restic backup? You might have done so but I wanted to confirm.

Yes I have thank you :)

Thanks for confirming. I opened #6902 to update docs.

@pseymournutanix
Author

What is concerning is that a snapshot backup completes successfully:

CSI Volume Snapshots:
Snapshot Content Name: snapcontent-0dcb80e2-8191-44b0-978d-8fd7e4d8b742
  Storage Snapshot ID: NutanixVolumes-ed029a2f-25ec-4a6b-a552-815a78f5f1a9
  Snapshot Size (bytes): 8589934592
  Ready to use: true
...

But even though the k8s cluster has VolumeSnapshot objects during the backup, after the backup they disappear.

I see these messages on the Nutanix console

[Screenshot 2023-09-30 at 09:56:52: Nutanix console messages]

But there is no protection domain, and I can see no record of the snapshots existing anywhere?

@blackpiglet
Contributor

@pseymournutanix
Do you mean the VolumeSnapshot created during the Velero CSI backup is deleted after the backup is completed?

This is the expected behavior. The reason is that the VolumeSnapshot is a namespace-scoped resource: when the namespace is deleted, it's possible the snapshot related to the VolumeSnapshot will be deleted too. To avoid this, the VolumeSnapshot is cleaned up during the backup creation process.

You can find the snapshots created during the backup in the following two ways (a command sketch follows the list):

  • Run the velero backup describe <backup-name> --details command. It will show the CSI snapshot information.
  • Go to the backup's directory in the object storage bucket. There is a metadata file called <backup-name>-csi-volumesnapshots.json.gz.
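
For example (the bucket name and S3 endpoint are placeholders; add your configured prefix to the path if the backup storage location uses one):

    # show the CSI snapshot details recorded in the backup
    velero backup describe eulabeia-corp-daily-20230927040036 --details

    # or pull the snapshot metadata straight from the object store
    aws --endpoint-url <s3-endpoint> s3 cp \
      s3://<bucket>/backups/eulabeia-corp-daily-20230927040036/eulabeia-corp-daily-20230927040036-csi-volumesnapshots.json.gz - \
      | gunzip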

@blackpiglet blackpiglet added the Candidate for close Issues that should be closed and need a team review before closing label Oct 1, 2023
@pseymournutanix
Author

OK thanks that explains it.

With the kopia-based filesystem backups I am seeing the pods go to 2.8GB memory, which is now covered, but it looks like I just ran into #6880:

  Velero:   name: /eulabeia-5d459bb656-d5qcb error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing mount/main.tgz: ConcatenateObjects is not supported
Error when processing mount/main/.git/objects/pack/pack-83e392568146df3ee655253fa7ecd6dc7aa2ebb9.pack: ConcatenateObjects is not supported
Error when processing mount/random/main/.git/objects/pack/pack-37c750534ce73718ae1ffdc51aa0148c1d168128.pack: ConcatenateObjects is not supported
     name: /mongodb-dre-0 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2-4465263979934429276.wt: ConcatenateObjects is not supported
Error when processing index-3-4465263979934429276.wt: ConcatenateObjects is not supported
Error when processing index-4-4465263979934429276.wt: ConcatenateObjects is not supported
     name: /mongodb-dre-1 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2--6520432799943930152.wt: ConcatenateObjects is not supported
Error when processing index-3--6520432799943930152.wt: ConcatenateObjects is not supported
Error when processing index-4--6520432799943930152.wt: ConcatenateObjects is not supported
     name: /mongodb-dre-2 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing collection-2--373633045728996489.wt: ConcatenateObjects is not supported
Error when processing index-3--373633045728996489.wt: ConcatenateObjects is not supported
Error when processing index-4--373633045728996489.wt: ConcatenateObjects is not supported

I see a change/fix has been merged. Hopefully 1.12.1 will address this.

Thank you so much for the help. Really appreciated.
