
Velero helm release marks backup "failed" even after backing up all items successfully #216

Closed
tapanhalani opened this issue Feb 11, 2021 · 6 comments


@tapanhalani

What steps did you take and what happened:
I have installed the velero helm chart in an AKS cluster and configured the azure plugin with it for backups and volume snapshots. The backups start according to the configured schedule, and according to the velero logs all k8s objects are successfully backed up and the volume snapshots are created successfully in the Azure RG. But the backup object in k8s itself is marked as failed.
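For reference, the relevant part of my chart values looks roughly like this (a simplified sketch: the bucket, resource group, storage account, image tag, and cron expression are placeholders, and the exact value paths may differ slightly between chart versions):

configuration:
  provider: azure
  backupStorageLocation:
    name: azure
    bucket: <backup-container>                 # placeholder
    config:
      resourceGroup: <resource-group>          # placeholder
      storageAccount: <storage-account>        # placeholder
  volumeSnapshotLocation:
    name: azure
    config:
      resourceGroup: <snapshot-resource-group> # placeholder
initContainers:
  - name: velero-plugin-for-microsoft-azure
    image: velero/velero-plugin-for-microsoft-azure:<tag>   # placeholder tag
    volumeMounts:
      - mountPath: /target
        name: plugins
schedules:
  velero-mybackup:
    schedule: "<cron expression>"              # placeholder
    template:
      ttl: 240h0m0s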

What did you expect to happen:
Since the backup completed successfully, I expect the backup object to be marked "successful". If there is an error in the backup process, then there should be a clearer error message (or maybe I am missing something somewhere).

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

Here are the final velero logs, which mark the completion of the backup (and the subsequent failure for some unclear reason):

time="2021-02-11T06:04:08Z" level=info msg="Setting up backup store to persist the backup" backup=sys--bak/velero-mybackup-20210211060259 logSource="pkg/controller/backup_controller.go:632"
time="2021-02-11T06:04:11Z" level=warning msg="The specified default backup location named \"default\" was not found; for convenience, be sure to create one or make another backup location that is available the default" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:161"
time="2021-02-11T06:04:11Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58"
time="2021-02-11T06:04:11Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120"
time="2021-02-11T06:04:11Z" level=warning msg="The specified default backup location named \"default\" was not found; for convenience, be sure to create one or make another backup location that is available the default" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:161"
time="2021-02-11T06:04:14Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:642"
time="2021-02-11T06:04:14Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=sys--bak/velero-mybackup-20210211060259 logSource="pkg/controller/backup_controller.go:279"
time="2021-02-11T06:04:26Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58"
time="2021-02-11T06:04:26Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120"

Here is the output of the kubectl describe backup command (just the spec and status sections):

Spec:
  Default Volumes To Restic:  false
  Hooks:
  Storage Location:  azure
  Ttl:               240h0m0s
  Volume Snapshot Locations:
    azure
Status:
  Completion Timestamp:  2021-02-11T06:04:08Z
  Expiration:            2021-02-21T06:03:00Z
  Format Version:        1.1.0
  Phase:                 Failed
  Progress:
    Items Backed Up:           817
    Total Items:               817
  Start Timestamp:             2021-02-11T06:03:00Z
  Version:                     1
  Volume Snapshots Attempted:  1
  Volume Snapshots Completed:  1
Events:                        <none>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • helm version (use helm version): 3.5.2
  • helm chart version and app version:
    • helm3 (use helm list -n <YOUR NAMESPACE>): velero 2.14.8
  • Kubernetes version (use kubectl version): 1.19.0
  • Kubernetes installer & version: 1.19.3
  • Cloud provider or hardware configuration: Azure (AKS cluster)
@jenting
Collaborator

jenting commented Feb 17, 2021

@tapanhalani
From the log, the problem is that the Velero server can't find the default backup storage location. Please consider renaming the existing backup storage location "azure" to "default".
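With the chart, that roughly means setting the location name in the values and upgrading the release, for example (a sketch; the exact value path is assumed from the 2.x chart layout):

configuration:
  backupStorageLocation:
    name: default   # previously "azure"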

Also, please make sure the Velero CRD manifests are up-to-date by running

velero install --crds-only --dry-run -o yaml | kubectl apply -f -

since helm upgrade won't upgrade the CRD manifests for you.

@tapanhalani
Author

@jenting Thank you for your response. But the issue persists even after changing the name from "azure" to "default". Also, the "Checking for existing backup locations ready to be verified" message is an info-level log, whereas the "backup failed" message is an error-level log. So it seems to me that the former is expected to appear periodically (probably to check whether any backup location has been updated and needs to be verified), while the latter is an actual error.

I tried debugging with "debug"-level logs, but there is no additional information on the reason for this "backup failed" error:

time="2021-02-18T12:07:34Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58"
time="2021-02-18T12:07:34Z" level=debug msg="Backup location not ready to be validated" backupstoragelocation=default controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:81"
time="2021-02-18T12:07:34Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120"
time="2021-02-18T12:07:35Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:642"
time="2021-02-18T12:07:35Z" level=debug msg="plugin process exited" backup=sys--bak/velero-mybackup-20210218120607 cmd=/velero logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/velero pid=38
time="2021-02-18T12:07:35Z" level=debug msg="plugin exited" backup=sys--bak/velero-mybackup-20210218120607 cmd=/velero logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74"
time="2021-02-18T12:07:35Z" level=debug msg="plugin process exited" backup=sys--bak/velero-mybackup-20210218120607 cmd=/plugins/velero-plugin-for-microsoft-azure logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/plugins/velero-plugin-for-microsoft-azure pid=88
time="2021-02-18T12:07:35Z" level=debug msg="plugin exited" backup=sys--bak/velero-mybackup-20210218120607 cmd=/plugins/velero-plugin-for-microsoft-azure logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74"
time="2021-02-18T12:07:35Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=sys--bak/velero-mybackup-20210218120607 logSource="pkg/controller/backup_controller.go:279"
time="2021-02-18T12:07:35Z" level=debug msg="Updating backup's final status" controller=backup key=sys--bak/velero-mybackup-20210218120607 logSource="pkg/controller/backup_controller.go:294"
time="2021-02-18T12:07:35Z" level=debug msg="Backup has not expired yet, skipping" backup=sys--bak/velero-mybackup-20210218120607 controller=gc-controller expiration="2021-02-28 12:06:07 +0000 UTC" logSource="pkg/controller/gc_controller.go:131"
time="2021-02-18T12:07:54Z" level=debug msg="Checking for existing backup storage locations to sync into cluster" controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:124"
time="2021-02-18T12:07:54Z" level=debug msg="Checking if backups need to be synced at this time for this location" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:157"
time="2021-02-18T12:08:07Z" level=debug msg="Running processSchedule" controller=schedule key=sys--bak/velero-mybackup logSource="pkg/controller/schedule_controller.go:139"
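
(For reference, the debug output above just comes from raising the server's log level; with the chart this is roughly the value below, though the exact value name is an assumption; the underlying server flag is --log-level=debug.)

logLevel: debug   # assumed chart value; maps to the server's --log-level flag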

Regarding the CRDs, I deleted and re-installed them as part of the helm release itself, using the flag --set installCRDs=true during helm install. I can see the following CRDs installed in the cluster:

kubectl get crds | grep velero
backups.velero.io                             2021-02-18T12:01:47Z
backupstoragelocations.velero.io              2021-02-18T12:01:48Z
deletebackuprequests.velero.io                2021-02-18T12:01:48Z
downloadrequests.velero.io                    2021-02-18T12:01:48Z
podvolumebackups.velero.io                    2021-02-18T12:01:48Z
podvolumerestores.velero.io                   2021-02-18T12:01:49Z
resticrepositories.velero.io                  2021-02-18T12:01:49Z
restores.velero.io                            2021-02-18T12:01:49Z
schedules.velero.io                           2021-02-18T12:01:50Z
serverstatusrequests.velero.io                2021-02-18T12:01:50Z
volumesnapshotlocations.velero.io             2021-02-18T12:01:50Z

I think this process should install all the up-to-date CRDs.

@jenting
Collaborator

jenting commented Feb 18, 2021

Could you please run velero backup-location get to check whether your backup storage location is accessible from the Velero server?

@tapanhalani
Author

@jenting Here is the output of the above command:

NAME      PROVIDER   BUCKET/PREFIX         PHASE       LAST VALIDATED                  ACCESS MODE
default   azure      aks201cor5prd1-bak1   Available   2021-02-18 12:31:24 +0000 UTC   ReadWrite

On another note, I already had some backups in the cluster (with status Failed, but actually run to completion) from version 2.11.0 of the helm chart. When I upgraded the release to 2.14.8, I had to re-install the CRDs along with it, which deleted those backup objects from the k8s cluster (though not from the Azure storage account or the Azure volume-snapshot RG, which still held the backups and volume snapshots). So when velero 2.14.8 started, it found those pre-existing backups and tried to sync them into the k8s cluster, but got the following error:

time="2021-02-18T12:25:05Z" level=info msg="Found 4 backups in the backup location that do not exist in the cluster and need to be synced" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:197"
time="2021-02-18T12:25:05Z" level=info msg="Attempting to sync backup into cluster" backup=velero-mybackup-20210218050043 backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:205"
time="2021-02-18T12:25:06Z" level=error msg="Error getting backup metadata from backup store" backup=velero-mybackup-20210218050043 backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = storage: service returned error: StatusCode=404, ErrorCode=BlobNotFound, ErrorMessage=The specified blob does not exist.\nRequestId:d457e464-501e-0017-71f1-05e95d000000\nTime:2021-02-18T12:25:06.0563838Z, RequestInitiated=Thu, 18 Feb 2021 12:25:05 GMT, RequestId=d457e464-501e-0017-71f1-05e95d000000, API Version=2018-03-28, QueryParameterName=, QueryParameterValue=" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/persistence/object_store.go:261" error.function="github.com/vmware-tanzu/velero/pkg/persistence.(*objectBackupStore).GetBackupMetadata" logSource="pkg/controller/backup_sync_controller.go:209"

Upon checking in the Azure portal, the blob "velero-mybackup-20210218050043" exists in the storage account configured for the Azure velero plugin, but it still seems inaccessible from Velero. Is this indicative of a permissions issue in Azure?

@tapanhalani
Author

@jenting I have managed to resolve the above issue by raising the resource limits of my velero instance from 50m CPU / 200Mi memory to 1 CPU / 2Gi memory, as per the findings in this issue: vmware-tanzu/velero#1856.
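
In chart-values terms, the bump looks roughly like this (a sketch; requests can be raised similarly, and the chart's defaults may differ per version):

resources:
  limits:
    cpu: "1"
    memory: 2Gi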

@jenting
Collaborator

jenting commented Feb 25, 2021

Thank you @tapanhalani, good to know.

jenting closed this as completed on Feb 25, 2021