Backup fails with transport is closing #1856

Closed
suleymanakbas91 opened this issue Sep 9, 2019 · 22 comments
Labels: Area/Plugins (Issues related to plugin infra/internal plugins)

Comments

@suleymanakbas91 commented Sep 9, 2019

What steps did you take and what happened:

We have a CI/CD job that takes a backup of the cluster and then restores from it. Almost half of the time, the backup ends with this failure:

time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"

Any idea why this happens and is there anything we can do to prevent this?

Anything else you would like to add:

Here is the backup file we use:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations: 
  - default

We just deploy this file to the cluster using kubectl apply -f.
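For reference, that deploy step is just the following (a sketch; the filename kyma-backup.yaml is assumed):

$ kubectl apply -f kyma-backup.yaml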

Environment:

  • Velero version (use velero version): 1.0.0
  • Kubernetes version (use kubectl version): 1.13.9-gke.3
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release):
@prydonius (Contributor)

I believe the gRPC error comes from Velero trying to talk to a plugin. If you add the --log-level debug flag to your velero server Pod, we might be able to get more info about what's happening here.
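One way to do that, as a rough sketch (namespace and deployment names vary by install; in this thread the server appears to run as the backup deployment in the kyma-system namespace):

$ kubectl -n kyma-system edit deployment/backup
# in the velero server container, extend the args, e.g.:
#   args:
#   - server
#   - --log-level
#   - debug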

prydonius added the Area/Plugins and Waiting for info labels on Sep 9, 2019
@prydonius (Contributor)

looks like the same issue as #481

@suleymanakbas91 (Author)

It's strange because we are not using any plugins.

Here are the error logs after I set the log level to debug:

$ kubectl logs backup-75d69b8644-fjlz7 -n kyma-system | grep error
time="2019-09-10T14:03:28Z" level=error msg="reading plugin stderr" cmd=/velero controller=backup-sync error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=error msg="reading plugin stderr" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=debug msg="plugin process exited" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="signal: killed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/velero pid=136
Failed to fire hook: object logged as error does not satisfy error interface
time="2019-09-10T14:03:36Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 logSource="pkg/controller/backup_controller.go:230"

Full pod logs

@prydonius (Contributor)

@suleymanakbas91 good to know you don't have any external plugins, but Velero does include a set of default plugins, so this error is coming from one of those.

cc @skriss @nrb @carlisia any ideas?

@skriss (Contributor) commented Sep 10, 2019

@suleymanakbas91 do you have CPU/mem requests/limits defined for your Velero deployment?
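A quick way to check what the pod currently requests (a sketch; adjust the namespace and deployment name to your install):

$ kubectl -n kyma-system get deployment backup \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'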

@suleymanakbas91 (Author)

@skriss we don't specify any. You can also check out the Helm chart we use here: https://github.com/kyma-project/kyma/tree/master/resources/backup

@Crevil commented Sep 19, 2019

We have run into this as well. I just tried removing the CPU/mem limits from the deployment, and it looks much more stable.

@prydonius (Contributor)

@Crevil @suleymanakbas91 are you able to see how much CPU/mem the Velero Pod is using?
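If metrics-server is available in the cluster, a point-in-time reading can be taken with something like:

$ kubectl -n velero top pod
# shows current CPU and memory usage for each pod in the namespace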

@Crevil commented Sep 20, 2019

Sure thing @prydonius
[screenshot: Velero pod CPU/memory usage graph]

We run backups every hour as seen by the spikes.

@prydonius (Contributor)

@Crevil thanks for that! Were you previously using the default requests/limits provided by velero install? Your usage definitely goes over those defaults. Could you tell us a little more about what you're including in your backups (e.g. number of resources, whether you're using restic or not, and volume sizes if using restic)? Just trying to gauge whether it makes sense to increase the default requests/limits.

@Crevil commented Sep 21, 2019 via email

@Crevil commented Sep 23, 2019

We were using the limits as specified by the install command. Here is a shortened version of what we were running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi

We are running on AWS with a self-managed k8s cluster. I inspected one of our backups and we have around 4600 resources and 12 volumes backed up with EC2 snapshots.
We are not using restic.

Let me know if you need anything else.
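For anyone wanting to pull the same numbers for their own backups, the resource and volume counts are listed by (sketch; <backup-name> is a placeholder):

$ velero backup describe <backup-name> --details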

@prydonius (Contributor)

Really appreciate the info @Crevil. Strange, your backups are smaller than this user's, though their resource usage was lower.

It's difficult to come up with a baseline that works for everyone; our best recommendation would be to monitor resource usage and set appropriate requests/limits for your environment. Has the Pod remained stable since removing the default requests/limits?
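As a sketch of what adjusting those values can look like (the namespace, deployment name, and numbers below are examples, not recommendations):

$ kubectl -n velero set resources deployment/velero \
    --requests=cpu=500m,memory=256Mi \
    --limits=cpu=1,memory=512Mi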

@suleymanakbas91 are you still experiencing this issue?

@Crevil commented Sep 24, 2019 via email

@skriss (Contributor) commented Sep 30, 2019

Closing this out as inactive. Feel free to reach out again as needed.

skriss closed this as completed on Sep 30, 2019
@SameeraGrandhi

I'm trying to create a Velero backup with the CSI feature enabled on my Azure cluster, following the instructions from the documentation.

I observed that the backup actually completed but then failed at the end with the same transport error mentioned in this thread. Here are the log events I observed on the velero pod.

time="2020-06-17T10:59:14Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
time="2020-06-17T10:59:14Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/csi-try4 logSource="pkg/controller/backup_controller.go:273"

Any idea on what went wrong during the backup?
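For digging further, the per-backup log usually has more detail than the server log; as a sketch (backup name taken from the key in the error above):

$ velero backup describe csi-try4 --details
$ velero backup logs csi-try4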

@ghost commented Jun 18, 2020

Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (backup to the storage account, snapshots); only the finalization seems to break.

velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"

Just to be complete:
Velero: 1.4.0
Azure-Plugin for Velero: 1.1.0

@skriss (Contributor) commented Jun 18, 2020

I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren't playing nice together. Let us know if that fixes things!
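If the deployment was created with velero install, the requests/limits can also be set via the install flags; a sketch with illustrative values (other required install flags omitted):

$ velero install \
    --velero-pod-cpu-request 500m --velero-pod-mem-request 256Mi \
    --velero-pod-cpu-limit 1 --velero-pod-mem-limit 512Mi \
    ...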

@ghost commented Jun 18, 2020

Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
Thanks a lot for your fast reply!

@Berndinox commented Jun 22, 2020

> Same for me on Azure AKS with advanced networking.
> Everything seems to proceed as expected (Backup on Storage Account, Snapshots), just the finalization seems to break.
>
> velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
> velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"
>
> Just to be complete:
> Velero: 1.4.0
> Azure-Plugin for Velero: 1.1.0

Same here.
AKS 1.16
Velero 1.4

Confirmed: increasing the limits fixed the issue.

@skriss (Contributor) commented Jun 22, 2020

@vmware-tanzu/velero-maintainers I'm guessing we should lower the value for this setting. I set it at 100MB since that's the max Azure allows, which means Velero will create the minimum number of chunks, but I think it's causing Velero to exceed its default limits regularly.

We could probably drop the chunk size down to something significantly smaller and it wouldn't have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
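If/when the plugin exposes that value, tuning it down might look roughly like this. This assumes a blockSizeInBytes config key on the BackupStorageLocation, which may not exist in the plugin version you are running; check the Azure plugin docs for the actual key name.

$ kubectl -n velero patch backupstoragelocation default --type merge \
    -p '{"spec":{"config":{"blockSizeInBytes":"10485760"}}}'
# 10 MiB blocks instead of 100 MiB (the value is a string in the config map)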

@fatihdestegul

> Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
> Thanks a lot for your fast reply!

This has solved my problem...

Velero 1.5.1
Azure plugin: 1.1.0
