Backup fails with transport is closing #1856

Closed
suleymanakbas91 opened this issue Sep 9, 2019 · 22 comments
Labels: Area/Plugins (Issues related to plugin infra/internal plugins)

Comments

@suleymanakbas91 commented Sep 9, 2019

What steps did you take and what happened:

We have a CI/CD job that takes a backup of the cluster and then restores from it. Almost half of the time, the backup ends with this failure:

time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"

Any idea why this happens and is there anything we can do to prevent this?

Anything else you would like to add:

Here is the backup file we use:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations: 
  - default

We just deploy this file to the cluster using kubectl apply -f.
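For reference, that deploy step is just the following (a sketch; the filename kyma-backup.yaml is assumed):

$ kubectl apply -f kyma-backup.yaml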

Environment:

  • Velero version (use velero version): 1.0.0
  • Kubernetes version (use kubectl version): 1.13.9-gke.3
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release):
@prydonius (Contributor)

I believe the gRPC error comes from Velero trying to talk to a plugin. If you add the --log-level debug flag to your velero server Pod, we might be able to get more info about what's happening here.
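One way to do that, as a rough sketch (namespace and deployment names vary by install; in this thread the server appears to run as the backup deployment in the kyma-system namespace):

$ kubectl -n kyma-system edit deployment/backup
# in the velero server container, extend the args, e.g.:
#   args:
#   - server
#   - --log-level
#   - debug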

prydonius added the Area/Plugins and Waiting for info labels on Sep 9, 2019
@prydonius (Contributor)

looks like the same issue as #481

@suleymanakbas91 (Author)

It's strange because we are not using any plugins.

Here are the error logs after I set the log level to debug:

$ kubectl logs backup-75d69b8644-fjlz7 -n kyma-system | grep error
time="2019-09-10T14:03:28Z" level=error msg="reading plugin stderr" cmd=/velero controller=backup-sync error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=error msg="reading plugin stderr" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=debug msg="plugin process exited" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="signal: killed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/velero pid=136
Failed to fire hook: object logged as error does not satisfy error interface
time="2019-09-10T14:03:36Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 logSource="pkg/controller/backup_controller.go:230"

Full pod logs

@prydonius (Contributor)

@suleymanakbas91 good to know you don't have any external plugins, but Velero does include a set of default plugins, so this error is coming from one of those.

cc @skriss @nrb @carlisia any ideas?

@skriss (Contributor) commented Sep 10, 2019

@suleymanakbas91 do you have CPU/mem requests/limits defined for your Velero deployment?
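A quick way to check what the pod currently requests (a sketch; adjust the namespace and deployment name to your install):

$ kubectl -n kyma-system get deployment backup \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'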

@suleymanakbas91 (Author)

@skriss we don't specify any. You can also check out the Helm chart we use here: https://github.com/kyma-project/kyma/tree/master/resources/backup

@Crevil commented Sep 19, 2019

We have run into this as well. I just tried removing the CPU/mem limits from the deployment, and it looks much more stable.

@prydonius (Contributor)

@Crevil @suleymanakbas91 are you able to see how much CPU/mem the Velero Pod is using?
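If metrics-server is available in the cluster, a point-in-time reading can be taken with something like:

$ kubectl -n velero top pod
# shows current CPU and memory usage for each pod in the namespace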

@Crevil commented Sep 20, 2019

Sure thing @prydonius
[screenshot: Velero pod CPU/memory usage graph]

We run backups every hour as seen by the spikes.

@prydonius (Contributor)

@Crevil thanks for that! Were you previously using the default requests/limits provided by velero install? Your usage definitely goes over those defaults. Could you tell us a little more about what you're including in your backups (e.g. number of resources, whether you're using restic or not, and volume sizes if using restic)? Just trying to gauge whether it makes sense to increase the default requests/limits.

@Crevil commented Sep 21, 2019 via email

@Crevil commented Sep 23, 2019

We were using the limits as specified by the install command. Here is a shortened version of what we were running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi

We are running on AWS with a self-managed k8s cluster. I inspected one of our backups and we have around 4600 resources and 12 volumes backed up with EC2 snapshots.
We are not using restic.

Let me know if you need anything else.
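For anyone wanting to pull the same numbers for their own backups, the resource and volume counts are listed by (sketch; <backup-name> is a placeholder):

$ velero backup describe <backup-name> --details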

@prydonius (Contributor)

Really appreciate the info @Crevil. Strange, your backups are smaller than this user's, though their resource usage was lower.

It's difficult to come up with a baseline that works for everyone; our best recommendation would be to monitor resource usage and set appropriate requests/limits for your environment. Has the Pod remained stable since removing the default requests/limits?
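As a sketch of what adjusting those values can look like (the namespace, deployment name, and numbers below are examples, not recommendations):

$ kubectl -n velero set resources deployment/velero \
    --requests=cpu=500m,memory=256Mi \
    --limits=cpu=1,memory=512Mi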

@suleymanakbas91 are you still experiencing this issue?

@Crevil commented Sep 24, 2019 via email

@skriss (Contributor) commented Sep 30, 2019

Closing this out as inactive. Feel free to reach out again as needed.

skriss closed this as completed on Sep 30, 2019
@SameeraGrandhi

I'm trying to create a Velero backup with the CSI feature enabled on my Azure cluster, following the instructions from the documentation.

I observed that the backup actually completed but then failed at the end with the same transport error mentioned in this thread. Here are the log events I observed on the velero pod.

time="2020-06-17T10:59:14Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
time="2020-06-17T10:59:14Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/csi-try4 logSource="pkg/controller/backup_controller.go:273"

Any idea on what went wrong during the backup?
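For digging further, the per-backup log usually has more detail than the server log; as a sketch (backup name taken from the key in the error above):

$ velero backup describe csi-try4 --details
$ velero backup logs csi-try4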

@ghost commented Jun 18, 2020

Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (backup to the storage account, snapshots); only the finalization seems to break.

velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"

Just to be complete:
Velero: 1.4.0
Azure-Plugin for Velero: 1.1.0

@skriss (Contributor) commented Jun 18, 2020

I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren't playing nice together. Let us know if that fixes things!
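If the deployment was created with velero install, the requests/limits can also be set via the install flags; a sketch with illustrative values (other required install flags omitted):

$ velero install \
    --velero-pod-cpu-request 500m --velero-pod-mem-request 256Mi \
    --velero-pod-cpu-limit 1 --velero-pod-mem-limit 512Mi \
    ...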

@ghost commented Jun 18, 2020

Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
Thanks a lot for your fast reply!

@Berndinox commented Jun 22, 2020

> Same for me on Azure AKS with advanced networking.
> Everything seems to proceed as expected (Backup on Storage Account, Snapshots), just the finalization seems to break.
>
> velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
> velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"
>
> Just to be complete:
> Velero: 1.4.0
> Azure-Plugin for Velero: 1.1.0

Same here.
AKS 1.16
Velero 1.4

Confirmed: increasing the limits fixed the issue.

@skriss (Contributor) commented Jun 22, 2020

@vmware-tanzu/velero-maintainers I'm guessing we should lower the value for this setting. I set it at 100MB since that's the max Azure allows, which means Velero will create the minimum number of chunks, but I think it's causing Velero to exceed its default limits regularly.

We could probably drop the chunk size down to something significantly smaller and it wouldn't have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
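If/when the plugin exposes that value, tuning it down might look roughly like this. This assumes a blockSizeInBytes config key on the BackupStorageLocation, which may not exist in the plugin version you are running; check the Azure plugin docs for the actual key name.

$ kubectl -n velero patch backupstoragelocation default --type merge \
    -p '{"spec":{"config":{"blockSizeInBytes":"10485760"}}}'
# 10 MiB blocks instead of 100 MiB (the value is a string in the config map)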

@fatihdestegul

> Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
> Thanks a lot for your fast reply!

This has solved my problem...

Velero 1.5.1
Azure plugin: 1.1.0
