Backup fails with transport is closing #1856
I believe the gRPC error comes from trying to talk to a plugin. If you add the
looks like the same issue as #481
It's strange because we are not using any plugins. Here are the error logs after I set the log level to debug:
@suleymanakbas91 good to know you don't have any external plugins, but Velero does include a set of default plugins so this is coming from those.
@suleymanakbas91 do you have CPU/mem requests/limits defined for your Velero deployment?
@skriss we don't specify any. You can also check out the Helm chart we use here: https://github.com/kyma-project/kyma/tree/master/resources/backup
We have run into this as well, and I just tried removing the CPU/mem limits for the deployment; it looks much more stable.
@Crevil @suleymanakbas91 are you able to see how much CPU/mem the Velero Pod is using?
Sure thing @prydonius. We run backups every hour, as seen by the spikes.
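For anyone wanting to check the same thing, live usage can be sampled with metrics-server; a minimal sketch, assuming Velero runs in the default velero namespace:

    # Show the Velero pod's current CPU/memory usage (requires metrics-server;
    # assumes the default "velero" namespace)
    kubectl top pod -n velero

    # Sample repeatedly around a scheduled backup to catch the spikes
    watch -n 30 kubectl top pod -n velero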
@Crevil thanks for that! Were you previously using the default requests/limits provided by velero install? Your usage definitely goes over those defaults. Could you tell us a little more about what you're including in your backups (e.g. no. of resources, whether you're using restic or not, and volume sizes if using restic)? Just trying to gauge if it makes sense to increase the default req/limits.
I'll get back to you on Monday with more details.
We were using the limits specified in the following Deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: velero
    spec:
      template:
        spec:
          containers:
          - name: velero
            resources:
              limits:
                cpu: "1"
                memory: 256Mi
              requests:
                cpu: 500m
                memory: 128Mi

We are running on AWS on a self-managed k8s cluster. I inspected one of our backups and we have around 4600 resources and 12 volumes with EC2 snapshots. Let me know if you need anything else.
Really appreciate the info @Crevil. Strange, your backups are smaller than this user's (see #94), though their resource usage was lower. It's difficult to come up with a baseline that works for everyone; our best recommendation would be to monitor resource usage and set appropriate reqs/limits for your environment. Has the Pod remained stable since removing the default reqs/limits? @suleymanakbas91 are you still experiencing this issue?
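If you installed with velero install, the pod requests/limits can also be set from the CLI; a rough sketch, assuming an AWS setup, with flag names worth confirming against velero install --help for your version:

    # Sketch: set Velero pod resources at install time (confirm exact flag names
    # with `velero install --help`; bucket and credentials file are placeholders)
    velero install \
      --provider aws \
      --bucket <your-bucket> \
      --secret-file ./credentials-velero \
      --velero-pod-cpu-request 500m \
      --velero-pod-mem-request 256Mi \
      --velero-pod-cpu-limit 1 \
      --velero-pod-mem-limit 512Mi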
It looks stable now, yes. We'll set up an alert on the consumption to be warned in the future.
Closing this out as inactive. Feel free to reach out again as needed.
I'm trying to create a Velero backup with the CSI feature enabled on my Azure cluster, following the instructions from the documentation. The backup actually completed but then failed at the end with the same transport error mentioned in this thread. Here are the log events I observed on the Velero pod.
Any idea what went wrong during the backup?
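For context, CSI support is feature-gated; enabling it looks roughly like the following, per the CSI docs (plugin image and flags are placeholders worth verifying for your Velero version):

    # Sketch: enable the CSI feature flag at install time (plugin images are
    # placeholders; check the CSI support docs for your version)
    velero install --features=EnableCSI \
      --plugins <object-store-plugin>,<csi-plugin-image>

    # Enable the feature for the velero CLI as well
    velero client config set features=EnableCSI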
Same for me on Azure AKS with advanced networking.

    velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
    velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"

Just to be complete:
I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren't playing nice together. Let us know if that fixes things!
Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
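For anyone wanting to apply a similar bump in place, a minimal sketch, assuming the default namespace and deployment name:

    # Raise the Velero deployment's resources (values mirror the change above;
    # pick requests/limits that fit your environment)
    kubectl -n velero set resources deployment/velero \
      --requests=cpu=500m,memory=512Mi \
      --limits=cpu=2,memory=1Gi

    # Wait for the new pod to roll out
    kubectl -n velero rollout status deployment/velero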
Same here. Confirm: increasing the limits fixed the issue.
@vmware-tanzu/velero-maintainers I'm guessing we should lower the value for this setting. I set it at 100MB since that's the max Azure allows, which means Velero will create the minimum number of chunks, but I think it's causing Velero to exceed its default limits regularly. We could probably drop the chunk size down to something significantly smaller and it wouldn't have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
This has solved my problem... Velero 1.5.1
What steps did you take and what happened:
We have a CI/CD job which takes a backup of the cluster and then restores from the backup. Almost half of the time, the backup ends up with this failure:
Any idea why this happens and is there anything we can do to prevent this?
Anything else you would like to add:
Here is the backup file we use:
We just deploy this file to the cluster using kubectl apply -f.
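The backup file itself isn't shown here; purely as an illustration of the pattern, a hypothetical minimal Backup resource applied this way could look like:

    # Hypothetical example only; not the actual backup file used above
    kubectl apply -f - <<'EOF'
    apiVersion: velero.io/v1
    kind: Backup
    metadata:
      name: example-backup    # made-up name
      namespace: velero       # assumes the default Velero namespace
    spec:
      includedNamespaces:
      - '*'
      ttl: 720h0m0s
    EOF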
Environment:
Velero version (use velero version): 1.0.0
Kubernetes version (use kubectl version): 1.13.9-gke.3
OS (e.g. from /etc/os-release):