Azure disk snapshot 429 ratelimit with velero-plugin-for-microsoft-azure #7393

Open
behroozam opened this issue Feb 6, 2024 · 8 comments

behroozam commented Feb 6, 2024

We have a couple of AKS clusters in Azure, and we enabled the CSI feature to snapshot/back up PVCs.
It works fine for the clusters with fewer PVCs, but for a cluster with 65 PVCs Azure starts returning a 429 response code.

The issue is that whenever the snapshotter tries to take a snapshot or fetch an existing backup, it uses the Azure API to list the storage account keys, and this exhausts the Azure API rate limit.

This is the error message:

 failed with storage.FileSharesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code=\\\"TooManyRequests\\\" Message=\\\"The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits\\\", accountName: \\\"<storageaccountname>\\\"\"" backup=<veleronamespace>/<veleroPod>cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:259" pluginName=velero-plugin-for-csi

And this is the Azure activity on the storage account that is currently being throttled:
[Screenshot 2024-02-06 105147]

A possible fix for the issue: cache the storage account key instead of listing the keys for every request.
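
To make the caching idea concrete, here is a minimal sketch in Python. It is not taken from Velero, the CSI plugin, or the Azure SDK: `StorageKeyCache`, `fetch_key`, and the 300-second default TTL are hypothetical names and values chosen only for illustration, and `fetch_key` stands in for whatever call actually lists the storage account keys.

```python
import time
from typing import Callable, Dict, Tuple


class StorageKeyCache:
    """Caches one key per storage account for a fixed time-to-live."""

    def __init__(
        self,
        fetch_key: Callable[[str], str],  # hypothetical wrapper around the "list keys" call
        ttl_seconds: float = 300.0,       # assumed value, not a documented limit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_key = fetch_key
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, account: str) -> str:
        now = self._clock()
        cached = self._entries.get(account)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh: no API call is made
        key = self._fetch_key(account)  # miss or expired entry: exactly one API call
        self._entries[account] = (key, now)
        return key
```

With something like this in place, repeated lookups for the same storage account within the TTL would result in a single key-listing request instead of one per PVC.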

  • Velero version: 1.13.0
  • Velero features: EnableCSI
  • Kubernetes version: v1.27.3
  • Cloud provider or hardware configuration: Azure AKS
  • OS: Azure Linux
@Lyndon-Li
Contributor

It looks like it was the external-snapshotter that called the Azure API. If so, it won't help even if Velero caches the storage account key. Did I miss anything?

@behroozam
Author

> It looks like it was the external-snapshotter that called the Azure API. If so, it won't help even if Velero caches the storage account key. Did I miss anything?

We are using a Velero VolumeSnapshotClass:

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: disk.csi.azure.com
kind: VolumeSnapshotClass
metadata:
  generation: 1
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: velero-csi-disk-volume-snapshot-class

@Lyndon-Li
Contributor

This VolumeSnapshotClass is labeled for Velero with velero.io/csi-volumesnapshot-class, but that doesn't mean Velero drives the snapshot creation. The components that actually take the snapshot are still the external-snapshotter and the Azure Disk CSI driver (disk.csi.azure.com).

@behroozam
Author

Thank you for your reply, @Lyndon-Li.
Perhaps we could add a delay option between the snapshots of each PVC, given that Velero is the controller that triggers the external-snapshotter.
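
As a rough illustration of what such a delay option could look like, here is a small Python sketch. It is only an assumption about the shape of the idea, not an existing Velero feature: `snapshot_with_delay`, `take_snapshot`, and `delay_seconds` are hypothetical names, and `take_snapshot` stands in for whatever actually creates the VolumeSnapshot for a PVC.

```python
import time
from typing import Callable, Iterable, List


def snapshot_with_delay(
    pvc_names: Iterable[str],
    take_snapshot: Callable[[str], None],  # hypothetical: triggers one snapshot
    delay_seconds: float,                  # hypothetical option name
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call take_snapshot for each PVC, pausing between consecutive calls."""
    done: List[str] = []
    for index, name in enumerate(pvc_names):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        take_snapshot(name)
        done.append(name)
    return done
```

Spacing the requests this way spreads them over a longer period, which may keep the number of calls inside one throttling window lower, at the cost of a longer total backup time.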

@ywk253100
Contributor

You can set useAAD=true in the BSL config to avoid calling the list storage account keys API in the Velero Azure plugin. Search for useAAD in https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure for more information.

@behroozam
Author

behroozam commented Feb 19, 2024

I've tried both useAAD and storageAccountKeyEnvVar as a fallback, together and individually.
It seems that the snapshotter ignores this configuration and still tries to fetch the existing snapshots by listing the access keys on the backup storage account.
I'm also getting 429 error messages on the AKS cluster's existing storage account for the current PVCs:

Read_ObservationWindow_00:05:00

This is fairly similar to this issue on Kubernetes, and also to this one on the OpenShift platform.

@ywk253100
Contributor

The useAAD setting doesn't affect the behavior of the snapshotter. It only affects the Velero Azure plugin, which also lists the storage account access keys if useAAD=false. I thought that reducing the requests made by the Velero Azure plugin would mitigate the throttling issue, but it seems it doesn't. That makes sense, because the Velero Azure plugin and the snapshotter use two different credentials.

Could you run the velero debug command and provide us with the debug bundle?

As @Lyndon-Li said, this may be an issue in the snapshotter and the Azure CSI driver, in which case there is nothing we can do on the Velero side. But let's gather the debug bundle and check again.

@anshulahuja98
Collaborator

anshulahuja98 commented Jul 4, 2024

If the CSI flow is being used, this might be related to #7978.
