Velero : Throttling request errors #3191

Closed
srajput1991 opened this issue Dec 16, 2020 · 5 comments · Fixed by #7311
Labels: Enhancement/User, Good first issue, kind/requirement, Reviewed Q2 2021

srajput1991 commented Dec 16, 2020

[carlisia] Update on 1/13: this became a task to update our server defaults.


Hi,

I am using Velero to take backups of my k8s resources. I am seeing a lot of errors in Datadog from Velero, as below:

1 request.go:621] Throttling request took 1.047035991s, request: GET:https://10.0.0.1:443/apis/autoscaling/v2beta2?timeout=32s

zubron commented Jan 6, 2021

Hi @srajput1991 - Apologies for the delay in getting back to you on this.

These messages are coming from client-go. After looking through some comments on the Kubernetes Slack, it appears that this affects other projects as well. It's not clear whether the API server is slow to respond, causing the client to report that the request exceeded the maximum time (1s), or whether this is the client-side rate limiter telling the client to slow down its rate of requests.

My understanding is that this shouldn't impact the functionality of Velero. We should investigate, though, whether there are better default QPS and Burst settings that we could use in the client config.

See the following settings for the velero server:

--client-burst int      Maximum number of requests by the server to the Kubernetes API in a short period of time. (default 30)
--client-qps float32    Maximum number of requests per second by the server to the Kubernetes API once the burst limit has been reached. (default 20)
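
For context, here is a minimal client-go sketch (an illustration only, not Velero's actual wiring) showing what these two flags ultimately map to on a rest.Config:

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig-based rest.Config (assumes ~/.kube/config exists).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// These two fields are what --client-qps and --client-burst control:
	// client-go wraps them in a token-bucket rate limiter for all requests.
	cfg.QPS = 20.0 // sustained requests per second once the burst is spent
	cfg.Burst = 30 // requests allowed in a short spike before throttling kicks in

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready (QPS=%v, Burst=%v): %T\n", cfg.QPS, cfg.Burst, client)
}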

@nrb self-assigned this Jan 11, 2021

nrb commented Jan 11, 2021

After doing some digging in the code, the underlying struct used is the Go rate limiter from golang.org/x/time/rate and its Wait method.

Here's the key sentence:

If no token is available, Wait blocks until one [token] can be obtained or its associated context.Context is canceled.

So what I believe is happening is that the Wait call is blocking, waiting for a token to be re-added to the bucket, and logging these messages when it takes too long (according to the client-go code, over 1 second).

As @zubron said, the size of the bucket can be increased with the --client-burst value (default 30), and the rate of refilling tokens can be changed with --client-qps (default 20). That said, I think we should look at altering the defaults here, because we added more controllers with more things to watch.

I don't have a good handle on what values to set at the moment. I'm going to do some experimenting, but I think moving up to 50 burst/40 QPS would be a good start.
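
To see that blocking behavior in isolation, here is a small standalone sketch (using golang.org/x/time/rate directly, not client-go's code) with the same 20 QPS / 30 burst shape: the first 30 calls pass immediately, after which each Wait blocks for roughly 50ms while the bucket refills.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// A token bucket holding up to 30 tokens (burst), refilled at
	// 20 tokens per second (QPS) -- the same shape as the defaults above.
	limiter := rate.NewLimiter(rate.Limit(20), 30)

	ctx := context.Background()
	for i := 0; i < 60; i++ {
		start := time.Now()
		// Wait blocks until a token is available or the context is canceled.
		if err := limiter.Wait(ctx); err != nil {
			panic(err)
		}
		// client-go logs "Throttling request took ..." when a wait like this
		// exceeds its threshold; here we just print every non-trivial wait.
		if wait := time.Since(start); wait > time.Millisecond {
			fmt.Printf("request %d waited %v\n", i, wait)
		}
	}
}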


nrb commented Jan 11, 2021

Looking around, Prometheus uses 100/100 and ingress-nginx used to use 1,000,000 for both, though it appears they've since removed any default values.

We don't want Velero to spam the API server, so I think a value of 1,000,000 is too much, but moving up into the order of 100 or so is a reasonable start.


nrb commented Jan 11, 2021

@srajput1991 the following values made the throttling messages go away in a test cluster. Highlighted inline.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  creationTimestamp: "2021-01-11T23:19:08Z"
  generation: 3
  labels:
    component: velero
  name: velero
  namespace: velero
  resourceVersion: "3678"
  selfLink: /apis/apps/v1/namespaces/velero/deployments/velero
  uid: 2ccad69e-f561-4497-98d8-9b4604073580
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      deploy: velero
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        component: velero
        deploy: velero
    spec:
      containers:
      - args:
        - server
        - --client-qps=75.0   # <----- HERE
        - --client-burst=100  # <----- HERE
        - --features=
        command:
        - /velero
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LD_LIBRARY_PATH
          value: /plugins
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /credentials/cloud
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: AZURE_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: ALIBABA_CLOUD_CREDENTIALS_FILE
          value: /credentials/cloud
        image: velero/velero:main
        imagePullPolicy: IfNotPresent
        name: velero
        ports:
        - containerPort: 8085
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
      dnsPolicy: ClusterFirst
      initContainers:
      - image: velero/velero-plugin-for-gcp:v1.1.0
        imagePullPolicy: IfNotPresent
        name: velero-plugin-for-gcp
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: velero
      serviceAccountName: velero
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-01-11T23:19:26Z"
    lastUpdateTime: "2021-01-11T23:19:26Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-01-11T23:19:08Z"
    lastUpdateTime: "2021-01-11T23:21:40Z"
    message: ReplicaSet "velero-6d9c7fc787" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

This is something that you can change in your own deployment, though I think we'll get the defaults updated within Velero's code, too.


tareksha commented Feb 23, 2021

Running into this issue too. I've created vmware-tanzu/helm-charts#222 to let each user conveniently customize the client's QPS and burst from chart values.

@eleanor-millman added the Reviewed Q2 2021 and Good first issue labels May 12, 2021
@ywk253100 assigned ywk253100 and unassigned nrb Jan 12, 2024
ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 12, 2024
Increase the k8s client QPS/burst to avoid throttling request errors

Fixes vmware-tanzu#7127
Fixes vmware-tanzu#3191

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>