Velero : Throttling request errors #3191

Closed
srajput1991 opened this issue Dec 16, 2020 · 5 comments · Fixed by #7311
Labels: Enhancement/User, Good first issue, kind/requirement, Reviewed Q2 2021

srajput1991 commented Dec 16, 2020

[carlisia] Update on 1/13: this became a task to update our server defaults.


Hi,

I am using Velero to take backups of my k8s resources. I am seeing a lot of errors in Datadog from Velero, as below:

1 request.go:621] Throttling request took 1.047035991s, request: GET:https://10.0.0.1:443/apis/autoscaling/v2beta2?timeout=32s

zubron commented Jan 6, 2021

Hi @srajput1991 - Apologies for the delay in getting back to you on this.

These messages are coming from client-go. After looking through some comments on the Kubernetes Slack, it appears that this affects other projects as well. It's not clear whether the API server is slow to respond, causing the client to report that the request exceeded the maximum time (1s), or whether this is the client-side rate limiter telling the client to slow down its rate of requests.

My understanding is that this shouldn't impact the functionality of Velero. We should investigate, though, whether there are better default QPS and Burst settings that we could use in the client config.

See the following settings for the velero server:

--client-burst int      Maximum number of requests by the server to the Kubernetes API in a short period of time. (default 30)
--client-qps float32    Maximum number of requests per second by the server to the Kubernetes API once the burst limit has been reached. (default 20)
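
For context, here is a minimal client-go sketch (an illustration only, not Velero's actual wiring) showing what these two flags ultimately map to on a rest.Config:

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig-based rest.Config (assumes ~/.kube/config exists).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// These two fields are what --client-qps and --client-burst control:
	// client-go wraps them in a token-bucket rate limiter for all requests.
	cfg.QPS = 20.0 // sustained requests per second once the burst is spent
	cfg.Burst = 30 // requests allowed in a short spike before throttling kicks in

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready (QPS=%v, Burst=%v): %T\n", cfg.QPS, cfg.Burst, client)
}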

@nrb self-assigned this Jan 11, 2021

nrb commented Jan 11, 2021

After doing some digging in the code, the underlying struct used is the Go rate limiter from golang.org/x/time/rate and its Wait method.

Here's the key sentence:

If no token is available, Wait blocks until one [token] can be obtained or its associated context.Context is canceled.

So what I believe is happening is that the Wait call is blocking, waiting for a token to be re-added to the bucket, and logging these messages when it takes too long (according to the client-go code, over 1 second).

As @zubron said, the size of the bucket can be increased with the --client-burst value (default 30), and the rate of refilling tokens can be changed with --client-qps (default 20). That said, I think we should look at altering the defaults here, because we added more controllers with more things to watch.

I don't have a good handle on what values to set at the moment. I'm going to do some experimenting, but I think moving up to 50 burst/40 QPS would be a good start.
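
To see that blocking behavior in isolation, here is a small standalone sketch (using golang.org/x/time/rate directly, not client-go's code) with the same 20 QPS / 30 burst shape: the first 30 calls pass immediately, after which each Wait blocks for roughly 50ms while the bucket refills.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// A token bucket holding up to 30 tokens (burst), refilled at
	// 20 tokens per second (QPS) -- the same shape as the defaults above.
	limiter := rate.NewLimiter(rate.Limit(20), 30)

	ctx := context.Background()
	for i := 0; i < 60; i++ {
		start := time.Now()
		// Wait blocks until a token is available or the context is canceled.
		if err := limiter.Wait(ctx); err != nil {
			panic(err)
		}
		// client-go logs "Throttling request took ..." when a wait like this
		// exceeds its threshold; here we just print every non-trivial wait.
		if wait := time.Since(start); wait > time.Millisecond {
			fmt.Printf("request %d waited %v\n", i, wait)
		}
	}
}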


nrb commented Jan 11, 2021

Looking around, Prometheus uses 100/100 and ingress-nginx used to use 1,000,000 for both, though it appears they've since removed any default values.

We don't want Velero to spam the API server, so I think a value of 1,000,000 is too much, but moving up into the order of 100 or so is a reasonable start.


nrb commented Jan 11, 2021

@srajput1991 the following values made the throttling messages go away in a test cluster. Highlighted inline.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  creationTimestamp: "2021-01-11T23:19:08Z"
  generation: 3
  labels:
    component: velero
  name: velero
  namespace: velero
  resourceVersion: "3678"
  selfLink: /apis/apps/v1/namespaces/velero/deployments/velero
  uid: 2ccad69e-f561-4497-98d8-9b4604073580
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      deploy: velero
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        component: velero
        deploy: velero
    spec:
      containers:
      - args:
        - server
        - --client-qps=75.0   # <----- HERE
        - --client-burst=100  # <----- HERE
        - --features=
        command:
        - /velero
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LD_LIBRARY_PATH
          value: /plugins
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /credentials/cloud
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: AZURE_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: ALIBABA_CLOUD_CREDENTIALS_FILE
          value: /credentials/cloud
        image: velero/velero:main
        imagePullPolicy: IfNotPresent
        name: velero
        ports:
        - containerPort: 8085
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
      dnsPolicy: ClusterFirst
      initContainers:
      - image: velero/velero-plugin-for-gcp:v1.1.0
        imagePullPolicy: IfNotPresent
        name: velero-plugin-for-gcp
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: velero
      serviceAccountName: velero
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-01-11T23:19:26Z"
    lastUpdateTime: "2021-01-11T23:19:26Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-01-11T23:19:08Z"
    lastUpdateTime: "2021-01-11T23:21:40Z"
    message: ReplicaSet "velero-6d9c7fc787" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

This is something that you can change in your own deployment, though I think we'll get the defaults updated within Velero's code, too.


tareksha commented Feb 23, 2021

Running into this issue too. I've created vmware-tanzu/helm-charts#222 to let each user conveniently customize the client's QPS and burst from chart values.

@eleanor-millman added the Reviewed Q2 2021 and Good first issue labels May 12, 2021
@ywk253100 assigned ywk253100 and unassigned nrb Jan 12, 2024
ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 12, 2024
Increase the k8s client QPS/burst to avoid throttling request errors

Fixes vmware-tanzu#7127
Fixes vmware-tanzu#3191

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>