feat: add retry logic for k8s client argoproj#7692 (argoproj#16154)
* add retry logic for k8s client

Signed-off-by: Pavel Aborilov <[email protected]>

* add docs for retry logic and envs to manifests

Signed-off-by: Pavel Aborilov <[email protected]>

---------

Signed-off-by: Pavel Aborilov <[email protected]>
Signed-off-by: Pavel <[email protected]>
aborilov committed Apr 29, 2024
1 parent f2ae45b commit 5f88feb
Showing 12 changed files with 327 additions and 52 deletions.
10 changes: 10 additions & 0 deletions docs/operator-manual/argocd-cmd-params-cm.yaml
@@ -58,6 +58,12 @@ data:
controller.sharding.algorithm: legacy
# Number of allowed concurrent kubectl fork/execs. Any value less than 1 means no limit.
controller.kubectl.parallelism.limit: "20"
# The maximum number of retries for each request
controller.k8sclient.retry.max: "0"
# The initial backoff delay on the first retry attempt in ms. Subsequent retries will double this backoff time up to a maximum threshold
controller.k8sclient.retry.base.backoff: "100"
# Grace period in seconds for ignoring consecutive errors while communicating with repo server.
controller.repo.error.grace.period.seconds: "180"

## Server properties
# Listen on given address for incoming connections (default "0.0.0.0")
@@ -75,6 +81,10 @@ data:
# Semicolon-separated list of content types allowed on non-GET requests. Set an empty string to allow all. Be aware
# that allowing content types besides application/json may make your API more vulnerable to CSRF attacks.
server.api.content.types: "application/json"
# The maximum number of retries for each request
server.k8sclient.retry.max: "0"
# The initial backoff delay on the first retry attempt in ms. Subsequent retries will double this backoff time up to a maximum threshold
server.k8sclient.retry.base.backoff: "100"

# Set the logging format. One of: text|json (default "text")
server.log.format: "text"
88 changes: 88 additions & 0 deletions docs/operator-manual/high_availability.md
@@ -229,3 +229,91 @@ spec:
path: my-application
# ...
```

## Rate Limiting Application Reconciliations

To prevent high controller resource usage or sync loops caused by misbehaving apps or other environment-specific factors,
you can configure rate limits on the workqueues used by the application controller. Two types of rate limits can be configured:

* Global rate limits
* Per item rate limits

The final rate limiter uses a combination of both and calculates the final backoff as `max(globalBackoff, perItemBackoff)`.

### Global rate limits

This limiter is enabled by default. It is a simple bucket-based rate limiter that limits the number of items that can be queued per second.
This is useful to prevent a large number of apps from being queued at the same time.

To configure the bucket limiter you can set the following environment variables:

* `WORKQUEUE_BUCKET_SIZE` - The number of items that can be queued in a single burst. Defaults to 500.
* `WORKQUEUE_BUCKET_QPS` - The number of items that can be queued per second. Defaults to 50.
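
To make the interaction between the two settings concrete, here is a minimal Python sketch of a generic token bucket using the defaults above. It is an illustration only, assuming standard token-bucket semantics; the `TokenBucket` class and its `delay` method are hypothetical names and are not the controller's actual Go implementation.

```python
class TokenBucket:
    """Illustrative token bucket: holds up to `burst` tokens, refilled at `qps` per second."""

    def __init__(self, qps: float, burst: int, now: float = 0.0):
        self.qps = qps              # corresponds to WORKQUEUE_BUCKET_QPS
        self.burst = burst          # corresponds to WORKQUEUE_BUCKET_SIZE
        self.tokens = float(burst)  # the bucket starts full
        self.last = now

    def delay(self, now: float) -> float:
        """Reserve one token and return the wait in seconds before the item may be queued."""
        # Refill for the elapsed time, never exceeding the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # Negative balance: wait until the deficit has been refilled.
        return -self.tokens / self.qps


bucket = TokenBucket(qps=50, burst=500)
# Queue 502 items at the same instant.
delays = [bucket.delay(now=0.0) for _ in range(502)]
```

Under these assumptions the first 500 items are admitted immediately, and each item after that waits an additional 1/50 of a second (0.02 s for the 501st, 0.04 s for the 502nd).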

### Per item rate limits

By default, this limiter returns a fixed base delay (backoff) value, but it can be configured to return exponentially increasing values, as described below.
The per item rate limiter limits the number of times a particular item can be queued. It is based on exponential backoff: the backoff time for an item keeps increasing
if the item is queued multiple times in a short period, but the backoff is reset automatically once a configured `cool down` period has elapsed since the last time the item was queued.

To configure the per item limiter you can set the following environment variables:

* `WORKQUEUE_FAILURE_COOLDOWN_NS` : The cool down period in nanoseconds. Once this period has elapsed for an item, its backoff is reset. Exponential backoff is disabled if set to 0 (the default). Example value: 10 * 10^9 (=10s)
* `WORKQUEUE_BASE_DELAY_NS` : The base delay in nanoseconds. This is the initial backoff used in the exponential backoff formula. Defaults to 1000 (=1μs)
* `WORKQUEUE_MAX_DELAY_NS` : The max delay in nanoseconds. This is the upper limit on the backoff. Defaults to 3 * 10^9 (=3s)
* `WORKQUEUE_BACKOFF_FACTOR` : The backoff factor. This is the factor by which the backoff is multiplied for each retry. Defaults to 1.5

The formula used to calculate the backoff time for an item, where `numRequeue` is the number of times the item has been queued
and `lastRequeueTime` is the time at which the item was last queued:

- When `WORKQUEUE_FAILURE_COOLDOWN_NS` != 0 :

```
backoff = time.Since(lastRequeueTime) >= WORKQUEUE_FAILURE_COOLDOWN_NS ?
          WORKQUEUE_BASE_DELAY_NS :
          min(
              WORKQUEUE_MAX_DELAY_NS,
              WORKQUEUE_BASE_DELAY_NS * WORKQUEUE_BACKOFF_FACTOR ^ (numRequeue)
          )
```

- When `WORKQUEUE_FAILURE_COOLDOWN_NS` = 0 :

```
backoff = WORKQUEUE_BASE_DELAY_NS
```
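
The two cases above can be expressed as one function. The following Python sketch is illustrative only: the function and parameter names are hypothetical, the defaults are taken from the list of environment variables above, and it models just the formula, not the bookkeeping the controller does to track `numRequeue` and `lastRequeueTime`.

```python
def per_item_backoff_ns(
    num_requeue: int,
    since_last_requeue_ns: int,
    cooldown_ns: int = 0,                # WORKQUEUE_FAILURE_COOLDOWN_NS
    base_delay_ns: int = 1_000,          # WORKQUEUE_BASE_DELAY_NS
    max_delay_ns: int = 3_000_000_000,   # WORKQUEUE_MAX_DELAY_NS
    backoff_factor: float = 1.5,         # WORKQUEUE_BACKOFF_FACTOR
) -> float:
    """Return the per item backoff in nanoseconds, following the documented formula."""
    # Exponential backoff is disabled: always use the fixed base delay.
    if cooldown_ns == 0:
        return base_delay_ns
    # The cool down period has elapsed: fall back to the base delay.
    if since_last_requeue_ns >= cooldown_ns:
        return base_delay_ns
    # Otherwise grow exponentially with the requeue count, capped at the max delay.
    return min(max_delay_ns, base_delay_ns * backoff_factor ** num_requeue)


# Example: 10s cool down, item requeued 3 times, last requeue 1s ago.
example_ns = per_item_backoff_ns(3, 1_000_000_000, cooldown_ns=10_000_000_000)
```

With the default base delay and factor, `example_ns` is 1000 * 1.5^3 = 3375 ns. With the default cool down of 0, the same call returns the base delay of 1000 ns regardless of the requeue count.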

## HTTP Request Retry Strategy

In scenarios where network instability or transient server errors occur, the retry strategy ensures the robustness of HTTP communication by automatically resending failed requests. It uses a combination of maximum retries and backoff intervals to prevent overwhelming the server or thrashing the network.

### Configuring Retries

The retry logic can be fine-tuned with the following environment variables:

* `ARGOCD_K8SCLIENT_RETRY_MAX` - The maximum number of retries for each request. The request is dropped once this count is reached. Defaults to 0 (no retries).
* `ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF` - The initial backoff delay, in milliseconds, applied on the first retry attempt. Each subsequent retry doubles this backoff time, up to a maximum threshold. Defaults to 100 (=100ms).

### Backoff Strategy

The backoff strategy employed is a simple exponential backoff without jitter. The backoff time increases exponentially with each retry attempt until a maximum backoff duration is reached.

The formula for calculating the backoff time is:

```
backoff = min(retryWaitMax, baseRetryBackoff * (2 ^ retryAttempt))
```

Where `retryAttempt` starts at 0 and increments by 1 for each subsequent retry.

### Maximum Wait Time

There is a cap on the backoff time to prevent excessive wait times between retries. This cap is defined by:

`retryWaitMax` - The maximum duration to wait before retrying. This ensures that retries happen within a reasonable timeframe. Defaults to 10 seconds.
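
As a worked example of the backoff formula and its cap, the Python sketch below computes the wait before each retry using the documented defaults (100 ms base backoff, 10 s maximum). The function name and parameter names are hypothetical and chosen for illustration; this is not the client's actual code.

```python
def retry_backoff_ms(
    retry_attempt: int,
    base_backoff_ms: int = 100,       # ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
    retry_wait_max_ms: int = 10_000,  # retryWaitMax (10 seconds)
) -> int:
    """Exponential backoff without jitter; retry_attempt starts at 0."""
    return min(retry_wait_max_ms, base_backoff_ms * 2 ** retry_attempt)


# Wait before each of the first eight retries, in milliseconds.
schedule = [retry_backoff_ms(attempt) for attempt in range(8)]
# schedule == [100, 200, 400, 800, 1600, 3200, 6400, 10000]
```

The eighth retry would wait 12800 ms by the uncapped formula, so the 10 s cap applies from that attempt onward. For instance, with `ARGOCD_K8SCLIENT_RETRY_MAX` set to 3 and the default base backoff, the total time spent waiting between attempts is at most 100 + 200 + 400 = 700 ms.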

### Non-Retriable Conditions

Not all HTTP responses are eligible for retries. The following conditions will not trigger a retry:

* Responses with a status code indicating client errors (4xx) except for 429 Too Many Requests.
* Responses with the status code 501 Not Implemented.
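
The two rules above can be sketched as a small predicate. This Python example is an illustration with a hypothetical function name. It encodes only the exclusions listed here and assumes, without confirmation from this document, that the remaining 5xx responses are the ones treated as retriable and that successful (2xx) or redirect (3xx) responses are never retried.

```python
def is_retriable_status(status_code: int) -> bool:
    """Return True if a response with this HTTP status may be retried (illustrative)."""
    # 429 Too Many Requests is the one client error that is retried.
    if status_code == 429:
        return True
    # All other client errors (4xx) are not retried.
    if 400 <= status_code < 500:
        return False
    # 501 Not Implemented is not retried.
    if status_code == 501:
        return False
    # Assumption: any other server error (5xx) is retried; everything else is not.
    return 500 <= status_code < 600
```

For example, `is_retriable_status(503)` and `is_retriable_status(429)` return `True`, while `is_retriable_status(404)` and `is_retriable_status(501)` return `False`.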
@@ -161,6 +161,18 @@ spec:
      name: argocd-cmd-params-cm
      key: controller.ignore.normalizer.jq.timeout
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      name: argocd-cmd-params-cm
      key: controller.k8sclient.retry.max
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      name: argocd-cmd-params-cm
      key: controller.k8sclient.retry.base.backoff
      optional: true
image: quay.io/argoproj/argocd:latest
imagePullPolicy: Always
name: argocd-application-controller
12 changes: 12 additions & 0 deletions manifests/base/server/argocd-server-deployment.yaml
@@ -233,6 +233,18 @@ spec:
      name: argocd-cmd-params-cm
      key: server.api.content.types
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      name: argocd-cmd-params-cm
      key: server.k8sclient.retry.max
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      name: argocd-cmd-params-cm
      key: server.k8sclient.retry.base.backoff
      optional: true
volumeMounts:
- name: ssh-known-hosts
  mountPath: /app/config/ssh
12 changes: 12 additions & 0 deletions manifests/core-install.yaml
@@ -19451,6 +19451,18 @@ spec:
      key: controller.ignore.normalizer.jq.timeout
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
name: argocd-application-controller
52 changes: 6 additions & 46 deletions manifests/ha/base/redis-ha/chart/upstream.yaml
@@ -1080,13 +1080,7 @@ spec:
args:
- /readonly/haproxy_init.sh
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
null
volumeMounts:
- name: config-volume
mountPath: /readonly
@@ -1098,13 +1092,7 @@ spec:
image: haproxy:2.6.14-alpine
imagePullPolicy: IfNotPresent
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
null
livenessProbe:
httpGet:
path: /healthz
@@ -1200,14 +1188,7 @@ spec:
args:
- /readonly-config/init.sh
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
null
env:
- name: SENTINEL_ID_0
value: 3c0d9c0320bb34888c2df5757c718ce6ca992ce6
@@ -1232,14 +1213,7 @@ spec:
args:
- /data/conf/redis.conf
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
null
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 15
@@ -1289,14 +1263,7 @@ spec:
args:
- /data/conf/sentinel.conf
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
null
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 15
@@ -1340,14 +1307,7 @@ spec:
args:
- /readonly-config/fix-split-brain.sh
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
null
env:
- name: SENTINEL_ID_0
value: 3c0d9c0320bb34888c2df5757c718ce6ca992ce6
24 changes: 24 additions & 0 deletions manifests/ha/install.yaml
@@ -20995,6 +20995,18 @@ spec:
      key: server.api.content.types
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
livenessProbe:
@@ -21247,6 +21259,18 @@ spec:
      key: controller.ignore.normalizer.jq.timeout
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
name: argocd-application-controller
24 changes: 24 additions & 0 deletions manifests/ha/namespace-install.yaml
@@ -2507,6 +2507,18 @@ spec:
      key: server.api.content.types
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
livenessProbe:
@@ -2759,6 +2771,18 @@ spec:
      key: controller.ignore.normalizer.jq.timeout
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
name: argocd-application-controller
24 changes: 24 additions & 0 deletions manifests/install.yaml
@@ -20050,6 +20050,18 @@ spec:
      key: server.api.content.types
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: server.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
livenessProbe:
@@ -20302,6 +20314,18 @@ spec:
      key: controller.ignore.normalizer.jq.timeout
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.max
      name: argocd-cmd-params-cm
      optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
  valueFrom:
    configMapKeyRef:
      key: controller.k8sclient.retry.base.backoff
      name: argocd-cmd-params-cm
      optional: true
image: quay.io/argoproj/argocd:v2.8.17
imagePullPolicy: Always
name: argocd-application-controller