Application controller should be more resilient to network latency or K8s API server hiccups #7692

jannfis · 2021-11-12T12:43:08Z

Summary

On network links with very high latency or low bandwidth, a remote Kubernetes API sometimes drops the connection before a request (e.g. fetching remote resources or the list of available APIs) completes. This currently leads to comparison errors in the application or aborts other operations (e.g. retrieving manifests using argocd app manifests). Effectively, a single such hiccup during an operation can prevent a sync from happening, because the errors are considered fatal.

This can be observed whenever the network link between the Argo CD control plane and the managed cluster is flaky.

Motivation

Improve stability on high latency networks and don't fail operations on the first hiccup.

Proposal

When a request to a remote Kubernetes API endpoint fails with a transient (non-permanent) error, we should retry the request instead of failing the operation immediately.

* add retry logic for k8s client
* add docs for retry logic and envs to manifests

Signed-off-by: Pavel Aborilov <[email protected]>
Signed-off-by: Pavel <[email protected]>

jannfis added the enhancement, component:core, type:scalability, and type:supportability labels Nov 12, 2021

This was referenced Oct 27, 2023

feat: add retry logic for k8s client #7692 aborilov/argo-cd#3 (Open)

feat: add retry logic for k8s client #7692 #16154 (Merged)

alexmt closed this as completed in #16154 Nov 2, 2023
