Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrying failed sync's block newer commits; how to achieve declarative, level based gitops semantics? #11494

Open
3 tasks done
jlewi opened this issue Nov 29, 2022 · 29 comments · May be fixed by #15603
Open
3 tasks done
Labels
bug Something isn't working

Comments

@jlewi
Copy link

jlewi commented Nov 29, 2022

I originally raised this in #11276
I have since observed this behavior repeatedly and so I'm raising it as an issue.

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

ArgoCD will keep retrying the same commit that fails to sync properly even if there is a newer commit that fixes the sync.

To Reproduce

  1. Configure ArgoCD for infinite retries (see Application.yaml provided below)
  2. Add a YAML file containing a non-K8s resource to the repository being applied by Argo (see below).
  3. ArgoCD will try and fail to apply the resource
    • In this particular case it fails to apply the resource because the namespace NAMESPACETHATDOESNOTEXIST doesn't
      exist
    • My application.yaml deliberately sets the namespace to a non-existent namespace because if any of the resources don't explicitly set the namespace I want that to result in an error rather than defaulting to a value set by ArgoCD
  4. Now delete the YAML file containing the invalid resource and create a new commit
  5. ArgoCD keeps trying to apply the previous commit which will never succeed rather than moving onto the new commit.

Expected behavior

I expect ArgoCD's semantics to be level-based, declarative.

Level based means I'd expect ArgoCD keeps retrying a failed sync until either that sync succeeds or there is a newer commit.
For example, suppose I have an application that is creating a custom resource and the CRD hasn't been created yet (and is created by a separate process) or similarly the namespace doesn't exist and is created by separate process. Level based to me implies ArgoCD will keep retrying periodically so that in the event those issues are resolved the application gets created.

Declarative means I expect the desired state of the world to be the latest commit on the branch. Retrying an older commit once there is a newer commit violates that commit. Importantly, my expectation for gitops is that if there is a problem with a configuration I can fix it by pushing a new commit; if broken commits can block newer commits unless terminated then it doesn't seem like I can fix broken configurations just by pushing a new commit.

Do those expectations align with ArgoCD's? If they do, have I somehow misconfigured ArgoCD in order to achieve those semantics?

Screenshots

Version

argocd: v2.5.2+148d8da
  BuildDate: 2022-11-07T17:03:01Z
  GitCommit: 148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7
  GitTreeState: clean
  GoVersion: go1.18.7
  Compiler: gc
  Platform: darwin/arm64

Here's my ArgoCD application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  # TODO(jeremy): What is project?
  project: default
  source:
    repoURL: [email protected]:jlewi/manifests.git
    targetRevision: HEAD
    path: app
    directory:
      # recurse is not true by default
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    # The namespace will only be set for namespace-scoped resources that have not set a value for .metadata.namespace
    # We intentionally set this to a namespace that doesn't exist because if resources
    # don't have .metadata.namespace set that should be an error
    namespace: NAMESPACETHATDOESNTEXIST
  syncPolicy:
    automated: # automated sync by default retries failed attempts 5 times with following delays between attempts ( 5s, 10s, 20s, 40s, 80s ); retry controlled using `retry` field.
      prune: true # Specifies if resources should be pruned during auto-syncing ( false by default ).
      selfHeal: true # Specifies if partial app sync should be executed when resources are changed only in target Kubernetes cluster and no git change detected ( false by default ).
      allowEmpty: true # Allows deleting all application resources during automatic syncing ( false by default ).
    retry:
        limit: -1 # number of failed sync attempt retries; unlimited number of attempts if less than 0
        backoff:
          duration: 5s # the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")
          factor: 2 # a factor to multiply the base duration after each failed retry
          maxDuration: 3m # the maximum amount of time allowed for the backoff strategy

Here's an example of a YAML file that I included in my ArgoCD sync'd repository. It is not a K8s resource so it fails to be applied.

apiVersion: mlp.primer.ai/v1alpha1
kind: ManifestSync
metadata:
  name: app-dev-takeover
  labels: {}
spec:
  sourceRepo:
    org: jlewi
    repo: flock
    branch: dev-takeover
  forkRepo:
    org: jlewi
    repo: manifests
    branch: hydros/dev-takeover
  destRepo:
    org: jlewi
    repo: manifests
    branch: main
  sourcePath: manifests/app
  selector:
    matchLabels:
      app: app
      environment: dev
  destPath: app
status:
  sourceUrl: https://github.com/jlewi/manifests/tree/8a17313
  sourceCommit: 8a17313
@jlewi jlewi added the bug Something isn't working label Nov 29, 2022
@crenshaw-dev
Copy link
Member

My intuition is the same as yours. I feel like limit: -1 should mean "retry indefinitely or until there's a new commit."

@tybook
Copy link

tybook commented Sep 15, 2023

Any update here? Seems like pretty critical functionality.

In my limited testing on v2.8.0 I think the in-progress broken commit's sync does timeout after 10 minutes, but I can't find where this timeout is defined. EDIT: Nevermind, apparently this was a fluke; I can't reproduce.

In any case a fixed timeout is unsatisfactory. Upon a new commit being pushed, the auto-sync logic should immediately terminate any in-progress syncs and start a new one. For now the manual workaround is to click the TERMINATE button in the UI or run argocd app terminate-op <APPLICATION> with the CLI.

@aslafy-z
Copy link
Contributor

aslafy-z commented Sep 23, 2023

I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34
Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again.
This is not ideal but may unlock lot of folks.

@erikschul
Copy link

As mentioned in #15642
my view is that it would be better solved with an algorithm that detects lack of progress combined with a newer commit.
In my case, the lack of progress was obvious, i.e. a rejected apply due to invalid YAML, which should be caught by the algorithm, but I imagine it can be problematic to terminate early when dealing with finalizers and such.

@tybook
Copy link

tybook commented Sep 26, 2023

I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.

Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.

I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the argocd CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runs argocd app terminate-op on them. Combined with auto-sync and the ApplyOutOfSyncOnly=false sync option, I believe this makes for a reasonably safe and responsive workaround.

@Sayrus
Copy link

Sayrus commented Sep 27, 2023

I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.

Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.

I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the argocd CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runs argocd app terminate-op on them. Combined with auto-sync and the ApplyOutOfSyncOnly=false sync option, I believe this makes for a reasonably safe and responsive workaround.

We currently have external logic to terminate sync with the CLI too which is why I didn't continue on my PoC. The timeout can easily be made configurable since we have access to the App Spec within that context.

@jannfis
Copy link
Member

jannfis commented Sep 28, 2023

Related: #15624

@jannfis
Copy link
Member

jannfis commented Sep 28, 2023

My intuition is the same as yours. I feel like limit: -1 should mean "retry indefinitely or until there's a new commit."

Yes, I'd agree. Right now, the same parameters are re-used for any follow up sync operation (i.e. target revision, all source parameters, etc), which makes retries useless in some scenarios. We probably need to perform a refresh before the next retry, and update the operation with any new information (new commit SHA, changed source parameters, etc).

@jannfis
Copy link
Member

jannfis commented Sep 28, 2023

I have a fix for this, which I want to discuss in today's contributor's meeting

@tybook
Copy link

tybook commented Oct 2, 2023 via email

@aslafy-z
Copy link
Contributor

aslafy-z commented Oct 2, 2023

Here is the recording: https://youtu.be/baIX9Bk6f5w?t=1173. Alex was absent to the meeting and we'd like his take on this. What I understand is we'd prefer to Terminate a sync when a new revision is found instead of mutating the current retry with the new revision. This way, the whole sync process is redone, history is consistent and hooks are replayed if they change for example.
I opened a draft PR that does exactly that in here: #15603. I'd love it to work when the sync is in the hook phase too, but I'm unable to locate where in the code this is executed.
@alexec @crenshaw-dev @jannfis Please have a look to the PR 🙏

@mpiercy827
Copy link

@aslafy-z Any update on this issue? We've been encountering this issue recently and it would be nice to have new commits roll out and fix failed syncs 👍

@aslafy-z
Copy link
Contributor

@mpiercy827 I'd love this PR to move forward. I removed the Draft tag and would love some reviews! @alexec @crenshaw-dev @jannfis please have a look! 🙏

@jbartyze-rh
Copy link

Here is sample workaround script for openshift-gitops-operator 1.11 that translates to ArgoCD 2.9.2.

I did not see it anywhere on the internet or in this issue, so leaving it here for someone who is struggling with this problem.

Inject into cronjob with oc/kubectl and you are good to go.

#!/bin/bash

# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.

# Get list of all ArgoCD applications in the '<argocd_namespace>' namespace
applications=$(oc get applications -n <argocd_namespace> -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

for app in $applications; do
  echo "Processing application: $app"

  # Get the currently running sync revision
  current_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.operation.sync.revision}')

  # Get the desired sync revision
  desired_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.status.sync.revision}')

  echo "Current revision for $app: $current_revision"
  echo "Desired revision for $app: $desired_revision"

  # Skip the application if either the current or desired revision is empty
  if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
    echo "Skipping application $app due to missing revision information."
    continue
  fi

  # Compare the two revisions
  if [ "$current_revision" != "$desired_revision" ]; then
    echo "Revision mismatch detected for application: $app. Terminating operation."
    # Terminate the operation for the application
     oc exec <controller_pod_name> -n <argocd_namespace> -- argocd app terminate-op "$app" --core
  else
    echo "No revision mismatch for application: $app. No action needed."
  fi
done

@ghost
Copy link

ghost commented Jan 24, 2024

Great job! Any update?

Thanks!

@sherifabdlnaby
Copy link

I was surprised this is not the behavior in ArgoCD!
Any chance we can get this fixed? how can I help?

@sherifabdlnaby
Copy link

I think ArgoCD need two knobs to better handle Syncing.

  1. A Sync timeout feature so that jobs that are stuck indefinitely can eventually fail. ( Auto Sync terminate #16489 and Timeout for sync operations #6055 )
  2. A Flag to retry the newest version after a Sync has failed (most probably due to timeout) instead of trying the same version regardless of the retry count.

I believe this is the most intuitive approach and it's simpler because a Sync attempt is only attempting a single SHA; and you won't terminate syncs while they're running regardless they're stuck or not (a preSync Job can take a while but you don't really want to kill it because a new commit was pushed). And as a user I have control over the timeouts.

@phyzical
Copy link
Contributor

@sherifabdlnaby wouldn't 2 automagically occur after 1 happens?

As this is what happen if you click cancel manually.

I guess if we did need two paths it would be something like sync-timeout and sync-timeout-condition

where sync-timeout-condition defaults to 'cancel' instead of 'fail' to match current convention when this timeout issue occurs

@blixem777
Copy link

any progress on this?

@jlewi
Copy link
Author

jlewi commented Oct 9, 2024

To summarize this ticket. There is a PR open #15603 that is about 1 year old that IUC implements the approach agreed upon at the community meeting.
#11494 (comment)

It looks the PR is stuck on some failing tests from July of this year
#15603 (comment)
with a suggestion on how to fix the test.

So it seems like the path forward would be for someone to pick up that PR to get the tests to pass and then hopefully the ArgoCD maintainers will approve it.

@jastBytes
Copy link

I've updated the script by @jbartyze-rh (thx!) to work out with my ArgoCD v2.10.7 and skip apps that are in state "Synced":

#!/usr/bin/env bash

set -euo pipefail

# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.

# Get list of all ArgoCD applications in the 'argocd' namespace
applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

for app in $applications; do
  # Get the current status
  current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')

  if [[ "$current_status" == "Synced" ]]; then
    echo "Skipping application $app due to status is synced."
    continue
  fi

  echo "Processing application: $app"

  # Get the currently running sync revision
  current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')

  # Get the desired sync revision
  desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')

  echo "Current revision for $app: $current_revision"
  echo "Desired revision for $app: $desired_revision"

  # Skip the application if either the current or desired revision is empty
  if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
    echo "Skipping application $app due to missing revision information."
    continue
  fi

  # Compare the two revisions
  if [ "$current_revision" != "$desired_revision" ]; then
    echo "Revision mismatch detected for application: $app. Terminating operation."
    # Terminate the operation for the application
    kubectl exec argocd-application-controller-0 -n argocd -- argocd app terminate-op "$app" --core
  else
    echo "No revision mismatch for application: $app. No action needed."
  fi
done

@jastBytes
Copy link

jastBytes commented Oct 18, 2024

Here's another version to have it as a cronjob. All you need is a container image suitable to execute. In my case I've created one having the following Dockerfile:

FROM alpine:3.20.3

ARG ARGOCDCLI_VERSION=v2.12.5
ARG KUBECTL_VERSION=v1.30.5

# Add necessary tools
RUN apk add -u --no-cache curl openssl bash jq

RUN curl -SL https://github.com/argoproj/argo-cd/releases/download/$ARGOCDCLI_VERSION/argocd-linux-amd64 -o /usr/local/bin/argocd && chmod +x /usr/local/bin/argocd
RUN curl -SL https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl

# run as non-root
RUN addgroup -g 1000 -S argocd && adduser -u 1000 -S argocd -G argocd && chown -R argocd:argocd /home/argocd && chmod 0770 /home/argocd
USER argocd

WORKDIR /home/argocd
ENTRYPOINT ["/usr/local/bin/argocd"]

And here's the K8s part using the image above. This must be run in argocd namespace to have the serviceaccount present:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-terminate-operations
spec:
  schedule: "*/3 * * * *" # At every 3rd minute
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: terminate
            image: myfancyregsitry/argocdcli-image:v0.0.1
            command: ["/script/terminate.sh"]
            volumeMounts:
              - name: script
                mountPath: "/script"
          restartPolicy: Never
          automountServiceAccountToken: true
          serviceAccount: argocd-application-controller
          serviceAccountName: argocd-application-controller
          volumes:
            - name: script
              configMap:
                name: argocd-terminate-operations-script
                defaultMode: 0555
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-terminate-operations-script
data:
  terminate.sh: |
    #!/usr/bin/env bash

    set -euo pipefail

    # Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
    # skipping applications where revision information is empty.

    # Get list of all ArgoCD applications in the 'argocd' namespace
    applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

    for app in $applications; do
      # Get the current status
      current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')

      if [[ "$current_status" == "Synced" ]]; then
        echo "Skipping application $app due to synced status."
        continue
      fi

      echo "Processing application: $app"

      # Get the currently running sync revision
      current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')

      # Get the desired sync revision
      desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')

      # Skip the application if either the current or desired revision is empty
      if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
        echo "Skipping application $app due to missing revision information."
        continue
      fi

      # Compare the two revisions
      if [ "$current_revision" != "$desired_revision" ]; then
        echo "Current revision for $app: $current_revision"
        echo "Desired revision for $app: $desired_revision"
        echo "Revision mismatch detected for application: $app. Terminating operation."
        # Terminate the operation for the application
        argocd app terminate-op "$app" --core
      else
        echo "No revision mismatch for application: $app. No action needed."
      fi
    done

@jbartyze-rh
Copy link

One improvement I found after using this script for longer time in my current engagement is changing below.

oc/kubectl get application to oc/kubectl get application.argoproj.io

We faced some issues with overlap of the aliases of different CR, so it is good to be explicit here in API choice.

@andrii-korotkov-verkada
Copy link
Contributor

Configure ArgoCD for infinite retries (see Application.yaml provided below)

Why would you want to retry indefinitely? If some issues persist after multiple retries, they probably need addressing and retry would just keep failing.

I think #20816 may help with setting a sync timeout, which should resolve this even for infinite retries.

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 17, 2024
@jbartyze-rh
Copy link

jbartyze-rh commented Nov 18, 2024

Sharing my use cases, could be different than the Author.

  • waiting for CRD's to being created and sync failing on syncing CR for that particular CRD(created by the operator outside of ArgoCD)

  • Eventual consistency pattern, where we expect some ArgoCD application to fail unless some dependency from other ArgoCD application or external dependency is satisfied.

At scale I imagine we want to avoid restarting the sync manually once we fix the underlying issue or writing automation that will restart the sync process when this automation is already part of ArgoCD.

That implies alerting for applications that are stuck progressing more than X minutes with ArgoCD Notifications or other tools.

  • Zero touch pattern - for customers that do not want to log-in to ArgoCD UI or want to decrease the need to go there. ArgoCD will always try to sync the desired state and if new commit is applied to desired state it will terminate the existing sync and try to apply new one. Some customers want ArgoCD to always try to sync desired state, same as Kubernetes controllers do and not give up.

But in summary:

At scale my current Customer does not want to go back and check every failed application and restart it to fail it again, because something is missing or maybe there is a specific sequence to ArgoCD application sync process that it depends on X application to be synced first. Infinite retries provides the way for ArgoCD to have eventual consistency pattern. The only missing part is movement to latest commit and not being stuck trying to sync old commit if there is a new one available.

@andrii-korotkov-verkada andrii-korotkov-verkada removed the more-information-needed Further information is requested label Nov 18, 2024
@andrii-korotkov-verkada
Copy link
Contributor

IIUC, if your sync fails after retries and new commit is available, refresh would be triggered and with auto-sync enabled the new sync would be triggered. Please, let me know if I'm missing something. If you don't expect new commits for a while, it might be better to orchestrate updates in steps, so that next steps have resources from previous steps ready.

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 25, 2024
@gravufo
Copy link

gravufo commented Nov 25, 2024

What if the fact that another app was synced fixes the sync for this app but it has stopped even trying to sync because it's been too long? Nothing is gonna retrigger it unless you go manually which no one wants to do.
All we want is eventual consistency pattern with exponential backoff like everything else that runs in kubernetes. The pattern is heavily used and understood. How many more use-cases do you need?

@andrii-korotkov-verkada
Copy link
Contributor

ArgoCD seems opinionated in the way that relevant changes are made via Git commits relevant to a given application and dependencies on external stuff aren't really supported well. What makes it impossible to orchestrate external operations to be run before applying Argo changes?

@alita1991
Copy link

Hi, is there an alternative command for terminate-op via kubectl?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.