Retrying failed syncs blocks newer commits; how to achieve declarative, level-based GitOps semantics? #11494
Comments
My intuition is the same as yours. I feel like |
Any update here? Seems like pretty critical functionality.
In any case a fixed timeout is unsatisfactory. Upon a new commit being pushed, the auto-sync logic should immediately terminate any in-progress syncs and start a new one. For now the manual workaround is to click the |
I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 |
As mentioned in #15642 |
Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal. I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the |
We currently have external logic to terminate sync with the CLI too which is why I didn't continue on my PoC. The timeout can easily be made configurable since we have access to the App Spec within that context. |
Related: #15624 |
Yes, I'd agree. Right now, the same parameters are re-used for any follow up sync operation (i.e. target revision, all source parameters, etc), which makes retries useless in some scenarios. We probably need to perform a refresh before the next retry, and update the operation with any new information (new commit SHA, changed source parameters, etc). |
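As a rough illustration of the "refresh before the next retry" half of this idea, done from outside the controller: the sketch below asks Argo CD to refresh an Application by setting the `argocd.argoproj.io/refresh` annotation. The function name `request_refresh` and the default namespace `argocd` are assumptions for illustration. Note that this only triggers a refresh; it does not update the parameters of an operation that is already retrying, which is the part that would need a change in Argo CD itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Argo CD): request a refresh of one Application.
# Usage: request_refresh APP [NAMESPACE]   (NAMESPACE defaults to "argocd", an assumption)
request_refresh() {
  app="$1"
  namespace="${2:-argocd}"
  # The controller removes the annotation again once the refresh has been handled.
  kubectl annotate applications.argoproj.io "$app" -n "$namespace" \
    argocd.argoproj.io/refresh=normal --overwrite
}
```

For example, `request_refresh guestbook` would annotate an Application named `guestbook` (a made-up name) in the `argocd` namespace.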
I have a fix for this, which I want to discuss in today's contributor's meeting |
Glad to hear you have an idea. Did anything come out of the meeting?
|
Here is the recording: https://youtu.be/baIX9Bk6f5w?t=1173. Alex was absent from the meeting and we'd like his take on this. What I understand is that we'd prefer to terminate a sync when a new revision is found instead of mutating the current retry with the new revision. This way, the whole sync process is redone, history is consistent, and hooks are replayed if they change, for example. |
@aslafy-z Any update on this issue? We've been encountering this issue recently and it would be nice to have new commits roll out and fix failed syncs 👍 |
@mpiercy827 I'd love this PR to move forward. I removed the Draft tag and would love some reviews! @alexec @crenshaw-dev @jannfis please have a look! 🙏 |
Here is sample workaround script for openshift-gitops-operator 1.11 that translates to ArgoCD 2.9.2. I did not see it anywhere on the internet or in this issue, so leaving it here for someone who is struggling with this problem. Inject into cronjob with oc/kubectl and you are good to go.
|
Great job! Any update? Thanks! |
I was surprised this is not the behavior in ArgoCD! |
I think ArgoCD needs two knobs to better handle syncing.
I believe this is the most intuitive approach, and it's simpler because a sync attempt only ever targets a single SHA; and you won't terminate syncs while they're running, regardless of whether they're stuck or not (a PreSync Job can take a while, but you don't really want to kill it because a new commit was pushed). And as a user I have control over the timeouts. |
@sherifabdlnaby wouldn't 2 automagically occur after 1 happens? That is what happens if you click cancel manually. I guess if we did need two paths it would be something like where |
any progress on this? |
To summarize this ticket: there is an open PR, #15603, that is about 1 year old and that, IIUC, implements the approach agreed upon at the community meeting. It looks like the PR is stuck on some failing tests from July of this year. So it seems the path forward would be for someone to pick up that PR and get the tests to pass, and then hopefully the ArgoCD maintainers will approve it. |
I've updated the script by @jbartyze-rh (thx!) to work with my ArgoCD v2.10.7 and skip apps that are in state "Synced":

#!/usr/bin/env bash
set -euo pipefail

# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.

# Get list of all ArgoCD applications in the 'argocd' namespace
applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

for app in $applications; do
  # Get the current status
  current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')
  if [[ "$current_status" == "Synced" ]]; then
    echo "Skipping application $app due to status is synced."
    continue
  fi

  echo "Processing application: $app"

  # Get the currently running sync revision
  current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')
  # Get the desired sync revision
  desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')

  echo "Current revision for $app: $current_revision"
  echo "Desired revision for $app: $desired_revision"

  # Skip the application if either the current or desired revision is empty
  if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
    echo "Skipping application $app due to missing revision information."
    continue
  fi

  # Compare the two revisions
  if [ "$current_revision" != "$desired_revision" ]; then
    echo "Revision mismatch detected for application: $app. Terminating operation."
    # Terminate the operation for the application
    kubectl exec argocd-application-controller-0 -n argocd -- argocd app terminate-op "$app" --core
  else
    echo "No revision mismatch for application: $app. No action needed."
  fi
done
|
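One thing to watch for with the script above: it reads the plural `revisions` fields and skips an app when they are empty. As far as I know, the Application resource fills `revision` (singular) for single-source apps and `revisions` only for multi-source ones, so single-source apps may always be skipped. A minimal sketch of a fallback, assuming that field layout; the helper names `pick_revision` and `app_revision` are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, assuming single-source apps populate ".revision"
# and multi-source apps populate ".revisions".

# pick_revision SINGLE MULTI: prefer the multi-source value, else the single one.
pick_revision() {
  single="$1"
  multi="$2"
  if [ -n "$multi" ]; then
    printf '%s\n' "$multi"
  else
    printf '%s\n' "$single"
  fi
}

# app_revision APP PREFIX: PREFIX is ".operation.sync" (running sync)
# or ".status.sync" (desired revision), as in the script above.
app_revision() {
  app="$1"
  prefix="$2"
  single="$(kubectl get applications.argoproj.io "$app" -n argocd -o jsonpath="{${prefix}.revision}")"
  multi="$(kubectl get applications.argoproj.io "$app" -n argocd -o jsonpath="{${prefix}.revisions}")"
  pick_revision "$single" "$multi"
}
```

In the loop, `current_revision=$(app_revision "$app" .operation.sync)` and `desired_revision=$(app_revision "$app" .status.sync)` would then replace the two direct `kubectl get` calls.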
Here's another version to run it as a CronJob. All you need is a container image suitable for executing the script. In my case I've created one with the following Dockerfile:

FROM alpine:3.20.3
ARG ARGOCDCLI_VERSION=v2.12.5
ARG KUBECTL_VERSION=v1.30.5
# Add necessary tools
RUN apk add -u --no-cache curl openssl bash jq
RUN curl -SL https://github.com/argoproj/argo-cd/releases/download/$ARGOCDCLI_VERSION/argocd-linux-amd64 -o /usr/local/bin/argocd && chmod +x /usr/local/bin/argocd
RUN curl -SL https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl
# run as non-root
RUN addgroup -g 1000 -S argocd && adduser -u 1000 -S argocd -G argocd && chown -R argocd:argocd /home/argocd && chmod 0770 /home/argocd
USER argocd
WORKDIR /home/argocd
ENTRYPOINT ["/usr/local/bin/argocd"]

And here's the K8s part using the image above. This must be run in the argocd namespace so that the serviceaccount is present:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-terminate-operations
spec:
  schedule: "*/3 * * * *" # At every 3rd minute
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: terminate
              image: myfancyregsitry/argocdcli-image:v0.0.1
              command: ["/script/terminate.sh"]
              volumeMounts:
                - name: script
                  mountPath: "/script"
          restartPolicy: Never
          automountServiceAccountToken: true
          serviceAccount: argocd-application-controller
          serviceAccountName: argocd-application-controller
          volumes:
            - name: script
              configMap:
                name: argocd-terminate-operations-script
                defaultMode: 0555
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-terminate-operations-script
data:
  terminate.sh: |
    #!/usr/bin/env bash
    set -euo pipefail
    # Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
    # skipping applications where revision information is empty.
    # Get list of all ArgoCD applications in the 'argocd' namespace
    applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')
    for app in $applications; do
      # Get the current status
      current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')
      if [[ "$current_status" == "Synced" ]]; then
        echo "Skipping application $app due to synced status."
        continue
      fi
      echo "Processing application: $app"
      # Get the currently running sync revision
      current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')
      # Get the desired sync revision
      desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')
      # Skip the application if either the current or desired revision is empty
      if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
        echo "Skipping application $app due to missing revision information."
        continue
      fi
      # Compare the two revisions
      if [ "$current_revision" != "$desired_revision" ]; then
        echo "Current revision for $app: $current_revision"
        echo "Desired revision for $app: $desired_revision"
        echo "Revision mismatch detected for application: $app. Terminating operation."
        # Terminate the operation for the application
        argocd app terminate-op "$app" --core
      else
        echo "No revision mismatch for application: $app. No action needed."
      fi
    done
|
One improvement I found after using this script for a longer time in my current engagement is changing "oc/kubectl get application" to "oc/kubectl get application.argoproj.io". We faced some issues with overlapping aliases of different CRs, so it is good to be explicit about the API group here. |
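A minimal sketch of what that looks like for the listing step, using the fully qualified resource name so that another CRD with an `application` alias cannot be picked up by mistake. It also swaps the `jq` filter for kubectl's own JSONPath filter, which is a substitution on my part rather than what the scripts above do. The function name `list_running_apps` and the default namespace are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of Applications whose operation is Running.
# Usage: list_running_apps [NAMESPACE]   (NAMESPACE defaults to "argocd", an assumption)
list_running_apps() {
  namespace="${1:-argocd}"
  # "applications.argoproj.io" pins the lookup to Argo CD's API group.
  kubectl get applications.argoproj.io -n "$namespace" \
    -o jsonpath='{range .items[?(@.status.operationState.phase=="Running")]}{.metadata.name}{"\n"}{end}'
}
```

The same qualified name can be used in the per-app `kubectl get application "$app"` calls further down in the script.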
Why would you want to retry indefinitely? If some issues persist after multiple retries, they probably need addressing and retry would just keep failing. I think #20816 may help with setting a sync timeout, which should resolve this even for infinite retries. |
Sharing my use cases, which could be different from the author's.
At scale I imagine we want to avoid restarting the sync manually once we fix the underlying issue, or writing automation that restarts the sync process when that automation is already part of ArgoCD. That implies alerting on applications that are stuck progressing for more than X minutes, with ArgoCD Notifications or other tools.
But in summary: at scale, my current customer does not want to go back and check every failed application and restart it only to have it fail again, because something is missing or because the sync depends on application X being synced first. Infinite retries give ArgoCD an eventual-consistency pattern. The only missing part is moving to the latest commit rather than being stuck trying to sync an old commit when a new one is available. |
IIUC, if your sync fails after retries and a new commit is available, a refresh would be triggered and, with auto-sync enabled, a new sync would be triggered. Please let me know if I'm missing something. If you don't expect new commits for a while, it might be better to orchestrate updates in steps, so that later steps have the resources from previous steps ready. |
What if the fact that another app was synced fixes the sync for this app but it has stopped even trying to sync because it's been too long? Nothing is gonna retrigger it unless you go manually which no one wants to do. |
ArgoCD seems opinionated in the way that relevant changes are made via Git commits relevant to a given application and dependencies on external stuff aren't really supported well. What makes it impossible to orchestrate external operations to be run before applying Argo changes? |
Hi, is there an alternative command for terminate-op via kubectl? |
I originally raised this in #11276
I have since observed this behavior repeatedly and so I'm raising it as an issue.
Checklist:
argocd version
Describe the bug
ArgoCD will keep retrying the same commit that fails to sync properly even if there is a newer commit that fixes the sync.
To Reproduce
NAMESPACETHATDOESNOTEXIST
doesn'texist
Expected behavior
I expect ArgoCD's semantics to be level-based, declarative.
Level based means I'd expect ArgoCD keeps retrying a failed sync until either that sync succeeds or there is a newer commit.
For example, suppose I have an application that is creating a custom resource and the CRD hasn't been created yet (and is created by a separate process) or similarly the namespace doesn't exist and is created by separate process. Level based to me implies ArgoCD will keep retrying periodically so that in the event those issues are resolved the application gets created.
Declarative means I expect the desired state of the world to be the latest commit on the branch. Retrying an older commit once there is a newer commit violates that expectation. Importantly, my expectation for GitOps is that if there is a problem with a configuration I can fix it by pushing a new commit; if broken commits can block newer commits unless terminated, then it doesn't seem like I can fix broken configurations just by pushing a new commit.
Do those expectations align with ArgoCD's? If they do, have I somehow misconfigured ArgoCD in order to achieve those semantics?
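To make the expectation concrete, here is a small sketch of the decision rule I have in mind, written as a plain shell function. It is not how ArgoCD is implemented; the function name `next_action` and the action labels are made up, and only the phase names (`Running`, `Failed`, `Error`, `Succeeded`) are taken from ArgoCD's operation state.

```shell
#!/usr/bin/env bash
# Hypothetical decision rule for level-based, declarative syncing.
# Usage: next_action PHASE OPERATION_REVISION LATEST_REVISION
next_action() {
  phase="$1"
  op_rev="$2"
  latest_rev="$3"
  # Declarative: a newer commit always supersedes whatever is being synced or retried.
  if [ -n "$latest_rev" ] && [ "$op_rev" != "$latest_rev" ]; then
    echo "sync-latest"
    return
  fi
  # Level-based: keep retrying the current commit until it succeeds.
  case "$phase" in
    Succeeded) echo "none" ;;
    Failed|Error) echo "retry" ;;
    *) echo "wait" ;;
  esac
}
```

With this rule, `next_action Failed abc def` yields `sync-latest`, while `next_action Failed abc abc` yields `retry`. Whether a `Running` sync should be interrupted by a newer commit is a design choice; this sketch says yes.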
Screenshots
Version
Here's my ArgoCD application
Here's an example of a YAML file that I included in my ArgoCD sync'd repository. It is not a K8s resource so it fails to be applied.