ArgoCD sync stuck in "waiting for healthy state of argoproj.io/Application/abc" after programmatically deleting an application #10561
Comments
This is the output of argocd app get on the "app of apps":
In particular, note the message on the problematic app:
Running the CLI equivalent of the Terminate button on the "stuck" sync on the "app of apps" will fix this problem (until next time).
Is anyone else running into this issue? Or is our use case just so exotic/crazy that no one does anything remotely similar? ;-)
I am seeing exactly the same thing! We are using an app of apps with a few more layers: essentially a config app and then a child shared Helm chart, so you can be four layers deep, i.e. Kustomize -> app-of-apps Helm -> templated namespace, sealed secrets, Argo app -> final Argo app. It frequently gets stuck on the app-of-apps Helm chart: it seems to sync the two apps below it fine and the health checks pass, then within a few seconds it deletes the app and tries to sync it in again, and it gets a bit stuck with: application.argoproj.io/redacted-source unchanged. Warning: Detected changes to resource redacted-source which is currently being deleted. If you terminate the app-of-apps sync as you suggest, it starts working. I feel that there are two bugs here:
The only other thing worth mentioning is that there is probably an Argo CLI process running which has run "/usr/local/bin/argocd app wait $appName --timeout 300 --insecure", so there might be side effects from that?
Unsure if helpful, but I googled my errors, which were the following:
Adding memory to the application-controller resolved this for me.
We also see a similar issue: in the app-of-apps pattern with sync waves in place on the applications, if the application with the highest priority is not in a healthy state and deletion of the master parent app is initiated, it is stuck unless we terminate the already in-progress sync (waiting for healthy state of the application).
I think we're hitting this too when removing a child app in an app-of-apps. We do an
Ah, I added a sleep between the
Ha! And I just found my issue on this from 2021. #5675
Hi,
we are observing a problem with our ArgoCD-based deployment (originally we used 2.0.3, but updated to the latest 2.4.11, where we still observe the same behavior). What we do is probably a bit unorthodox, but let me first try to briefly explain what we do and how it fails - if you are interested in the "Why", I wrote a bit about that at the end of the description. Maybe there are much better ways to do what we want to do - I would also be happy to hear about such proposals.
We have an ArgoCD application that only consists of a single K8s manifest that creates a namespace (the "namespace app"). With another ArgoCD application we deploy our actual "product" (the "product app", which uses a Helm chart) into this namespace. Optionally, there are also some more helper apps for each product app. Together, the namespace app, product app, and the helper apps form what we call a "slot".
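To make the setup a bit more concrete, here is a rough sketch of what such a namespace app looks like as an ArgoCD Application. All names, the repo URL and the paths are made up for this example; in reality the manifest is generated into our GitOps repo and picked up by the master app, it is only applied with kubectl here for illustration:
```
# Hypothetical sketch of a slot's "namespace app"; names, repo URL and paths are made up.
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slot-1-namespace            # the "namespace app" of slot 1
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/gitops-repo.git   # placeholder GitOps repo
    targetRevision: main
    path: slots/slot-1/namespace    # contains only the Namespace manifest
  destination:
    server: https://kubernetes.default.svc
    namespace: slot-1
EOF
# The "product app" looks similar, except that its source points at our product Helm chart
# and it deploys into the namespace created above.
```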
This approach works fine for the initial deployment. Note that we have not just one such pair of namespace app and product app in our cluster, but many. All these apps are subresources of a single "app of apps"/"master app", which is set to autosync.
What we now want to do is be able to "reset" the application (i.e., not just roll out an update, but have a completely fresh deployment, in particular empty persistence etc.). Our approach to do this is to use the K8s API to delete these two ArgoCD applications (product & namespace app, plus any "helpers"). Note that we explicitly want to include the namespace app in the deletion, as a number of resources are created in that namespace that do not result directly from the Helm chart (e.g., PVCs created via STS volume claim templates). These resources need to be deleted as well to get a really fresh application state, and since we do not want to explicitly find or list all these resources in our cleaning code, we hoped to simply delete them indirectly by deleting the namespace app, which would then cause the deletion of its only resource (the namespace), which would in turn delete all the namespaced resources it contains. We originally just deleted the namespace directly and did not touch the ArgoCD apps themselves, but that caused problems with updating non-namespaced resources (e.g., ServiceAccounts, ClusterRoles, StorageClasses). By deleting the app, we want to capture these resources as well.
The idea is that after deleting all of a slot's apps, the "app of apps" - being set to auto sync - will re-create the slot apps we just deleted (but of course with the new image versions/settings/etc.)
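For illustration, the kubectl equivalent of what we do through the K8s API is roughly the following; the app names and the argocd namespace are placeholders. Note that, in general, such a deletion only cascades to the app's resources (and thus to the namespace) if the Application carries ArgoCD's resources-finalizer.argocd.argoproj.io finalizer:
```
# Rough kubectl equivalent of what our cleaner does via the K8s API; app names are placeholders.

# Ensure cascading deletion: without this finalizer, deleting an Application does not
# delete the resources it manages (skip if the finalizer is already in the manifests).
for app in slot-1-product slot-1-namespace; do
  kubectl -n argocd patch applications.argoproj.io "$app" --type merge \
    -p '{"metadata":{"finalizers":["resources-finalizer.argocd.argoproj.io"]}}'
done

# Delete the slot's apps; deleting the namespace app removes the namespace and, with it,
# all the namespaced resources (PVCs, pods, ...).
kubectl -n argocd delete applications.argoproj.io slot-1-product slot-1-namespace

# Wait until the apps are really gone before the master app is allowed to re-create them.
kubectl -n argocd wait --for=delete \
  applications.argoproj.io/slot-1-product applications.argoproj.io/slot-1-namespace \
  --timeout=10m
```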
This approach works in principle. Sometimes, however, it fails, and we end up with our "app of apps" in an "out of sync"/"syncing" state, with the sync progress stuck in "waiting for healthy state of NamespaceApp", see the screenshot below. In the UI the namespace app is shown in the "missing" state (yellow Pacman ghost icon), as is the product app (and some auxiliary apps that would also deploy resources into the same namespace as the product app itself).
When looking at the apps with kubectl, neither the product nor the namespace app is shown at all:
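(The check is essentially the following, with the Application resources in the default argocd namespace; the two apps are simply absent from the listing:)
```
# Listing the Application resources; the argocd namespace is the default location and an
# assumption here. While the master app is stuck, the slot's product and namespace apps
# do not show up in this output at all.
kubectl -n argocd get applications.argoproj.io
```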
We scaled down the argocd-server deployment to simplify log collection. When the issue occurs, this is what it logs (dts-az-we-1 is the name of the "master app"):
Our guesswork on what could cause this state didn't get us very far. One theory is that since the namespace contains a lot of resources (in particular pods) which take a while to terminate, the namespace itself stays in the "Terminating" state for a while. Only when it is finally gone does the namespace app (which is in the "Progressing" state during that time) also disappear. Perhaps at this point ArgoCD makes a call to check the state of the namespace app, which no longer exists; maybe that code does not take this case into consideration, and that is what derails the sync of the master app, which then no longer makes progress.
To get the cluster out of this state, the only remedy we found is to access the Argo UI and terminate the sync (there is probably an API call for that, but we are not yet desperate enough to automate a workaround ;-)).
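(For the record, should we ever get desperate enough: the workaround can apparently also be scripted via the ArgoCD CLI, along these lines; server address and credentials are placeholders, dts-az-we-1 is the master app:)
```
# Sketch of automating the workaround instead of clicking Terminate in the UI;
# server address and credentials are placeholders.
argocd login argocd.example.com --username admin --password "$ARGOCD_PASSWORD"
argocd app terminate-op dts-az-we-1   # terminates the in-progress sync of the master app
```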
Any help is appreciated. If you think that what we do is incredibly complicated (we think it is, but once you start solving one problem after the other by adding yet another ArgoCD app, it is hard to stop... ;-)) and that there are much better ways to achieve what we want, please share them.
Regards
J
PS: A few words about the motivation behind our "unorthodox" method, to give some context, and maybe to bring forward some suggestions for alternative ways to solve what we are trying to do here. As I hope became clear in my writeup above, we also find the approach somewhat "awkward", to say the least, but we haven't come up with anything better yet, in particular considering some of the other constraints and requirements we want to cover.
What we are attempting with the above-sketched (ab)use of ArgoCD is to automate the deployment of test systems of our application in our CI/CD pipeline. The idea is to have one (or a few) clusters, into which the respective versions of the application are then deployed, each into what we call a "slot".
With the slots we solve a few technical challenges (not directly related to ArgoCD, but to our in-house cloud platform, of which ArgoCD is just one part), and the slots also allow us to allocate resources to different teams ("slots 1 and 2 belong to team A, slots 3 to 6 belong to team B, ..." - just as it was with physical and later virtualized servers in the pre-cloud days).
For each slot, we generate files that describe the slot's ArgoCD applications (product, namespace, auxiliaries) and additional files (like Helm chart "values files", additional K8s manifests, etc.). The result is then processed with Kapitan and submitted to our GitOps repo, from where ArgoCD is meant to watch for the changes and apply them to the cluster.
The problem mentioned above results from our approach to attempt to "clean" a slot, in order to reuse it. We oftentimes do not want to just "update" the deployment in a slot (e.g., deploy new versions of an image), but want a fresh and "unspoiled" application. So the idea was to delete a slot's Argo applications and recreate them (by autosync of the master app).
We wanted, however, to avoid two commits to our GitOps repo (one that would delete the slot apps, the other that would recreate them). One reason was that we didn't want to implement a "wait for deletion to complete". We also wanted the deletion to happen inside the cluster, primarily to avoid concurrency issues from parallel submits. This is why we run the actual cleaners that delete a slot's ArgoCD apps as apps in the cluster as well (one cleaner per slot). The cleaner is a simple deployment (one pod) that uses the K8s API to delete the slot's apps (all except the cleaner itself), waits for the deletion to complete (yes, we didn't manage to avoid that, but it is much easier to do inside the cluster) and only then goes to the ready state. Using ArgoCD's sync-wave feature, we make sure that the cleaner syncs first. Whenever a slot is refreshed, we just replace the cleaner app with a new one (so that the pod restarts and cleans again).
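To illustrate the ordering: the cleaner's Application manifest carries a lower sync wave than the rest of the slot. Names, repo URL and wave numbers below are made up, and the manifest really lives in the GitOps repo; it is shown as a kubectl apply only for brevity:
```
# Illustrative sync-wave ordering for a slot: the cleaner gets the lowest wave, so the
# master app syncs it first and waits for it to become healthy before the other apps.
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slot-1-cleaner
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # namespace/product apps use higher waves (0, 1, ...)
spec:
  project: default
  source:
    repoURL: https://example.com/gitops-repo.git   # placeholder
    targetRevision: main
    path: slots/slot-1/cleaner           # Deployment of the one-pod cleaner
  destination:
    server: https://kubernetes.default.svc
    namespace: slot-1-cleaner
EOF
```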
Note that we are well aware that what the cleaner is expected to do practically cries out for "use a K8s Job for this", and that is what we did at first (since you cannot update a Job's pod spec, we simply created a Job under a new name in the slot's cleaner namespace), but we observed that the sync-wave feature didn't work that well with that approach, probably because this app contained only a Job that was already in a "Completed" state (from the previous cleaning).