
ArgoCD sync stuck in "waiting for healthy state of argoproj.io/Application/abc" after programmatically deleting an application #10561

Open
jgoeres opened this issue Sep 9, 2022 · 8 comments
Labels: bug, sync-waves

Comments


jgoeres commented Sep 9, 2022

Hi,

We are observing a problem with our ArgoCD-based deployment (originally on 2.0.3, since updated to the latest 2.4.11, where we still see the same behavior). What we do is probably a bit unorthodox, so let me first briefly explain what we do and how it fails; if you are interested in the "why", I wrote a bit about that at the end of the description. Maybe there are much better ways to do what we want - I would be happy to hear about such proposals.

We have an ArgoCD application that only consists of a single K8s manifest that creates a namespace (the "namespace app"). With another ArgoCD application we deploy our actual "product" (the "product app", which uses a Helm chart) into this namespace. Optionally, there are also some more helper apps for each product app. Together, the namespace app, product app, and the helper apps form what we call a "slot".
This approach works fine for the initial deployment. Note that we not only have one such pair of namespace app and product app in our cluster, but many. All these apps are subresources of a single "app of apps"/"master app", which is set to autosync.
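
For illustration, the "master app" roughly corresponds to something like the following (a hedged sketch using the argocd CLI; in reality all our Application manifests live in the GitOps repo, and the name, repo URL and path here are hypothetical):

$ argocd app create master-app \
    --repo https://git.example.com/our-gitops-repo.git \
    --path compiled/<cluster>/applications/argocd \
    --dest-server https://kubernetes.default.svc \
    --dest-namespace management \
    --sync-policy automated \
    --auto-prune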

What we now want is to be able to "reset" such an application (i.e., not just roll out an update, but get a completely fresh deployment, in particular with empty persistence). Our approach is to use the K8s API to delete the slot's ArgoCD applications (product and namespace app, plus any "helpers"). We explicitly want to include the namespace app in the deletion, because a number of resources are created in that namespace that do not result directly from the Helm chart (e.g., PVCs created via StatefulSet volume claim templates). These resources need to be deleted as well to get a really fresh application state, and since we do not want to explicitly find and list all of them in our cleanup code, we hoped to delete them indirectly by deleting the namespace app, which would delete its only resource (the namespace), which in turn deletes all the namespaced resources it contains. We originally just deleted the namespace directly and did not touch the ArgoCD apps themselves, but that caused problems with updating non-namespaced resources (e.g., ClusterRoles, StorageClasses). By deleting the app, we capture those resources as well.
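
For illustration, the deletion step boils down to something like this (a hedged sketch with kubectl; our actual code calls the K8s API programmatically, the product/helper app names are hypothetical, and we assume the Applications carry the resources-finalizer.argocd.argoproj.io finalizer so that deleting an app cascades to the resources it manages):

$ kubectl -n management delete applications.argoproj.io \
    slot3-namespace-4-product slot3-product slot3-helper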

The idea is that after deleting all of a slot's apps, the "app of apps" - being set to auto sync - will re-create the slot apps we just deleted (but of course with the new image versions/settings/etc.)

This approach works in principle. Sometimes, however, it fails, and we end up with our "app of apps" in an "out of sync"/"syncing" state, with the sync progress stuck at "waiting for healthy state" of the namespace app; see the screenshot below. In the UI the namespace app is shown in "missing" state (yellow Pacman-ghost icon), as is the product app (and some auxiliary apps that also deploy resources into the same namespace as the product app itself).

[Screenshot: sync status of the "app of apps", stuck at "waiting for healthy state" of the namespace app]

When listing the apps with kubectl, neither the product app nor the namespace app shows up at all:

$ kubectl get applications.argoproj.io  -n <appNamespace>
NAME                                  SYNC STATUS   HEALTH STATUS
[...]
slot3-cleaner                         Synced        Healthy
slot3-cleaner-namespaces              Synced        Healthy
slot3-product-prerequisites           Synced        Healthy
[...]

We scaled down the argocd-server deployment to simplify log collection. When the issue occurs, this is what it logs (dts-az-we-1 is the name of the "master app"):

time="2022-09-09T14:06:07Z" level=info msg="received unary call /application.ApplicationService/GetResource" grpc.method=GetResource grpc.request.claims="{\"exp\":1662810505,\"iat\":1662724105,\"iss\":\"argocd\",\"jti\":\"b52424e8-bf87-41d1-aec7-7861e013f3cd\",\"nbf\":1662724105,\"sub\":\"admin\"}" grpc.request.content="name:\"dts-az-we-1\" namespace:\"management\" resourceName:\"slot3-namespace-4-product\" version:\"v1alpha1\" group:\"argoproj.io\" kind:\"Application\" " grpc.service=application.ApplicationService grpc.start_time="2022-09-09T14:06:07Z" span.kind=server system=grpc
time="2022-09-09T14:06:07Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = Application argoproj.io slot3-namespace-4-product not found as part of application dts-az-we-1" grpc.code=InvalidArgument grpc.method=GetResource grpc.service=application.ApplicationService grpc.start_time="2022-09-09T14:06:07Z" grpc.time_ms=15.507 span.kind=server system=grpc
time="2022-09-09T14:06:07Z" level=info msg="received unary call /account.AccountService/CanI" grpc.method=CanI grpc.request.claims="{\"exp\":1662810505,\"iat\":1662724105,\"iss\":\"argocd\",\"jti\":\"b52424e8-bf87-41d1-aec7-7861e013f3cd\",\"nbf\":1662724105,\"sub\":\"admin\"}" grpc.request.content="resource:\"logs\" action:\"get\" subresource:\"default/dts-az-we-1\" " grpc.service=account.AccountService grpc.start_time="2022-09-09T14:06:07Z" span.kind=server system=grpc
time="2022-09-09T14:06:07Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=CanI grpc.service=account.AccountService grpc.start_time="2022-09-09T14:06:07Z" grpc.time_ms=7.568 span.kind=server system=grpc

Our guesswork about the cause hasn't gotten us very far. One theory: since the namespace contains a lot of resources (in particular pods) which take a while to terminate, the namespace itself stays in "Terminating" state for a while. Only when it is finally gone does the namespace app (which is in "Progressing" state during that time) disappear as well. Perhaps at this point ArgoCD checks the state of the namespace app, which no longer exists, a case that the sync code does not seem to handle, and this is what derails the sync of the master app, which then no longer makes progress.

To get the cluster out of this state, the only remedy we have found is to open the Argo UI and terminate the sync (there is probably an API call for that, but we are not yet desperate enough to automate a workaround ;-)).
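
For reference, terminating the running operation also seems to be possible via the REST API (an assumption based on the ArgoCD API docs rather than something we have automated yet; the server address and token are placeholders):

$ curl -k -X DELETE \
    -H "Authorization: Bearer $ARGOCD_TOKEN" \
    https://<argocd-server>/api/v1/applications/dts-az-we-1/operation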

Any help is appreciated. If you think that what we do is overly complicated (we think it is, but once you start solving one problem after the other by adding yet another ArgoCD app, it is hard to stop... ;-)) and there are much better ways to achieve what we want, please share them.

Regards

J

PS: A few words about the motivation behind our "unorthodox" method, to give some context and maybe invite suggestions for alternative ways to solve what we are trying to do here. As I hope became clear in the writeup above, we also find the approach somewhat "awkward", to say the least, but we haven't come up with anything better yet, in particular considering some of the other constraints and requirements we want to cover.

What we are attempting with the (ab)use of ArgoCD sketched above is to automate the deployment of test systems of our application in our CI/CD pipeline. The idea is to have one (or a few) clusters, to which the respective versions of the application are deployed into what we call "slots".
With the slots we solve a few technical challenges (not directly related to ArgoCD, but to our in-house cloud platform, of which ArgoCD is just one part), and the slots also allow us to allocate resources to different teams ("slots 1 and 2 belong to team A, slots 3 to 6 belong to team B, ..."), just as it was with physical and later virtualized servers in the pre-cloud days.
For each slot, we generate files that describe the slot's ArgoCD applications (product, namespace, auxiliaries) and additional files (Helm chart values files, additional K8s manifests, etc.). The result is then processed with Kapitan and pushed to our GitOps repo, from where ArgoCD watches for the changes and applies them to the cluster.

The problem mentioned above results from our approach to "cleaning" a slot in order to reuse it. We often do not want to just "update" the deployment in a slot (e.g., deploy new versions of an image), but want a fresh and "unspoiled" application. So the idea was to delete a slot's Argo applications and have them recreated (by autosync of the master app).
However, we wanted to avoid two commits to our GitOps repo (one that would delete the slot apps, another that would recreate them). One reason was that we didn't want to implement a "wait for deletion to complete" outside the cluster. We also wanted the deletion to happen inside the cluster, primarily to avoid concurrency issues from parallel submits. This is why the actual cleaner that deletes a slot's ArgoCD apps is itself an app in the cluster (one cleaner per slot). The cleaner is a simple deployment (one pod) that uses the K8s API to delete the slot's apps (all except the cleaner itself), waits for the deletion to complete (yes, we didn't manage to avoid that, but it is much easier to do inside the cluster), and only then reports ready. Using ArgoCD's sync-wave feature, we make sure that the cleaner syncs first. Whenever a slot is refreshed, we just update the cleaner app (so that the pod restarts and cleans again).
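
A rough sketch of the cleaner's logic, expressed with kubectl (our real cleaner uses the K8s API programmatically; the namespace and the label selector are hypothetical, assuming the slot's apps are labelled so the cleaner can find them without deleting itself):

# delete all of the slot's ArgoCD apps except the cleaner itself
$ kubectl -n management delete applications.argoproj.io \
    -l slot=slot3,component!=cleaner --wait=false
# block until the Application objects are really gone;
# only then does the cleaner pod report ready
$ kubectl -n management wait applications.argoproj.io \
    -l slot=slot3,component!=cleaner --for=delete --timeout=30m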
Note that we are well aware that what the cleaner does practically cries out for "use a K8s Job for this", and that is what we did at first (since you cannot update a Job's pod spec, we simply created a Job under a new name in the slot's cleaner namespace). However, we observed that the sync-wave feature didn't work that well with this approach, probably because the app then only contained a Job that was already in a "Completed" state (from the previous cleaning).


jgoeres commented Sep 12, 2022

This is the output of argocd app get on the "app of apps":

Name:               dts-az-we-1
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          dts-az-we-1
URL:                xxxxxxxxxxxxxxxxxxxxxx
Repo:               xxxxxxxxxxxxxxxxxxxxxx
Target:             master
Path:               compiled/dts-az-we-1/applications/argocd
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        OutOfSync from master (e1c669c)
Health Status:      Healthy

GROUP        KIND         NAMESPACE    NAME                                 STATUS     HEALTH   HOOK     MESSAGE
             Namespace    dts-az-we-1  management                           Succeeded  Synced            namespace/management unchanged
[...]
argoproj.io  Application  management   slot3-cleaner                        Synced                       application.argoproj.io/slot3-cleaner unchanged
[...]
argoproj.io  Application  management   slot3-namespace-4-product            OutOfSync  Missing           application.argoproj.io/slot3-namespace-4-product unchanged. Warning: Detected changes to resource slot3-namespace-4-product which is currently being deleted.
[...]
argoproj.io  Application  management   slot3-product-prerequisites          OutOfSync  Missing           application.argoproj.io/slot3-product-prerequisites created
[...]
argoproj.io  Application  management   slot3-integrated                     OutOfSync  Missing
argoproj.io  Application  management   slot3-xxxxxxxxxxxxxx                 OutOfSync  Missing
[...]

In particular, note the message on the problematic app:

Warning: Detected changes to resource slot3-namespace-4-product which is currently being deleted.

Running

argocd app terminate-op

(i.e., the CLI equivalent of the Terminate button on the "stuck" sync) on the "app of apps" will fix this problem (until next time).


jgoeres commented Oct 7, 2022

Is anyone else running into this issue? Or is our use case just too exotic for anyone to be doing anything remotely similar? ;-)
In the meantime we have implemented a workaround: a simple script in a pod waits for the "master app" to sync, with a timeout of 300 s. If the timeout expires, we run argocd app terminate-op.
This has been running for 17 days now and has logged more than 120 occurrences of a stuck sync.
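
For the record, the workaround amounts to roughly the following loop (a sketch only; dts-az-we-1 is our master app, and login/credentials handling is omitted):

while true; do
  # wait up to 300 s for the master app to finish syncing
  if ! argocd app wait dts-az-we-1 --timeout 300; then
    # sync appears stuck: terminate the running operation so autosync can retry
    argocd app terminate-op dts-az-we-1
  fi
  sleep 60
done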


AndyBan commented Oct 12, 2022

I am seeing exactly the same thing!

We are using an app of apps with a few more layers: essentially a config app, then a child shared Helm chart, so you can be 4 layers deep, i.e. Kustomize -> app-of-apps Helm -> templated namespace, sealed secrets, Argo app -> final Argo app.

It frequently gets stuck on the app-of-apps Helm chart: it seems to sync the two apps below it fine and the health checks pass, then within a few seconds it deletes the app, tries to create it again, and gets stuck:

application.argoproj.io/redacted-source unchanged. Warning: Detected changes to resource redacted-source which is currently being deleted.

If you terminate the app-of-apps sync as you suggest, it starts working.

I feel that there are two bugs here:

  1. There is some sort of caching issue? It shouldn't be deleting apps mid-sync. I've never seen it remove apps except within the first minute or so. We have prune enabled on our app-of-apps Helm chart; I will turn that off and see if it helps.

  2. The sync is waiting for a healthy state of an app that is being deleted, which is never going to succeed. If the app is being deleted, surely the sync should wait for it to be deleted instead?

The only other thing worth mentioning is that there is probably an argocd CLI process running which has run "/usr/local/bin/argocd app wait $appName --timeout 300 --insecure", so there might be side effects from that.


smark88 commented Jan 17, 2023

Unsure if this is helpful, but I googled "not found as part of application", which brought me to this thread. My applications were not syncing, with similar symptoms, and it turned out the application-controller had hit my memory limit and was OOMKilled.

My errors were the following:

{"error":"rpc error: code = InvalidArgument desc = ClusterExternalSecret external-secrets.io dockerhub-docker-secret not found as part of application external-secrets-gcp-stag01-us-east-4","grpc.code":"InvalidArgument","grpc.method":"GetResource","grpc.service":"application.ApplicationService","grpc.start_time":"2023-01-17T15:46:17Z","grpc.time_ms":2.542,"level":"info","msg":"finished unary call with code InvalidArgument","span.kind":"server","system":"grpc","time":"2023-01-17T15:46:17Z"}


{"grpc.method":"ManagedResources","grpc.request.claims":"{\"at_hash\":\"REH0TNv53oSyTkgJyBPb6A\",\"aud\":\"argocd-influxdata\",\"c_hash\":\"rdaCL2vQ04V_VmK6dSmrcw\",\"email\":\"[email protected]\",\"email_verified\":true,\"exp\":1673971191,\"groups\":[\"[email protected]\",\"[email protected]\",\"[email protected]\",\"[email protected]\"],\"iat\":1673967591,\"iss\":\"https://temp.temp-us-east-1.aws.influxdata.io\",\"name\":\"Mark\",\"sub\":\"ChUxMTI1ODc1MDI0NjM0NjYyMDIxMzYSBmdvb2dsZQ\"}","grpc.request.content":{"applicationName":"external-secrets-gcp-stag01-us-east-4","name":"gar-docker-secret","group":"external-secrets.io","kind":"ClusterExternalSecret"},"grpc.service":"application.ApplicationService","grpc.start_time":"2023-01-17T15:47:55Z","level":"info","msg":"received unary call /application.ApplicationService/ManagedResources","span.kind":"server","system":"grpc","time":"2023-01-17T15:47:55Z"}

Adding memory to the application-controller resolved this for me.
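
For anyone else hitting this, raising the controller's memory limit can be done along these lines (a sketch assuming the default install, where the controller is a StatefulSet named argocd-application-controller in the argocd namespace; the values are just examples):

$ kubectl -n argocd set resources statefulset argocd-application-controller \
    --requests=memory=1Gi --limits=memory=2Gi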

@ShahroZafar

We also see a similar issue: in an app-of-apps pattern with sync waves on the applications, if the application with the highest priority is not in a healthy state and deletion of the master (parent) app is initiated, the deletion gets stuck (waiting for the healthy state of that application) unless we terminate the sync that is already in progress.

@michaelajr

I think we're hitting this too when removing a child app in an app-of-apps.

We do an argocd app sync --prune of the app-of-apps, then an argocd app wait -l argocd.argoproj.io/instance=app-of-apps..., and the wait ends up waiting on pruned child-app resources. Any advice on how to fix this?


michaelajr commented Oct 7, 2024

Ah, I added a sleep between the argocd app sync --prune and the argocd app wait -l argocd.argoproj.io/instance=app-of-apps, and that worked - so it seems to be a race. I think wait needs an --ignore-pruned option.
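
In other words, something like this (the app name and the sleep duration are just what worked for me):

$ argocd app sync app-of-apps --prune
# give the controller a moment to finish removing the pruned child Applications,
# otherwise the wait can latch onto resources that are being deleted
$ sleep 30
$ argocd app wait -l argocd.argoproj.io/instance=app-of-apps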


michaelajr commented Oct 7, 2024

Ha! And I just found my issue on this from 2021. #5675
