Applications from ApplicationSet flip rapidly between "Unknown" and "Synchronised" #16260

Closed
3 tasks done
rbowater opened this issue Nov 7, 2023 · 19 comments
Labels
bug Something isn't working

Comments

@rbowater

rbowater commented Nov 7, 2023

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Since upgrading from ArgoCD 2.8.4 to 2.9.0, our applications generated via ApplicationSets have been flapping multiple times a second between "Synchronised" and "Unknown" in the UI. From diffing the generated application as per https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile/#finding-resources-to-ignore, the sync status and repoURL under status -> sync are constantly flapping between "" and the desired value. I've included the logs further down.

The outcome is that ArgoCD essentially hammers itself with this constant, rapid flipping between states. I've included logs from the application controller which illustrate this behaviour.

We used ArgoCD autopilot to generate our ApplicationSets last year, and I have found that removing ignoreDifferences from the ApplicationSet template spec stops the flapping. I'm not sure if this is expected behaviour, as creating an application directly with ignoreDifferences configured doesn't seem to do this.
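
As a rough illustration of the workaround described above, the sketch below strips the /status pointer from Application ignoreDifferences rules in an ApplicationSet manifest that has already been loaded into a Python dict. The function name drop_status_ignore is hypothetical, and the sketch assumes the manifest layout shown in this issue (spec.template.spec.ignoreDifferences). It does not decide whether removing the rule is safe for a given setup.

```python
import copy

# Keys under which an ignoreDifferences rule can list what it ignores.
_MATCHER_KEYS = ("jsonPointers", "jqPathExpressions", "managedFieldsManagers")


def drop_status_ignore(appset: dict) -> dict:
    """Return a copy of the manifest without the Application /status ignore.

    Hypothetical helper: only the "/status" pointer is removed, and a rule is
    dropped entirely only when nothing else is left for it to ignore.
    """
    result = copy.deepcopy(appset)
    app_spec = result.get("spec", {}).get("template", {}).get("spec", {})
    kept = []
    for rule in app_spec.get("ignoreDifferences", []):
        if rule.get("group") == "argoproj.io" and rule.get("kind") == "Application":
            pointers = [p for p in rule.get("jsonPointers", []) if p != "/status"]
            rule = {**rule, "jsonPointers": pointers}
            if not pointers:
                del rule["jsonPointers"]
            if not any(rule.get(key) for key in _MATCHER_KEYS):
                continue  # the rule only ignored /status, so drop it
        kept.append(rule)
    if kept:
        app_spec["ignoreDifferences"] = kept
    else:
        app_spec.pop("ignoreDifferences", None)
    return result
```

The input dict is left untouched, so the original and the modified manifest can be compared before anything is committed.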

To Reproduce

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  creationTimestamp: null
  name: cluster-resources
  namespace: argocd
spec:
  generators:
    - git:
        files:
          - path: kubernetes/bootstrap/cluster-resources/*.json
        repoURL: github.com/***
        requeueAfterSeconds: 20
        revision: main
        template:
          metadata: {}
          spec:
            destination: {}
            project: ""
            source:
              repoURL: ""
  syncPolicy:
    preserveResourcesOnDeletion: true
  template:
    metadata:
      labels:
        app.kubernetes.io/managed-by: argocd-autopilot
        app.kubernetes.io/name: cluster-resources-{{name}}
      name: cluster-resources-{{name}}
      namespace: argocd
    spec:
      destination:
        server: "{{server}}"
      # Removing this stops the flapping
      ignoreDifferences:
        - group: argoproj.io
          jsonPointers:
            - /status
          kind: Application
      project: default
      source:
        path: kubernetes/bootstrap/cluster-resources/{{name}}
        repoURL: https://github.com/***
        targetRevision: main
      syncPolicy:
        automated:
          allowEmpty: true
          selfHeal: true
status: {}

Where the cluster resources directory contains a file called "in-cluster.json":

{"name":"in-cluster","server":"https://kubernetes.default.svc"}

and a folder called "in-cluster" that contains a namespace definition (argocd-ns.yaml):

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
  creationTimestamp: null
  name: argocd

This isn't the only ApplicationSet where this is happening, but it was the most straightforward reproduction case for us.

Expected behavior

We expect the applications to remain in Synchronised status.

Screenshots

Version

argocd version
argocd: v2.9.0+9cf0c69
  BuildDate: 2023-11-06T04:43:50Z
  GitCommit: 9cf0c69bbe70393db40e5755e34715f30179ee09
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64

Logs

Application controller. This shows that within a single second it repeatedly logs "Updated sync status:  -> Synced":

time="2023-11-07T10:42:04Z" level=info msg="Updated sync status:  -> Synced" application=cluster-resources-in-cluster dest-namespace= dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal
time="2023-11-07T10:42:04Z" level=info msg="Update successful" application=argocd/cluster-resources-in-cluster
time="2023-11-07T10:42:04Z" level=debug msg="Requesting app refresh caused by object update" api-version=argoproj.io/v1alpha1 application=argocd/autopilot-bootstrap cluster-name= fields.level=0 kind=Application name=cluster-resources-in-cluster namespace=argocd server="https://kubernetes.default.svc"
time="2023-11-07T10:42:04Z" level=info msg="Reconciliation completed" application=argocd/cluster-resources-in-cluster dedup_ms=0 dest-name= dest-namespace= dest-server="https://kubernetes.default.svc" diff_ms=2 fields.level=3 git_ms=289 health_ms=0 live_ms=0 patch_ms=11 setop_ms=0 settings_ms=0 sync_ms=0 time_ms=316
time="2023-11-07T10:42:04Z" level=info msg="Refreshing app status (controller refresh requested), level (0)" application=argocd/autopilot-bootstrap
time="2023-11-07T10:42:04Z" level=info msg="No status changes. Skipping patch" application=argocd/autopilot-bootstrap
time="2023-11-07T10:42:04Z" level=info msg="Reconciliation completed" application=argocd/autopilot-bootstrap dest-name= dest-namespace=argocd dest-server="https://kubernetes.default.svc" fields.level=0 patch_ms=0 setop_ms=0 time_ms=6
time="2023-11-07T10:42:04Z" level=debug msg="Requesting app refresh caused by object update" api-version=argoproj.io/v1alpha1 application=argocd/autopilot-bootstrap cluster-name= fields.level=0 kind=Application name=cluster-resources-in-cluster namespace=argocd server="https://kubernetes.default.svc"
time="2023-11-07T10:42:04Z" level=info msg="Refreshing app status (spec.source differs), level (3)" application=argocd/cluster-resources-in-cluster
time="2023-11-07T10:42:04Z" level=info msg="Refreshing app status (controller refresh requested), level (0)" application=argocd/autopilot-bootstrap
time="2023-11-07T10:42:04Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: )" application=argocd/cluster-resources-in-cluster
time="2023-11-07T10:42:04Z" level=debug msg="Generating Manifest for source {https://github.com/*** kubernetes/bootstrap/cluster-resources/in-cluster 2.9-speculative-fix nil nil nil nil  } revision 2.9-speculative-fix"
time="2023-11-07T10:42:04Z" level=info msg="No status changes. Skipping patch" application=argocd/autopilot-bootstrap
time="2023-11-07T10:42:04Z" level=info msg="Reconciliation completed" application=argocd/autopilot-bootstrap dest-name= dest-namespace=argocd dest-server="https://kubernetes.default.svc" fields.level=0 patch_ms=0 setop_ms=0 time_ms=7
time="2023-11-07T10:42:05Z" level=info msg="getRepoObjs stats" application=argocd/cluster-resources-in-cluster build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=298 unmarshal_ms=297 version_ms=0
time="2023-11-07T10:42:05Z" level=debug msg="Retrieved live manifests" application=argocd/cluster-resources-in-cluster
time="2023-11-07T10:42:05Z" level=info msg="Skipping auto-sync: application status is Synced" application=argocd/cluster-resources-in-cluster
time="2023-11-07T10:42:05Z" level=info msg="Updated sync status:  -> Synced" application=cluster-resources-in-cluster dest-namespace= dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal
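
To put a number on the flapping, controller log lines like the ones above can be tallied per application and per second. This is a minimal sketch that assumes the key=value log format shown in this issue; parse_line and count_status_updates are hypothetical helper names, not part of Argo CD.

```python
import re
from collections import Counter

# Matches key=value and key="quoted value" pairs; the value may be empty.
_PAIR = re.compile(r'(\w[\w.-]*)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line: str) -> dict:
    """Return the key/value pairs found in one log line."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1]  # strip the surrounding quotes
        fields[key] = raw
    return fields


def count_status_updates(lines):
    """Count "Updated sync status" messages per (application, timestamp).

    The timestamps in these logs have one-second resolution, so each key
    corresponds to one application within one second.
    """
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("msg", "").startswith("Updated sync status"):
            counts[(fields.get("application", ""), fields.get("time", ""))] += 1
    return counts
```

Feeding the full controller log through count_status_updates and sorting by count gives a quick view of which applications are being rewritten most often.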

Diff of the application CR. It seems to be rapidly switching between:

  sync:
    comparedTo:
      destination:
        server: https://kubernetes.default.svc
      ignoreDifferences:
      - group: argoproj.io
        jsonPointers:
        - /status
        kind: Application
      source:
        path: kubernetes/bootstrap/cluster-resources/in-cluster
        repoURL: ""
        targetRevision: 2.9-speculative-fix
    revision: cda88b740ba847be6bb94172834e4b6971099956
    status: ""



  sync:
    comparedTo:
      destination:
        server: https://kubernetes.default.svc
      ignoreDifferences:
      - group: argoproj.io
        jsonPointers:
        - /status
        kind: Application
      source:
        path: kubernetes/bootstrap/cluster-resources/in-cluster
        repoURL: https://github.com/***
        targetRevision: 2.9-speculative-fix
    revision: cda88b740ba847be6bb94172834e4b6971099956
    status: Synced
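
The two snapshots above differ in only two fields. A small sketch like the following makes that explicit by flattening both snapshots (already loaded into Python dicts) and listing the paths whose values differ. The helper names flatten and changed_fields are hypothetical, and empty dicts or lists are ignored by this simple version.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts and lists into {"a.b[0].c": value} pairs."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = node
    return flat


def changed_fields(before, after):
    """Return {path: (old, new)} for every leaf that differs."""
    old, new = flatten(before), flatten(after)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(old.keys() | new.keys())
        if old.get(path) != new.get(path)
    }


def snapshot(repo_url, status):
    """Build one status snapshot with the values quoted in this issue."""
    return {"sync": {
        "comparedTo": {
            "destination": {"server": "https://kubernetes.default.svc"},
            "ignoreDifferences": [{"group": "argoproj.io",
                                   "jsonPointers": ["/status"],
                                   "kind": "Application"}],
            "source": {"path": "kubernetes/bootstrap/cluster-resources/in-cluster",
                       "repoURL": repo_url,
                       "targetRevision": "2.9-speculative-fix"},
        },
        "revision": "cda88b740ba847be6bb94172834e4b6971099956",
        "status": status,
    }}


diff = changed_fields(snapshot("", ""), snapshot("https://github.com/***", "Synced"))
```

Here diff contains exactly two paths, sync.comparedTo.source.repoURL and sync.status, which matches the flapping described in the report.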
@rbowater added the bug (Something isn't working) label Nov 7, 2023
@rumstead
Member

rumstead commented Nov 9, 2023

Why are you ignoring the status of the applications?

      ignoreDifferences:
      - group: argoproj.io
        jsonPointers:
        - /status

@ddevaal

ddevaal commented Nov 9, 2023

We are running into the same problem. The bug is present in all versions from 2.8.4 to 2.9.0. We are using the same type of setup as @rbowater.

@rumstead
Member

rumstead commented Nov 9, 2023

Why are you ignoring the status of the applications?

      ignoreDifferences:
      - group: argoproj.io
        jsonPointers:
        - /status

With the status ignored, the application's .status.sync.status flicks from Synced to "". The UI displays the change until the status is restored.

It looks like the applicationset controller is blowing away the application status because it's ignored, and then the application controller restores it. #14743 is when the applicationset code was introduced, but it starts in 2.9.0.

Interestingly enough, there was some refactoring to that code in #15965, which was put in v2.9.1. From a glance, I feel the issue will still be there, but it may be worth a test.
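
The suspected ping-pong between the two controllers can be pictured with a toy model. This is only a sketch of the unconfirmed hypothesis above, not Argo CD's actual code: appset_reconcile, app_controller_reconcile, and simulate are hypothetical names, and the real controllers do far more than this.

```python
import copy


def appset_reconcile(live_app, template_spec):
    """Toy ApplicationSet step: rewrite the Application from the template.

    Assumption for illustration: when /status is listed in ignoreDifferences,
    the live status is not carried over, so the write-back clears it.
    """
    ignores_status = any(
        "/status" in rule.get("jsonPointers", [])
        for rule in template_spec.get("ignoreDifferences", [])
    )
    updated = {
        "spec": copy.deepcopy(template_spec),
        "status": copy.deepcopy(live_app.get("status", {})),
    }
    if ignores_status:
        updated["status"] = {"sync": {"status": ""}}
    return updated


def app_controller_reconcile(live_app):
    """Toy application controller step: recompute and restore the status."""
    updated = copy.deepcopy(live_app)
    updated["status"] = {"sync": {"status": "Synced"}}
    return updated


def simulate(template_spec, rounds):
    """Alternate the two steps and record the sync status after each one."""
    app = {"spec": copy.deepcopy(template_spec),
           "status": {"sync": {"status": "Synced"}}}
    observed = []
    for _ in range(rounds):
        app = appset_reconcile(app, template_spec)
        observed.append(app["status"]["sync"]["status"])
        app = app_controller_reconcile(app)
        observed.append(app["status"]["sync"]["status"])
    return observed
```

With the /status rule present, the recorded statuses alternate between "" and "Synced"; without it, the status stays "Synced" throughout. That mirrors the reported symptom, but it does not prove this is the real cause.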

@ddevaal

ddevaal commented Nov 10, 2023

> Why are you ignoring the status of the applications?
>
>       ignoreDifferences:
>       - group: argoproj.io
>         jsonPointers:
>         - /status

argo-cd autopilot adds that to almost everything. Is it safe to remove?

@ddevaal

ddevaal commented Nov 10, 2023

I have updated to the latest ArgoCD CRDs (manifest dir in the 2.9.0 release) and CPU usage has gone back to normal. Maybe try that too.

@rumstead
Member

> > Why are you ignoring the status of the applications?
> >
> >       ignoreDifferences:
> >       - group: argoproj.io
> >         jsonPointers:
> >         - /status
>
> argo-cd autopilot adds that to almost everything. Is it safe to remove?

If you are committing the status as part of the spec into Git, I suppose there is a chance that Argo sees that as out of sync. You can remove status from the CRs you have committed in Git.

@Hronom
Contributor

Hronom commented Nov 10, 2023

Same here: all good in 2.8.6, but after upgrading to 2.9.0 it started flickering.

@Hronom
Contributor

Hronom commented Nov 11, 2023

cc @crenshaw-dev

@flickers
Contributor

We are seeing the same issue, but we don't have ignoreDifferences on /status,
so this might not be related to that.

@thober35

Could relate to #15299
We are only seeing this for Applicationsets without syncStrategy

@flickers
Contributor

> Could relate to #15299
> We are only seeing this for Applicationsets without syncStrategy

Do you mean that you have spec.syncPolicy.applicationSync defined in some ApplicationSets, and the issue doesn't happen for those?

https://argo-cd.readthedocs.io/en/latest/operator-manual/applicationset/Controlling-Resource-Modification/#managed-applications-modification-policies

If so what value are you using?
We are using the following in all applicationSets

spec:
  syncPolicy:
    applicationSync: {}

I presume that translates to the following, since sync is the default when --policy is not defined, as stated in the documentation:

spec:
  syncPolicy:
    applicationSync: sync

@thober35

No, I am talking about the RollingSync / Progressive Sync feature:
https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Progressive-Syncs/

For ApplicationSets with spec.strategy.type set to RollingSync, this issue does not occur on our side, while we still observe this issue for other ApplicationSets.

@flickers
Contributor

Could this #16085 be the culprit?
I tried controller.repo.error.grace.period.seconds: "180" in argocd-cmd-params-cm (configmap), but I'm still seeing this issue.

It would be nice to get some comments from the ArgoCD maintainers on this. This seems to be the most active issue since ArgoCD 2.9.0:
https://github.com/argoproj/argo-cd/issues?q=is%3Aissue+is%3Aopen+sort%3Acomments-desc+created%3A%3E2023-11-06

@crenshaw-dev
Member

I think we're seeing this at Intuit, too. Looking into it...

@crenshaw-dev
Member

This reproduces the issue in 2.9.0:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: bug-16260
  namespace: argocd
spec:
  generators:
   - git:
       directories:
         - path: helm-guestbook
       repoURL: https://github.com/argoproj/argocd-example-apps.git
       requeueAfterSeconds: 10
       revision: master
  template:
    metadata:
      name: bug-16260
    spec:
      project: default
      source:
        repoURL: https://github.com/argoproj/argocd-example-apps.git
        path: helm-guestbook
      destination:
        server: https://kubernetes.default.svc
        namespace: default
      ignoreDifferences:
      - group: argoproj.io
        jsonPointers:
        - /status
        kind: Application

@crenshaw-dev
Member

I can't reproduce the issue on release-2.9 now that #16299 is merged. In the interest of time, I'll skip the deep-dive into why this bug exists and instead cut 2.9.1. If other weirdness appears, we'll tackle that as it comes. :-) Thanks everyone for your patience and help! I'll post here again when 2.9.1 is out.

@math3vz

math3vz commented Nov 14, 2023

@crenshaw-dev do we have an ETA on 2.9.1?

Just saw it got released 15 minutes ago, thanks a lot!

@flickers
Contributor

flickers commented Nov 14, 2023

v2.9.1 seems to have fixed this issue for us!
Thanks a lot @crenshaw-dev 👍
The Repo Server is also back to normal; I had seen some anomalies there that I initially failed to mention.

@rbowater
Author

We've rolled this out too and I agree that it seems to be fixed in 2.9.1. Thanks a lot for picking this up so quickly!
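
For anyone checking a fleet of installations, the version string printed by argocd version can be compared against 2.9.1, the release that participants in this thread report as fixed. This is a minimal sketch: parse_version and at_least are hypothetical helpers, and a result of False only means the version is older than 2.9.1, not that it is affected (2.8.6 was reported as fine above).

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from text such as "argocd: v2.9.0+9cf0c69"."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def at_least(version_text, minimum=(2, 9, 1)):
    """Return True when the parsed version is at or above the given minimum."""
    return parse_version(version_text) >= minimum
```

Tuple comparison keeps the check correct for multi-digit components, for example 2.10.0 compares as newer than 2.9.1.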


8 participants