
progressDeadlineAbort causes degraded Rollout outside of a deployment #1624

Closed
bpoland opened this issue Nov 4, 2021 · 13 comments · Fixed by #1649
Labels:    bug (Something isn't working), cherry-pick/release-1.1

Comments

@bpoland
Contributor

bpoland commented Nov 4, 2021

Summary

It seems that with progressDeadlineAbort: true, some Rollouts randomly get marked as degraded when no deployment is in progress. This has been happening repeatedly since I turned on progressDeadlineAbort (and stops happening when I turn it back off).

When the issue happens, the Rollout is left like this:

Name:            name
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 216
Strategy:        Canary
  Step:          20/20
  SetWeight:     0
  ActualWeight:  100
Images:          image:tag (stable)
Replicas:
  Desired:       3
  Current:       3
  Updated:       3
  Ready:         3
  Available:     3

NAME                                    KIND         STATUS        AGE  INFO
⟳ name                                 Rollout      ✖ Degraded    22d
├──# revision:216
│  ├──⧉ name-7bc889bb44                ReplicaSet   ✔ Healthy     64m  stable
│  │  ├──□ name-7bc889bb44-79cvk       Pod          ✔ Running     63m  ready:2/2
│  │  ├──□ name-7bc889bb44-5nwqw       Pod          ✔ Running     59m  ready:2/2
│  │  └──□ name-7bc889bb44-nqqt8       Pod          ✔ Running     53m  ready:2/2
│  └──α name-7bc889bb44-216            AnalysisRun  ✔ Successful  64m  ✔ 92
├──# revision:215
│  ├──⧉ name-5f5d69cb8                 ReplicaSet   • ScaledDown  98m
│  └──α name-5f5d69cb8-215             AnalysisRun  ✔ Successful  98m  ✔ 108
└──# revision:214
   ├──⧉ name-99f85845f                 ReplicaSet   • ScaledDown  21h
   └──α name-99f85845f-214             AnalysisRun  ✔ Successful  21h  ✔ 103

Luckily, with the changes in 1.1, having SetWeight: 0 does not actually disrupt traffic, but it is still concerning. I would expect progressDeadlineAbort not to abort after the deployment is complete.
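For reference, the option under discussion is set on the Rollout spec roughly like this (a minimal sketch only; progressDeadlineSeconds and progressDeadlineAbort are the actual spec fields, while the name, replica count, and steps are placeholders, and the selector/template are omitted):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: name
spec:
  replicas: 3
  # how long the Progressing condition may stall before it is considered timed out (default 600s)
  progressDeadlineSeconds: 600
  # the option this issue is about: automatically abort the update when the deadline is exceeded
  progressDeadlineAbort: true
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause:
            duration: 60
  # selector and template omitted for brevity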

Diagnostics

Controller 1.1

Here are the redacted logs for a rollout that had the issue. You can see that the rollout gets aborted due to the progress deadline at 2021-11-04T14:46:54Z, nearly half an hour after the rollout completed at 2021-11-04T14:18:15Z.

https://gist.github.com/bpoland/537129feb598ae208a01463aeb347464


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@bpoland added the bug (Something isn't working) label Nov 4, 2021
@harikrongali
Contributor

@bpoland can you also post the output of

kubectl get rollout name -o yaml
and
kubectl get rs name-7bc889bb44  -o yaml

@bpoland
Contributor Author

bpoland commented Nov 5, 2021

Sorry, I didn't grab the exact one matching the above, but the only difference would be the image used (which is redacted anyway :)).

The rollout/rs below were in the same state when I grabbed them.

rollout yaml: https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05
replicaset yaml: https://gist.github.com/bpoland/8b8bc50c91a05b4f239c7a4bc6a8a30f

I've been investigating more and noticed that our containers occasionally go not-ready or move around between k8s nodes, which I think causes the rollout to go back into a Progressing state even when there is no deployment in progress. Is it possible that if the rollout stays below full readiness for an entire progress deadline because pods are not ready, that would cause this issue?
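(A quick way to confirm that readiness is flapping and pods are moving between nodes is to watch the pods directly; the app=name label selector here is a placeholder for whatever labels the Rollout's pods actually carry:)

kubectl get pods -l app=name -o wide --watch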

@harikrongali
Contributor

Are the pods going from ready --> not ready? I need to validate that use case; it's possible that pods going from ready to not ready are causing the issue. Probably an edge case.

One question: does the rollout become healthy again after the pods become ready, or does it stay degraded after that?

@bpoland
Contributor Author

bpoland commented Nov 5, 2021

Yes, some pods seem to be going ready, then not ready, then ready again. It seems like that might be causing the auto-abort.

Once it gets into a degraded state because of the auto-abort, it stays degraded even after all pods are ready again. Clicking the "retry" button in the dashboard (or, I assume, doing the same with the CLI) immediately makes the rollout healthy again, but that shouldn't be necessary -- the auto-abort should only kick in while a canary is actually in progress.
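(For reference, the CLI equivalent of the dashboard retry button should be the kubectl plugin's retry command; the rollout name and namespace here are placeholders:)

kubectl argo rollouts retry rollout name -n default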

@jessesuen
Member

Yes, I think that must be the case. progressDeadlineAbort probably doesn't account for a long-running Rollout suddenly becoming unready outside of an upgrade window, which makes it immediately go Degraded.

/cc @huikang

@bpoland
Contributor Author

bpoland commented Nov 8, 2021

Maybe the Rollout should not be marked as Progressing when a deployment is not in progress, and a pod goes non-ready? That also seems a little strange to me, and might be related?

@agill17

agill17 commented Nov 9, 2021

I may be noticing a similar behavior with progressDeadlineAbort enabled. In my case, sometimes when I rollout restart (to recycle pods) and the pods become unhealthy (because the probes fail while the app is not fully up and running yet), the restart goes through successfully but the rollout is marked degraded. Another thing I noticed after the restart was done and the rollout was marked degraded: the rollouts controller kept adding and removing the scale-down deadline annotation on the RS (and was stuck in that loop until I deleted the entire rollout object, which is not great).

Below are the logs that continue after the pods have successfully restarted: the rollout is marked degraded, and the stable RS keeps getting updated to add/remove the scale-down deadline annotation.

time="2021-11-09T18:58:31Z" level=info msg="Scale down new rs 'app-5b7fdc469c' on abort (30s)" namespace=default rollout=app
time="2021-11-09T18:58:31Z" level=info msg="Enqueueing parent of default/app-5b7fdc469c: Rollout default/app"
time="2021-11-09T18:58:31Z" level=info msg="Set 'scale-down-deadline' annotation on 'app-5b7fdc469c' to 2021-11-09T18:59:01Z (30s)" namespace=default rollout=app

@huikang
Member

huikang commented Nov 9, 2021

Maybe the Rollout should not be marked as Progressing when a deployment is not in progress, and a pod goes non-ready? That also seems a little strange to me, and might be related?

@bpoland, are you using workloadRef in your rollout, since you mentioned a deployment is not in progress?

UPDATE: I think @bpoland uses workloadRef by looking at the spec https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05#file-rollout-yaml-L151

@agill17 are you facing the issue with workloadRef too?

@huikang closed this as completed Nov 9, 2021
@huikang reopened this Nov 9, 2021
@huikang
Member

huikang commented Nov 9, 2021

The logic to abort a timed-out rollout is:

// If condition is changed and ProgressDeadlineAbort is set, abort the update
if condChanged {
	if c.rollout.Spec.ProgressDeadlineAbort {
		c.pauseContext.AddAbort(msg)
		c.recorder.Warnf(c.rollout, record.EventOptions{EventReason: conditions.RolloutAbortedReason}, msg)
	}
} else {
	// Although the condition is unchanged, ProgressDeadlineAbort can be set after
	// an existing update timeout. In this case, if the update is not aborted, we need to abort.
	if c.rollout.Spec.ProgressDeadlineAbort && c.pauseContext != nil && !c.pauseContext.IsAborted() {
		c.pauseContext.AddAbort(msg)
		c.recorder.Warnf(c.rollout, record.EventOptions{EventReason: conditions.RolloutAbortedReason}, msg)
	}
}

Looking into how to reproduce the error.
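One possible direction for a fix (a minimal, self-contained sketch only, not necessarily what #1649 implements; the types and function below are simplified stand-ins for the real rollout structs and controller code) is to only honor ProgressDeadlineAbort while an update is actually in progress, i.e. while the newest ReplicaSet has not yet been promoted to stable:

package main

import "fmt"

// Simplified stand-ins for the relevant parts of the Rollout object.
type RolloutSpec struct {
	ProgressDeadlineAbort bool
}

type RolloutStatus struct {
	StableRS       string // pod-template hash of the stable ReplicaSet
	CurrentPodHash string // pod-template hash of the desired (newest) ReplicaSet
}

type Rollout struct {
	Spec   RolloutSpec
	Status RolloutStatus
}

// shouldAbortOnProgressDeadline returns true only if the abort option is set
// and an update is actually in progress (the newest ReplicaSet is not yet the
// stable one). A fully promoted rollout that merely loses pod readiness would
// not be aborted under this check.
func shouldAbortOnProgressDeadline(ro Rollout) bool {
	if !ro.Spec.ProgressDeadlineAbort {
		return false
	}
	updateInProgress := ro.Status.StableRS == "" || ro.Status.StableRS != ro.Status.CurrentPodHash
	return updateInProgress
}

func main() {
	// Fully promoted rollout (stable == current): progress deadline abort is skipped.
	promoted := Rollout{
		Spec:   RolloutSpec{ProgressDeadlineAbort: true},
		Status: RolloutStatus{StableRS: "7bc889bb44", CurrentPodHash: "7bc889bb44"},
	}
	fmt.Println(shouldAbortOnProgressDeadline(promoted)) // false

	// Mid-update (stable != current): the abort can still fire.
	updating := Rollout{
		Spec:   RolloutSpec{ProgressDeadlineAbort: true},
		Status: RolloutStatus{StableRS: "7bc889bb44", CurrentPodHash: "5f5d69cb8"},
	}
	fmt.Println(shouldAbortOnProgressDeadline(updating)) // true
}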

@agill17

agill17 commented Nov 9, 2021

@agill17 are you facing the issue with workloadRef too?

I am not using a Deployment; I am using the Rollout object directly.
Here is the rollout definition from the cluster (after restarting). As soon as I restarted the rollout, it went into a degraded state. Even after the pods became healthy, the rollout is still marked "Degraded". And the RS that was restarted has the following annotation being added/removed in a loop: scale-down-deadline: '2021-11-09T21:23:17Z'

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    rollout.argoproj.io/revision: '1'
  creationTimestamp: '2021-11-09T19:01:15Z'
  generation: 2
  labels:
    app: app
  name: app
  namespace: default
spec:
  progressDeadlineAbort: true
  replicas: 2
  restartAt: '2021-11-09T21:14:40Z'
  selector:
    matchLabels:
      app: app
  strategy:
    canary:
      analysis:
        args:
          - name: appName
            value: app
          - name: canaryHash
            valueFrom:
              podTemplateHashValue: Latest
        templates:
          - clusterScope: true
            templateName: some-template
      steps:
        - setWeight: 10
        - pause:
            duration: 60
        - setWeight: 90
        - pause:
            duration: 60
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: app-canary
            name: app
            stableSubsetName: app-stable
          virtualService:
            name: app
            routes:
              - app-canary
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - env:
            - name: foo
              value: bar
          image: 'app:latest'
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 7
          name: app
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 7
      serviceAccountName: app
      terminationGracePeriodSeconds: 10
status:
  HPAReplicas: 2
  abort: true
  abortedAt: '2021-11-09T21:14:41Z'
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 5b7fdc469c
        weight: 0
      stable:
        podTemplateHash: 5b7fdc469c
        weight: 100
  conditions:
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: Rollout aborted update to revision 1
      reason: RolloutAborted
      status: 'False'
      type: Progressing
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'False'
      type: Completed
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: Rollout does not have minimum availability
      reason: AvailableReason
      status: 'False'
      type: Available
  currentPodHash: 5b7fdc469c
  currentStepHash: 79c8f5b474
  currentStepIndex: 18
  message: 'RolloutAborted: Rollout aborted update to revision 1'
  observedGeneration: '2'
  phase: Degraded
  readyReplicas: 1
  replicas: 2
  selector: app=app
  stableRS: 5b7fdc469c
  updatedReplicas: 2

And here is a view from argo rollouts CLI

Name:            app
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 1
Strategy:        Canary
  Step:          18/18
  SetWeight:     0
  ActualWeight:  100
Images:         app:latest (stable)
Replicas:
  Desired:       2
  Current:       2
  Updated:       2
  Ready:         2
  Available:     2

NAME                                          KIND        STATUS      AGE    INFO
⟳ app                            Rollout     ✖ Degraded  143m
└──# revision:1
   └──⧉ app-5b7fdc469c           ReplicaSet  ✔ Healthy   143m   stable
      ├──□ app-5b7fdc469c-7mtwq  Pod         ✔ Running   9m39s  ready:2/2
      └──□ app-5b7fdc469c-vbsmg  Pod         ✔ Running   8m4s   ready:2/2

@huikang
Member

huikang commented Nov 11, 2021

@agill17, could you help test whether PR #1649 works for you? Thanks.

@bpoland
Contributor Author

bpoland commented Nov 11, 2021

Sorry for the delayed response, but yes I am using workloadRef. Do you think that's related to the issue?

Do you think the above PR will fix the issue when pods are going not ready and no rollout is in progress?

@huikang
Member

huikang commented Nov 11, 2021

Sorry for the delayed response, but yes I am using workloadRef. Do you think that's related to the issue?

No problem. Whether you use workloadRef doesn't matter, since I can reproduce the error for rollouts both with and without workloadRef.

Do you think the above PR will fix the issue when pods are going not ready and no rollout is in progress?

Yes, this PR should fix the issue. Please let me know what you find. Thanks.
