
progressDeadlineAbort causes degraded Rollout outside of a deployment #1624

Closed
bpoland opened this issue Nov 4, 2021 · 13 comments · Fixed by #1649
Labels:    bug (Something isn't working), cherry-pick/release-1.1

Comments

@bpoland
Contributor

bpoland commented Nov 4, 2021

Summary

It seems that with progressDeadlineAbort: true, some Rollouts randomly get marked as degraded when no deployment is in progress. This has been happening repeatedly since I turned on progressDeadlineAbort (and stops happening when I turn it back off).

When the issue happens, the Rollout is left like this:

Name:            name
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 216
Strategy:        Canary
  Step:          20/20
  SetWeight:     0
  ActualWeight:  100
Images:          image:tag (stable)
Replicas:
  Desired:       3
  Current:       3
  Updated:       3
  Ready:         3
  Available:     3

NAME                                    KIND         STATUS        AGE  INFO
⟳ name                                 Rollout      ✖ Degraded    22d
├──# revision:216
│  ├──⧉ name-7bc889bb44                ReplicaSet   ✔ Healthy     64m  stable
│  │  ├──□ name-7bc889bb44-79cvk       Pod          ✔ Running     63m  ready:2/2
│  │  ├──□ name-7bc889bb44-5nwqw       Pod          ✔ Running     59m  ready:2/2
│  │  └──□ name-7bc889bb44-nqqt8       Pod          ✔ Running     53m  ready:2/2
│  └──α name-7bc889bb44-216            AnalysisRun  ✔ Successful  64m  ✔ 92
├──# revision:215
│  ├──⧉ name-5f5d69cb8                 ReplicaSet   • ScaledDown  98m
│  └──α name-5f5d69cb8-215             AnalysisRun  ✔ Successful  98m  ✔ 108
└──# revision:214
   ├──⧉ name-99f85845f                 ReplicaSet   • ScaledDown  21h
   └──α name-99f85845f-214             AnalysisRun  ✔ Successful  21h  ✔ 103

Luckily, with the changes in 1.1, having SetWeight: 0 does not actually disrupt traffic, but it is still concerning. I would expect progressDeadlineAbort not to abort after the deployment is complete.
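For reference, the option under discussion is set on the Rollout spec roughly like this (a minimal sketch only; progressDeadlineSeconds and progressDeadlineAbort are the actual spec fields, while the name, replica count, and steps are placeholders, and the selector/template are omitted):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: name
spec:
  replicas: 3
  # how long the Progressing condition may stall before it is considered timed out (default 600s)
  progressDeadlineSeconds: 600
  # the option this issue is about: automatically abort the update when the deadline is exceeded
  progressDeadlineAbort: true
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause:
            duration: 60
  # selector and template omitted for brevity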

Diagnostics

Controller 1.1

Here are the redacted logs for a rollout that had the issue. You can see that the rollout gets aborted due to the progress deadline at 2021-11-04T14:46:54Z, nearly half an hour after the rollout completed at 2021-11-04T14:18:15Z.

https://gist.github.com/bpoland/537129feb598ae208a01463aeb347464


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@bpoland added the bug (Something isn't working) label Nov 4, 2021
@harikrongali
Contributor

@bpoland can you also post the output of

kubectl get rollout name -o yaml
and
kubectl get rs name-7bc889bb44  -o yaml

@bpoland
Contributor Author

bpoland commented Nov 5, 2021

Sorry, I didn't grab the exact one matching the above, but the only difference would be the image used (which is redacted anyway :)).

The rollout/rs below were in the same state when I grabbed them.

rollout yaml: https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05
replicaset yaml: https://gist.github.com/bpoland/8b8bc50c91a05b4f239c7a4bc6a8a30f

I've been investigating more and noticed that our containers occasionally go not-ready or move around between k8s nodes, which I think causes the rollout to go back into a Progressing state even when there is no deployment in progress. Is it possible that if the rollout stays below full readiness for an entire progress deadline because pods are not ready, that would cause this issue?
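(A quick way to confirm that readiness is flapping and pods are moving between nodes is to watch the pods directly; the app=name label selector here is a placeholder for whatever labels the Rollout's pods actually carry:)

kubectl get pods -l app=name -o wide --watch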

@harikrongali
Contributor

Are the pods going from ready --> not ready? I need to validate that use case; it's possible that pods going from ready to not ready are causing the issue. Probably an edge case.

One question: does the rollout become healthy again after the pods become ready, or does it stay degraded after that?

@bpoland
Contributor Author

bpoland commented Nov 5, 2021

Yes, some pods seem to be going ready, then not ready, then ready again. It seems like that might be causing the auto-abort.

Once it gets into a degraded state because of the auto-abort, it stays degraded even after all pods are ready again. Clicking the "retry" button in the dashboard (or, I assume, doing the same with the CLI) immediately makes the rollout healthy again, but that shouldn't be necessary -- the auto-abort should only kick in while a canary is actually in progress.
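(For reference, the CLI equivalent of the dashboard retry button should be the kubectl plugin's retry command; the rollout name and namespace here are placeholders:)

kubectl argo rollouts retry rollout name -n default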

@jessesuen
Member

Yes, I think that must be the case. progressDeadlineAbort probably doesn't account for a long-running Rollout suddenly becoming unready outside of an upgrade window, which makes it immediately go Degraded.

/cc @huikang

@bpoland
Contributor Author

bpoland commented Nov 8, 2021

Maybe the Rollout should not be marked as Progressing when a deployment is not in progress, and a pod goes non-ready? That also seems a little strange to me, and might be related?

@agill17

agill17 commented Nov 9, 2021

I may be noticing a similar behavior with progressDeadlineAbort enabled. In my case, sometimes when I rollout restart (to recycle pods) and the pods become unhealthy (because the probes fail while the app is not fully up and running yet), the restart goes through successfully but the rollout is marked degraded. Another thing I noticed after the restart was done and the rollout was marked degraded: the rollouts controller kept adding and removing the scale-down deadline annotation on the RS (and was stuck in that loop until I deleted the entire rollout object, which is not great).

Below are the logs that continue after the pods have successfully restarted: the rollout is marked degraded, and the stable RS keeps getting updated to add/remove the scale-down deadline annotation.

time="2021-11-09T18:58:31Z" level=info msg="Scale down new rs 'app-5b7fdc469c' on abort (30s)" namespace=default rollout=app
time="2021-11-09T18:58:31Z" level=info msg="Enqueueing parent of default/app-5b7fdc469c: Rollout default/app"
time="2021-11-09T18:58:31Z" level=info msg="Set 'scale-down-deadline' annotation on 'app-5b7fdc469c' to 2021-11-09T18:59:01Z (30s)" namespace=default rollout=app

@huikang
Member

huikang commented Nov 9, 2021

Maybe the Rollout should not be marked as Progressing when a deployment is not in progress, and a pod goes non-ready? That also seems a little strange to me, and might be related?

@bpoland, are you using workloadRef in your rollout, since you mentioned a deployment is not in progress?

UPDATE: I think @bpoland uses workloadRef by looking at the spec https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05#file-rollout-yaml-L151

@agill17 are you facing the issue with workloadRef too?

@huikang closed this as completed Nov 9, 2021
@huikang reopened this Nov 9, 2021
@huikang
Member

huikang commented Nov 9, 2021

The logic to abort a timed-out rollout is:

// If condition is changed and ProgressDeadlineAbort is set, abort the update
if condChanged {
	if c.rollout.Spec.ProgressDeadlineAbort {
		c.pauseContext.AddAbort(msg)
		c.recorder.Warnf(c.rollout, record.EventOptions{EventReason: conditions.RolloutAbortedReason}, msg)
	}
} else {
	// Although the condition is unchanged, ProgressDeadlineAbort can be set after
	// an existing update timeout. In this case, if the update is not aborted, we need to abort.
	if c.rollout.Spec.ProgressDeadlineAbort && c.pauseContext != nil && !c.pauseContext.IsAborted() {
		c.pauseContext.AddAbort(msg)
		c.recorder.Warnf(c.rollout, record.EventOptions{EventReason: conditions.RolloutAbortedReason}, msg)
	}
}

Looking into how to reproduce the error.
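One possible direction for a fix (a minimal, self-contained sketch only, not necessarily what #1649 implements; the types and function below are simplified stand-ins for the real rollout structs and controller code) is to only honor ProgressDeadlineAbort while an update is actually in progress, i.e. while the newest ReplicaSet has not yet been promoted to stable:

package main

import "fmt"

// Simplified stand-ins for the relevant parts of the Rollout object.
type RolloutSpec struct {
	ProgressDeadlineAbort bool
}

type RolloutStatus struct {
	StableRS       string // pod-template hash of the stable ReplicaSet
	CurrentPodHash string // pod-template hash of the desired (newest) ReplicaSet
}

type Rollout struct {
	Spec   RolloutSpec
	Status RolloutStatus
}

// shouldAbortOnProgressDeadline returns true only if the abort option is set
// and an update is actually in progress (the newest ReplicaSet is not yet the
// stable one). A fully promoted rollout that merely loses pod readiness would
// not be aborted under this check.
func shouldAbortOnProgressDeadline(ro Rollout) bool {
	if !ro.Spec.ProgressDeadlineAbort {
		return false
	}
	updateInProgress := ro.Status.StableRS == "" || ro.Status.StableRS != ro.Status.CurrentPodHash
	return updateInProgress
}

func main() {
	// Fully promoted rollout (stable == current): progress deadline abort is skipped.
	promoted := Rollout{
		Spec:   RolloutSpec{ProgressDeadlineAbort: true},
		Status: RolloutStatus{StableRS: "7bc889bb44", CurrentPodHash: "7bc889bb44"},
	}
	fmt.Println(shouldAbortOnProgressDeadline(promoted)) // false

	// Mid-update (stable != current): the abort can still fire.
	updating := Rollout{
		Spec:   RolloutSpec{ProgressDeadlineAbort: true},
		Status: RolloutStatus{StableRS: "7bc889bb44", CurrentPodHash: "5f5d69cb8"},
	}
	fmt.Println(shouldAbortOnProgressDeadline(updating)) // true
}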

@agill17

agill17 commented Nov 9, 2021

@agill17 are you facing the issue with workloadRef too?

I am not using a Deployment; I am using the Rollout object directly.
Here is the rollout definition from the cluster (after restarting). As soon as I restarted the rollout, it went into a degraded state. Even after the pods became healthy, the rollout is still marked "Degraded". And the RS that was restarted has the following annotation being added/removed in a loop: scale-down-deadline: '2021-11-09T21:23:17Z'

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    rollout.argoproj.io/revision: '1'
  creationTimestamp: '2021-11-09T19:01:15Z'
  generation: 2
  labels:
    app: app
  name: app
  namespace: default
spec:
  progressDeadlineAbort: true
  replicas: 2
  restartAt: '2021-11-09T21:14:40Z'
  selector:
    matchLabels:
      app: app
  strategy:
    canary:
      analysis:
        args:
          - name: appName
            value: app
          - name: canaryHash
            valueFrom:
              podTemplateHashValue: Latest
        templates:
          - clusterScope: true
            templateName: some-template
      steps:
        - setWeight: 10
        - pause:
            duration: 60
        - setWeight: 90
        - pause:
            duration: 60
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: app-canary
            name: app
            stableSubsetName: app-stable
          virtualService:
            name: app
            routes:
              - app-canary
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - env:
            - name: foo
              value: bar
          image: 'app:latest'
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 7
          name: app
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 7
      serviceAccountName: app
      terminationGracePeriodSeconds: 10
status:
  HPAReplicas: 2
  abort: true
  abortedAt: '2021-11-09T21:14:41Z'
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 5b7fdc469c
        weight: 0
      stable:
        podTemplateHash: 5b7fdc469c
        weight: 100
  conditions:
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: Rollout aborted update to revision 1
      reason: RolloutAborted
      status: 'False'
      type: Progressing
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'False'
      type: Completed
    - lastTransitionTime: '2021-11-09T21:14:41Z'
      lastUpdateTime: '2021-11-09T21:14:41Z'
      message: Rollout does not have minimum availability
      reason: AvailableReason
      status: 'False'
      type: Available
  currentPodHash: 5b7fdc469c
  currentStepHash: 79c8f5b474
  currentStepIndex: 18
  message: 'RolloutAborted: Rollout aborted update to revision 1'
  observedGeneration: '2'
  phase: Degraded
  readyReplicas: 1
  replicas: 2
  selector: app=app
  stableRS: 5b7fdc469c
  updatedReplicas: 2

And here is a view from argo rollouts CLI

Name:            app
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 1
Strategy:        Canary
  Step:          18/18
  SetWeight:     0
  ActualWeight:  100
Images:         app:latest (stable)
Replicas:
  Desired:       2
  Current:       2
  Updated:       2
  Ready:         2
  Available:     2

NAME                                          KIND        STATUS      AGE    INFO
⟳ app                            Rollout     ✖ Degraded  143m
└──# revision:1
   └──⧉ app-5b7fdc469c           ReplicaSet  ✔ Healthy   143m   stable
      ├──□ app-5b7fdc469c-7mtwq  Pod         ✔ Running   9m39s  ready:2/2
      └──□ app-5b7fdc469c-vbsmg  Pod         ✔ Running   8m4s   ready:2/2

@huikang
Member

huikang commented Nov 11, 2021

@agill17, could you help test whether PR #1649 works for you? Thanks.

@bpoland
Contributor Author

bpoland commented Nov 11, 2021

Sorry for the delayed response, but yes I am using workloadRef. Do you think that's related to the issue?

Do you think the above PR will fix the issue when pods are going not ready and no rollout is in progress?

@huikang
Member

huikang commented Nov 11, 2021

Sorry for the delayed response, but yes I am using workloadRef. Do you think that's related to the issue?

No problem. Whether you use workloadRef doesn't matter, since I can reproduce the error for rollouts both with and without workloadRef.

Do you think the above PR will fix the issue when pods are going not ready and no rollout is in progress?

Yes, this PR should fix the issue. Please let me know what you find. Thanks.
