ProgressDeadlineExceeded: ReplicaSet has timed out progressing #3457

dodwmd · 2024-03-21T00:52:27Z

Checklist:

I've included steps to reproduce the bug.
I've included the version of argo rollouts.

Describe the bug

We have a number of Rollouts with AnalysisTemplates attached. The Rollout reports that the ReplicaSet has timed out progressing, but the ReplicaSet itself shows that it is healthy. Restarting the ReplicaSet seems to fix the problem. We also used to set revisionHistoryLimit to 5 and found that removing one of the unused ReplicaSets would sometimes get the Rollout to see that everything was fine.

To Reproduce

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: XXXXX
  namespace: XXXXX
spec:
  progressDeadlineAbort: true
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: XXXXX
  strategy:
    canary:
      analysis:
        analysisRunMetadata: {}
        args:
          - name: service-name
            value: XXXXX-service-v1
        startingStep: 1
        templates:
          - templateName: XXXXX
      canaryMetadata:
        labels:
          rollout-status: canary
      canaryService: XXXXX-canary
      maxSurge: 25%
      maxUnavailable: 0
      stableMetadata:
        labels:
          rollout-status: stable
      stableService: XXXXXX-service
      steps:
        - setWeight: 10
        - pause: {}
      trafficRouting:
        nginx:
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: true
          stableIngress: XXXXXX
  template:
    spec:
      containers:
        - image: 'gcr.io/XXXXXXXXX'
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '20'
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-liveness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: XXXXX
          ports:
            - containerPort: 8080
              name: http-server
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 1Gi

Expected behavior

Rollout to reflect the replicaset's current state

Screenshots

From rollout:

  conditions:
    - lastTransitionTime: '2024-03-19T04:12:57Z'
      lastUpdateTime: '2024-03-19T04:12:57Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-03-19T15:44:55Z'
      lastUpdateTime: '2024-03-19T15:44:55Z'
      message: Rollout is not healthy
      reason: RolloutHealthy
      status: 'False'
      type: Healthy
    - lastTransitionTime: '2024-03-21T00:07:38Z'
      lastUpdateTime: '2024-03-21T00:07:38Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2024-03-21T00:07:59Z'
      lastUpdateTime: '2024-03-21T00:07:59Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2024-03-21T00:18:30Z'
      lastUpdateTime: '2024-03-21T00:18:30Z'
      message: ReplicaSet "XXXXX-6645b789db" has timed out progressing.
      reason: ProgressDeadlineExceeded
      status: 'False'
      type: Progressing

From ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    rollout.argoproj.io/desired-replicas: '2'
    rollout.argoproj.io/ephemeral-metadata: >-
....
  name: XXXXX-6645b789db
  namespace: XXXXX
....
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 4
  readyReplicas: 2
  replicas: 2
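
The mismatch between the two objects above can be checked mechanically. The following is a minimal sketch, not part of Argo Rollouts: the helper names `find_condition` and `is_stale_timeout` are hypothetical, and the dictionaries simply copy the relevant fields from the status output in this report.

```python
def find_condition(conditions, cond_type):
    """Return the condition dict with the given type, or None if absent."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def is_stale_timeout(rollout_conditions, rs_status, desired):
    """True when the Rollout reports ProgressDeadlineExceeded although the
    ReplicaSet already has the desired number of ready and available pods."""
    progressing = find_condition(rollout_conditions, "Progressing")
    timed_out = (
        progressing is not None
        and progressing.get("status") == "False"
        and progressing.get("reason") == "ProgressDeadlineExceeded"
    )
    rs_healthy = (
        rs_status.get("readyReplicas", 0) >= desired
        and rs_status.get("availableReplicas", 0) >= desired
    )
    return timed_out and rs_healthy


# Values copied from the Rollout conditions and ReplicaSet status in this issue.
rollout_conditions = [
    {"type": "Available", "status": "True", "reason": "AvailableReason"},
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
]
rs_status = {"availableReplicas": 2, "readyReplicas": 2, "replicas": 2}

mismatch = is_stale_timeout(rollout_conditions, rs_status, desired=2)
print(mismatch)
```

A check like this could be run against the output of `kubectl get rollout` and `kubectl get replicaset` with `-o json` to find other Rollouts stuck in the same state.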

Version

image: 'quay.io/argoproj/argo-rollouts:v1.6.6'

Logs

time="2024-03-21T00:30:26Z" level=warning msg="ReplicaSet \"XXXXX-6645b789db\" has timed out progressing." event_reason=RolloutAborted namespace=XXX rollout=XXXXX

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


vadasambar · 2024-05-13T07:02:15Z

We have seen this happen with blue green strategy (no analysis templates were used).

time="2024-05-01T04:09:27Z" level=error msg="rollout syncHandler error: failed to reconcileBlueGreenReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileBlueGreenStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset xxx-xxx: Operation cannot be fulfilled on replicasets.apps \"xxx-xxx\": the object has been modified; please apply your changes to the latest version and try again" namespace=xxx-xxx rollout=xxx-xxx

time="2024-05-01T04:09:27Z" level=info msg="rollout syncHandler queue retries: 40 : key \"xxx-xxx/xxxx\"" namespace=xxx-xxx rollout=xxx-xxx

It seems like the informer cache is not updated soon enough.
This happens only for specific rollout resources.
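
For readers unfamiliar with the "object has been modified" error: it is the API server's optimistic-concurrency check rejecting a write that carries an outdated resourceVersion. The sketch below is a toy simulation of that mechanism, assuming a stale cached copy is what gets sent. `FakeAPIServer`, `scale_from_cache`, and `scale_with_refetch` are hypothetical names for illustration, not Kubernetes or Argo Rollouts APIs.

```python
class Conflict(Exception):
    """Raised when an update carries an outdated resourceVersion."""


class FakeAPIServer:
    """Hypothetical in-memory stand-in for the API server (not a real client)."""

    def __init__(self, replicas):
        self.obj = {"resourceVersion": 1, "replicas": replicas}

    def get(self):
        return dict(self.obj)

    def update(self, obj):
        # Reject writes based on an older version of the object.
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self.obj = {
            "resourceVersion": obj["resourceVersion"] + 1,
            "replicas": obj["replicas"],
        }
        return dict(self.obj)


def scale_from_cache(server, cached, replicas):
    """Scale using a cached copy; fails whenever the cache lags behind."""
    return server.update({**cached, "replicas": replicas})


def scale_with_refetch(server, replicas, attempts=3):
    """Re-read the live object before each attempt instead of trusting a cache."""
    for _ in range(attempts):
        live = server.get()
        try:
            return server.update({**live, "replicas": replicas})
        except Conflict:
            continue
    raise Conflict("gave up after retries")


server = FakeAPIServer(replicas=2)
cached = server.get()                            # controller's cached view
server.update({**server.get(), "replicas": 3})   # another writer bumps the version

try:
    scale_from_cache(server, cached, 0)
    stale_write_failed = False
except Conflict:
    stale_write_failed = True

result = scale_with_refetch(server, 0)
print(stale_write_failed, result)
```

If the cached copy never catches up, every retry from the cache fails the same way, which would be consistent with the retry counter of 40 in the log above. Whether that is what actually happens in the controller is not confirmed in this thread.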

kaiburjack · 2024-07-01T21:24:32Z

We are also seeing this with a blue/green Rollout with an AnalysisTemplate set. Restarting resolves the issue, but it resurfaces at irregular intervals. My tentative guess is that this always happens when the Rollout did at some point progress to all pods being ready, but some pods were then killed (evicted, or lost to an imminent node shutdown; we are using spot nodes), and that the killing happened after the Rollout's progress deadline had already elapsed.
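
To make the hypothesis concrete, here is a minimal sketch of how a deadline check could misfire in that scenario. It is only an illustration of the guess above, not Argo Rollouts' actual controller logic. The function `timed_out` and the eviction time are hypothetical; the 600-second deadline and the completion timestamp are borrowed from the original report.

```python
from datetime import datetime, timedelta


def timed_out(last_progress, now, ready, desired, deadline_seconds):
    """Naive check: not fully ready and no recorded progress within the deadline."""
    if ready >= desired:
        return False
    return now - last_progress > timedelta(seconds=deadline_seconds)


deadline = 600  # progressDeadlineSeconds from the Rollout in the original report
completed_at = datetime(2024, 3, 21, 0, 7, 59)      # when the Rollout completed
evicted_at = completed_at + timedelta(seconds=900)  # hypothetical pod eviction

# All pods ready: never considered timed out.
healthy = timed_out(completed_at, evicted_at - timedelta(seconds=1), 2, 2, deadline)

# One pod evicted after the deadline, timer still anchored at completion:
# the check fires immediately, before the replacement pod can become ready.
stale_timer = timed_out(completed_at, evicted_at, 1, 2, deadline)

# Same eviction, but the timer restarts when readiness drops.
reset_timer = timed_out(evicted_at, evicted_at, 1, 2, deadline)

print(healthy, stale_timer, reset_timer)
```

If something like the stale-timer case is happening, the timeout would appear only when pods are lost more than progressDeadlineSeconds after the last recorded progress, which would match the irregular intervals described above.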

OneCricketeer · 2024-09-28T03:12:13Z

We have no AnalysisTemplates.

"have 5 revisionHistoryLimit and found that removing one of the unused replicasets would sometimes get the rollout to see that everything was fine."

In our case, deleting all the old ReplicaSets, starting from the oldest, fixed this status. Perhaps deleting only the most recent one was needed, but we wanted to be safe.

dodwmd added the bug label Mar 21, 2024

tbronchain mentioned this issue Jul 11, 2024: fix(dashboard): Update pod status logic to support native sidecars. Fixes #3366 #3639 (Merged)

chicocvenancio mentioned this issue Oct 8, 2024: fix(controller): resync replicaset if spec.replicas differs #3880 (Closed)
