ProgressDeadlineExceeded: ReplicaSet has timed out progressing #3457

Open
2 tasks done
dodwmd opened this issue Mar 21, 2024 · 3 comments

Labels
bug Something isn't working

Comments

dodwmd commented Mar 21, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

We have a number of Rollouts with analysis templates attached. The Rollout reports that the ReplicaSet has timed out, but the ReplicaSet itself reports that it is healthy. Restarting the ReplicaSet seems to fix the problem. We also used to have revisionHistoryLimit set to 5 and found that removing one of the unused ReplicaSets would sometimes get the Rollout to see that everything was fine.

To Reproduce

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: XXXXX
  namespace: XXXXX
spec:
  progressDeadlineAbort: true
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: XXXXX
  strategy:
    canary:
      analysis:
        analysisRunMetadata: {}
        args:
          - name: service-name
            value: XXXXX-service-v1
        startingStep: 1
        templates:
          - templateName: XXXXX
      canaryMetadata:
        labels:
          rollout-status: canary
      canaryService: XXXXX-canary
      maxSurge: 25%
      maxUnavailable: 0
      stableMetadata:
        labels:
          rollout-status: stable
      stableService: XXXXXX-service
      steps:
        - setWeight: 10
        - pause: {}
      trafficRouting:
        nginx:
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: true
          stableIngress: XXXXXX
  template:
    spec:
      containers:
        - image: 'gcr.io/XXXXXXXXX'
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '20'
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-liveness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: XXXXX
          ports:
            - containerPort: 8080
              name: http-server
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 1Gi

Expected behavior

The Rollout should reflect the ReplicaSet's current state.

Screenshots

From rollout:

  conditions:
    - lastTransitionTime: '2024-03-19T04:12:57Z'
      lastUpdateTime: '2024-03-19T04:12:57Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-03-19T15:44:55Z'
      lastUpdateTime: '2024-03-19T15:44:55Z'
      message: Rollout is not healthy
      reason: RolloutHealthy
      status: 'False'
      type: Healthy
    - lastTransitionTime: '2024-03-21T00:07:38Z'
      lastUpdateTime: '2024-03-21T00:07:38Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2024-03-21T00:07:59Z'
      lastUpdateTime: '2024-03-21T00:07:59Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2024-03-21T00:18:30Z'
      lastUpdateTime: '2024-03-21T00:18:30Z'
      message: ReplicaSet "XXXXX-6645b789db" has timed out progressing.
      reason: ProgressDeadlineExceeded
      status: 'False'
      type: Progressing
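
As a rough cross-check, the timestamps in the conditions above can be compared with `progressDeadlineSeconds: 600` from the spec. The sketch below only does the arithmetic; the helper name `seconds_between` is made up for illustration, and it is an assumption (not something the report confirms) that the `Completed` transition is the relevant starting point for the deadline.

```python
from datetime import datetime

# Timestamp layout used in the Rollout status above (UTC, second precision).
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def seconds_between(start: str, end: str) -> int:
    """Return the whole seconds elapsed between two UTC timestamps."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return int(delta.total_seconds())


# Values copied from the spec and conditions in this report.
PROGRESS_DEADLINE_SECONDS = 600
completed_at = "2024-03-21T00:07:59Z"          # Completed condition
deadline_exceeded_at = "2024-03-21T00:18:30Z"  # Progressing condition

elapsed = seconds_between(completed_at, deadline_exceeded_at)
print(elapsed, elapsed > PROGRESS_DEADLINE_SECONDS)
```

With these values the gap comes out at 631 seconds, slightly more than the 600-second deadline, so the timeout condition was recorded roughly one deadline period after the Rollout was marked completed.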

From ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    rollout.argoproj.io/desired-replicas: '2'
    rollout.argoproj.io/ephemeral-metadata: >-
....
  name: XXXXX-6645b789db
  namespace: XXXXX
....
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 4
  readyReplicas: 2
  replicas: 2
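
The mismatch described here (Rollout says `ProgressDeadlineExceeded`, ReplicaSet says all replicas are available) can be detected mechanically. The following is a minimal sketch working on plain dictionaries shaped like the status blocks above; the function name `find_stale_timeout` is hypothetical, and in practice the dictionaries would have to be loaded from the cluster by whatever means you prefer.

```python
from typing import Optional


def find_stale_timeout(rollout_status: dict, rs_status: dict, desired: int) -> Optional[str]:
    """Return the timeout message if the Rollout reports ProgressDeadlineExceeded
    while the ReplicaSet already has all desired replicas ready and available."""
    timed_out = next(
        (
            c
            for c in rollout_status.get("conditions", [])
            if c.get("type") == "Progressing"
            and c.get("status") == "False"
            and c.get("reason") == "ProgressDeadlineExceeded"
        ),
        None,
    )
    if timed_out is None:
        return None
    rs_healthy = (
        rs_status.get("availableReplicas", 0) >= desired
        and rs_status.get("readyReplicas", 0) >= desired
    )
    return timed_out["message"] if rs_healthy else None


# Data copied from the status blocks in this report.
rollout_status = {
    "conditions": [
        {"type": "Completed", "status": "True", "reason": "RolloutCompleted"},
        {
            "type": "Progressing",
            "status": "False",
            "reason": "ProgressDeadlineExceeded",
            "message": 'ReplicaSet "XXXXX-6645b789db" has timed out progressing.',
        },
    ]
}
rs_status = {"availableReplicas": 2, "readyReplicas": 2, "replicas": 2}

print(find_stale_timeout(rollout_status, rs_status, desired=2))
```

A non-empty result flags a Rollout whose timeout condition disagrees with the ReplicaSet it names, which is the state shown in this report.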

Version

image: 'quay.io/argoproj/argo-rollouts:v1.6.6'

Logs

time="2024-03-21T00:30:26Z" level=warning msg="ReplicaSet \"XXXXX-6645b789db\" has timed out progressing." event_reason=RolloutAborted namespace=XXX rollout=XXXXX

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

dodwmd added the bug label Mar 21, 2024
@vadasambar

We have seen this happen with the blue-green strategy (no analysis templates were used).

time="2024-05-01T04:09:27Z" level=error msg="rollout syncHandler error: failed to reconcileBlueGreenReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileBlueGreenStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset xxx-xxx: Operation cannot be fulfilled on replicasets.apps \"xxx-xxx\": the object has been modified; please apply your changes to the latest version and try again" namespace=xxx-xxx rollout=xxx-xxx

time="2024-05-01T04:09:27Z" level=info msg="rollout syncHandler queue retries: 40 : key \"xxx-xxx/xxxx\"" namespace=xxx-xxx rollout=xxx-xxx

It seems like the informer cache is not updated soon enough.
This happens only for specific rollout resources.
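
For readers unfamiliar with the "object has been modified" error: Kubernetes rejects an update whose `resourceVersion` does not match the stored object, and the usual remedy is to re-read the latest version and retry. The sketch below simulates that pattern with an in-memory store. It is not the Argo Rollouts controller's code; `FakeReplicaSetStore` and `scale_with_retry` are hypothetical names, and the integer `resourceVersion` is a simplification (the real field is an opaque string).

```python
class ConflictError(Exception):
    """Raised when an update is based on an outdated resourceVersion."""


class FakeReplicaSetStore:
    """In-memory stand-in for the API server (hypothetical, for illustration)."""

    def __init__(self, replicas: int, resource_version: int):
        self._obj = {"replicas": replicas, "resourceVersion": resource_version}

    def get(self) -> dict:
        """Return a copy of the latest stored object."""
        return dict(self._obj)

    def update(self, obj: dict) -> dict:
        """Accept the update only if it was based on the latest version."""
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise ConflictError("the object has been modified")
        self._obj = {
            "replicas": obj["replicas"],
            "resourceVersion": obj["resourceVersion"] + 1,
        }
        return dict(self._obj)


def scale_with_retry(store, replicas: int, cached: dict, max_attempts: int = 5):
    """Try to scale starting from a possibly stale cached copy; on conflict,
    re-read the latest object from the store and try again."""
    obj = dict(cached)
    for attempt in range(1, max_attempts + 1):
        obj["replicas"] = replicas
        try:
            return store.update(obj), attempt
        except ConflictError:
            obj = store.get()  # refresh from the source of truth, not the cache
    raise ConflictError(f"still conflicting after {max_attempts} attempts")


store = FakeReplicaSetStore(replicas=0, resource_version=7)
stale_copy = {"replicas": 0, "resourceVersion": 5}  # what an outdated cache might hold
updated, attempts = scale_with_retry(store, replicas=2, cached=stale_copy)
print(updated, attempts)
```

In this simulation the first attempt conflicts and the second succeeds because the retry reads the latest object. If every retry instead read from a cache that had not caught up, the conflict would repeat, which would be consistent with the 40 queue retries in the log above, though that is a guess and not something the logs confirm.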

kaiburjack commented Jul 1, 2024

We are also seeing this with a blue/green Rollout that has an AnalysisTemplate set. Restarting resolves the issue, but it resurfaces at irregular intervals. I suspect that this always happens when the Rollout did at some point progress to all pods being ready, but some pods were later killed (evicted, or terminated because of an imminent node shutdown; we are using spot nodes), and that the killing happened after the Rollout's progress deadline had already elapsed.

OneCricketeer commented Sep 28, 2024

We have no AnalysisTemplates

"have 5 revisionHistoryLimit and found that removing one of the unused replicasets would sometimes get the rollout to see that everything was fine."

In our case, deleting all of the old ReplicaSets, starting from the oldest, fixed this status. Perhaps deleting only the most recent one was needed, but we wanted to be safe.
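
The "oldest first" ordering for this workaround can be sketched as follows. This only computes the order; it does not delete anything, and the ReplicaSet names and timestamps are invented for illustration. The function name `old_replicasets_oldest_first` is hypothetical as well.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def old_replicasets_oldest_first(replicasets: list, current_name: str) -> list:
    """Return the names of all ReplicaSets except the current one,
    ordered by creationTimestamp from oldest to newest."""
    old = [rs for rs in replicasets if rs["name"] != current_name]
    old.sort(key=lambda rs: datetime.strptime(rs["creationTimestamp"], TS_FORMAT))
    return [rs["name"] for rs in old]


# Invented example data; real values would come from the cluster.
replicasets = [
    {"name": "example-ccc", "creationTimestamp": "2024-09-20T10:00:00Z"},
    {"name": "example-aaa", "creationTimestamp": "2024-08-01T09:30:00Z"},
    {"name": "example-current", "creationTimestamp": "2024-09-27T14:00:00Z"},
    {"name": "example-bbb", "creationTimestamp": "2024-09-05T16:45:00Z"},
]

print(old_replicasets_oldest_first(replicasets, current_name="example-current"))
```

Excluding the current ReplicaSet by name is a deliberate safety choice in this sketch, so that only ReplicaSets no longer serving traffic are candidates for removal.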
