Rollout in degraded state when progressDeadlineSeconds < scaleDownDelaySeconds #3414

Closed · 2 tasks done
yohanb opened this issue Feb 29, 2024 · 1 comment · Fixed by #3417

Labels
bug Something isn't working

Comments

yohanb (Contributor) commented Feb 29, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

When using the "fast track rollback" feature (rollbackWindow) and keeping old revisions scaled up for a set amount of time (scaleDownDelaySeconds), I noticed that the Rollout falls into a degraded state with the message ProgressDeadlineExceeded: ReplicaSet "xxx" has timed out progressing.

After investigation, I realized:

  1. It was happening after ~10 minutes, which is the default value for progressDeadlineSeconds.
  2. The Rollout returned to a healthy state once the old revision scaled down after scaleDownDelaySeconds (initially set to 24 hours) elapsed.

I then tried setting progressDeadlineSeconds > scaleDownDelaySeconds (900 seconds and 600 seconds respectively), and that works around the issue.
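
For reference, a minimal sketch of that workaround applied to the manifest below (only the two timing fields are shown; the 900/600 values come from the test above and are illustrative, not a recommendation):

# Workaround sketch: keep the progress deadline at least as long as the
# scale-down delay, so the old ReplicaSet is removed before the deadline fires.
spec:
  progressDeadlineSeconds: 900   # > scaleDownDelaySeconds
  strategy:
    canary:
      scaleDownDelaySeconds: 600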

To Reproduce

Here's an example manifest:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    app.kubernetes.io/instance: argo-rollouts-demo-xyz-v02
  name: argo-rollouts-demo
  namespace: argo-rollouts-demo
spec:
  progressDeadlineSeconds: 300
  revisionHistoryLimit: 3
  rollbackWindow:
    revisions: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: argo-rollouts-demo
  strategy:
    canary:
      canaryService: preview
      scaleDownDelayRevisionLimit: 2
      scaleDownDelaySeconds: 600
      stableService: stable
      steps:
        - setWeight: 20
        - setCanaryScale:
            weight: 100
        - pause: {}
        - setWeight: 80
        - pause:
            duration: 10
      trafficRouting:
        istio:
          virtualService:
            name: argo-rollouts-demo
            routes:
              - primary
  template:
    metadata:
      labels:
        app: prototype
        app.kubernetes.io/name: argo-rollouts-demo
        istio.io/rev: stable
        version: green
    spec:
      containers:
        - image: 'argoproj/rollouts-demo:green'
          name: rollouts-demo
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            requests:
              cpu: 5m
              memory: 32Mi
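
In this manifest the relevant pair is progressDeadlineSeconds: 300 and scaleDownDelaySeconds: 600. My reading (an assumption based on the observations above, not a confirmed root cause) is that roughly 300 seconds after the update, the previous ReplicaSet is still waiting out its scale-down delay, and the controller reports ProgressDeadlineExceeded:

spec:
  progressDeadlineSeconds: 300    # deadline fires after ~5 minutes...
  strategy:
    canary:
      scaleDownDelaySeconds: 600  # ...while the old ReplicaSet stays scaled up for ~10 minutes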

Expected behavior

Once the Rollout is deployed and healthy, I expect to be able to set scaleDownDelaySeconds to any reasonable value without it affecting the Rollout's progress deadline.

Screenshots

[Screenshot: Monosnap core-services 2024-02-29 10-55-49]

[Screenshot: Monosnap core-services 2024-02-29 10-55-06]

After the scale-down delay has passed and the old ReplicaSet pods were removed:
[Screenshot: Monosnap core-services 2024-02-29 11-02-02]

Version

Tried with 1.6.0 and then with 1.6.6

Logs

# Logs for a specific rollout:
time="2024-02-29T15:54:43Z" level=info msg="Started syncing rollout" generation=32 namespace=argo-rollouts-demo resourceVersion=92137099 rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciling 1 old ReplicaSets (total pods: 3)" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 3 available pods in old RS argo-rollouts-demo/argo-rollouts-demo-64d9cb66bc" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 6 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 3)" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="RS 'argo-rollouts-demo-64d9cb66bc' has not reached the scaleDownTime" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="No Steps remain in the canary steps" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Timed out (true) [last progress check: 2024-02-29 15:49:42 +0000 UTC - now: 2024-02-29 15:54:43.002935953 +0000 UTC m=+3776.154484347]" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-02-29T03:25:36Z\",\"lastUpdateTime\":\"2024-02-29T03:25:36Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-02-29T15:48:38Z\",\"lastUpdateTime\":\"2024-02-29T15:48:38Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-02-29T15:49:42Z\",\"lastUpdateTime\":\"2024-02-29T15:49:42Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-02-29T15:49:42Z\",\"lastUpdateTime\":\"2024-02-29T15:49:42Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-02-29T15:54:43Z\",\"lastUpdateTime\":\"2024-02-29T15:54:43Z\",\"message\":\"ReplicaSet \\\"argo-rollouts-demo-5b8f9df7dc\\\" has timed out progressing.\",\"reason\":\"ProgressDeadlineExceeded\",\"status\":\"False\",\"type\":\"Progressing\"}],\"message\":\"ProgressDeadlineExceeded: ReplicaSet \\\"argo-rollouts-demo-5b8f9df7dc\\\" has timed out progressing.\",\"phase\":\"Degraded\"}}" generation=32 namespace=argo-rollouts-demo resourceVersion=92137099 rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="persisted to informer" generation=32 namespace=argo-rollouts-demo resourceVersion=92141430 rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciliation completed" generation=32 namespace=argo-rollouts-demo resourceVersion=92137099 rollout=argo-rollouts-demo time_ms=22.69182
time="2024-02-29T15:54:43Z" level=info msg="Start processing" resource=argo-rollouts-demo/argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Processing completed" resource=argo-rollouts-demo/argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Started syncing rollout" generation=32 namespace=argo-rollouts-demo resourceVersion=92141430 rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciling 1 old ReplicaSets (total pods: 3)" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 3 available pods in old RS argo-rollouts-demo/argo-rollouts-demo-64d9cb66bc" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Found 6 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 3)" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="RS 'argo-rollouts-demo-64d9cb66bc' has not reached the scaleDownTime" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="No Steps remain in the canary steps" namespace=argo-rollouts-demo rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="No status changes. Skipping patch" generation=32 namespace=argo-rollouts-demo resourceVersion=92141430 rollout=argo-rollouts-demo
time="2024-02-29T15:54:43Z" level=info msg="Reconciliation completed" generation=32 namespace=argo-rollouts-demo resourceVersion=92141430 rollout=argo-rollouts-demo time_ms=2.626357

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

yohanb added the bug label on Feb 29, 2024
amazingandyyy (Contributor) commented:

🆙 The degraded state scares users off from using scaleDownDelaySeconds.
