Promote full did not work against BlueGreen with previewReplicaCount #1383

Closed
alexmt opened this issue Jul 30, 2021 · 3 comments · Fixed by #1384
Labels
bug Something isn't working

alexmt commented Jul 30, 2021

Summary

Rollout stuck in the Degraded state during a blue-green rollout.

Diagnostics

The rollout paused (as expected) during the blue-green update. It was not approved within the progressDeadlineSeconds timeout and switched to the Degraded state. A promote was then executed using the Argo CD action, but nothing happened: the rollout stayed in the Degraded state.

Based on the logs, the rollout was unpaused, but the controller kept trying to scale the new ReplicaSet to previewReplicaCount instead of spec.replicas.

  strategy:
    blueGreen:
      activeService: paaf-active-service
      antiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution: {}
      autoPromotionEnabled: false
      previewReplicaCount: 1
      previewService: paaf-preview-service
      scaleDownDelayRevisionLimit: 2
      scaleDownDelaySeconds: 300
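
The expected scaling behavior for the new ReplicaSet can be sketched roughly as follows. This is only an illustration of the behavior described in this issue, not the controller's actual code: the function name and its parameters are hypothetical, and spec.replicas=3 is an assumed value (the spec excerpt does not show it).

```python
from typing import Optional


def desired_preview_replicas(
    spec_replicas: int,
    preview_replica_count: Optional[int],
    promote_full: bool,
) -> int:
    """Hypothetical helper: replicas the new (preview) ReplicaSet should have.

    Assumption taken from this issue: previewReplicaCount should only cap the
    new ReplicaSet before promotion. Once a full promotion is requested, the
    new ReplicaSet should be scaled to spec.replicas.
    """
    if preview_replica_count is not None and not promote_full:
        return preview_replica_count
    return spec_replicas


# previewReplicaCount=1 matches the strategy above; spec.replicas=3 is assumed.
print(desired_preview_replicas(3, 1, promote_full=False))  # before promotion
print(desired_preview_replicas(3, 1, promote_full=True))   # after promote full
```

The symptom reported here corresponds to the first branch still being taken even though promoteFull is true.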

status:
  HPAReplicas: 3
  availableReplicas: 3
  blueGreen:
    activeSelector: 77954699c5
    previewSelector: 6599fdc6d
  canary: {}
  conditions:
    - lastTransitionTime: '2021-07-30T08:17:29Z'
      lastUpdateTime: '2021-07-30T08:17:29Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'False'
      type: Completed
    - lastTransitionTime: '2021-07-30T08:45:37Z'
      lastUpdateTime: '2021-07-30T08:45:37Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2021-07-30T09:17:29Z'
      lastUpdateTime: '2021-07-30T09:17:29Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2021-07-30T09:39:12Z'
      lastUpdateTime: '2021-07-30T09:39:12Z'
      message: ReplicaSet "reducted-rollout-6599fdc6d" has timed out progressing.
      reason: ProgressDeadlineExceeded
      status: 'False'
      type: Progressing
  currentPodHash: 6599fdc6d
  message: >-
    ProgressDeadlineExceeded: ReplicaSet "reducted-rollout-6599fdc6d" has timed out
    progressing.
  observedGeneration: '22'
  phase: Degraded
  promoteFull: true
  readyReplicas: 3
  replicas: 4
  restartedAt: '2021-07-30T09:07:56Z'
  selector: 'app=reducted,rollouts-pod-template-hash=77954699c5'
  stableRS: 77954699c5
  updatedReplicas: 1

Version: v1.0.2

The Rollout has previewReplicaCount set.

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>


alexmt added the bug label Jul 30, 2021

jessesuen commented

> Rollout got paused (as expected) during blue-green. It was not approved within progressDeadlineSeconds timeout and switched into Degraded state.

This is not supposed to happen.

jessesuen commented Jul 30, 2021

> It was not approved within progressDeadlineSeconds timeout and switched into Degraded state.

We determined that the "Degraded" state was reported because of outdated client code (the Argo CD Rollout health check) and an outdated kubectl plugin. The rollout was not truly degraded, despite the incorrect status shown in Argo CD.

Here is the sequence of Kubernetes events, which shows that the rollout was never degraded from the perspective of the rollout controller.

  | time="2021-07-30T08:17:29Z" level=info msg="Rollout updated to revision 5" event_reason=RolloutUpdated namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T08:17:29Z" level=info msg="Created ReplicaSet redacted-rollout-6599fdc6d (revision 5) with size 1" event_reason=NewReplicaSetCreated namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T08:17:29Z" level=info msg="Switched selector for service 'paaf-preview-service' from '77954699c5' to '6599fdc6d'" event_reason=SwitchService namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T08:26:16Z" level=info msg="Rollout is paused (BlueGreenPause)" event_reason=RolloutPaused namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T08:45:37Z" level=info msg="Rollout is resumed" event_reason=RolloutResumed namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:15:43Z" level=info msg="Scaled down ReplicaSet redacted-rollout-6599fdc6d (revision 5) from 3 to 1" event_reason=ScalingReplicaSet namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:16:47Z" level=info msg="Scaled down ReplicaSet redacted-rollout-6599fdc6d (revision 5) from 3 to 1" event_reason=ScalingReplicaSet namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:38:17Z" level=info msg="Rollout completed update to revision 5 (6599fdc6d): Initial deploy" event_reason=RolloutCompleted namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:38:17Z" level=info msg="Scaled up ReplicaSet redacted-rollout-6599fdc6d (revision 5) from 1 to 3" event_reason=ScalingReplicaSet namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:43:17Z" level=info msg="Scaled down ReplicaSet redacted-rollout-77954699c5 (revision 4) from 3 to 0" event_reason=ScalingReplicaSet namespace=redacted-namespace rollout=redacted-rollout
  | time="2021-07-30T15:45:59Z" level=info msg="Switched selector for service 'paaf-active-service' from '' to '6599fdc6d'" event_reason=SwitchService namespace=redacted-namespace rollout=redacted-rollout
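
To read a sequence like the one above as a timeline, the key=value log lines can be parsed with a few lines of Python. This is a minimal sketch, not part of Argo Rollouts: the regular expression and the function name are assumptions that fit the log format shown here, and the sample lines are copied from the events above.

```python
import re

# Matches key=value pairs where the value is either double-quoted or a bare token.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Parse one key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


lines = [
    '  | time="2021-07-30T08:26:16Z" level=info msg="Rollout is paused (BlueGreenPause)" '
    'event_reason=RolloutPaused namespace=redacted-namespace rollout=redacted-rollout',
    '  | time="2021-07-30T08:45:37Z" level=info msg="Rollout is resumed" '
    'event_reason=RolloutResumed namespace=redacted-namespace rollout=redacted-rollout',
]

# Print a compact timeline: timestamp, event reason, message.
for line in lines:
    entry = parse_logfmt(line)
    print(entry["time"], entry["event_reason"], entry["msg"])
```

Listing only the `event_reason` values this way makes it easier to confirm which state transitions the controller actually recorded.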

We also determined that there is a bug in the previewReplicaCount feature: the controller does not properly scale up the preview stack after a full promotion.

jessesuen changed the title from "BlueGreen rollout stuck in Degraded state" to "Promote full did not work against BlueGreen with previewReplicaCount" Jul 30, 2021

eilonmonday commented

@jessesuen after upgrading to 1.0.4, I get the following error:
E0801 13:11:51.164664 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:argo-rollouts:argo-rollouts" cannot list resource "configmaps" in API group "" in the namespace "argo-rollouts"

This was originally reported in issue #1359.
