
Canary rollouts with a failed AnalysisRun immediately terminate all canary pods and ignore maxUnavailable #777

Closed
abatilo opened this issue Oct 13, 2020 · 2 comments · Fixed by #786
Assignees: jessesuen
Labels: bug, cherry-pick/release-0.9, regression


abatilo commented Oct 13, 2020

Summary

When doing a canary rollout with maxUnavailable: 0, a failed AnalysisRun causes all of the canary pods to be terminated immediately, even though the previous (stable) revision has almost no available pods yet.
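To spell out the availability contract that maxUnavailable: 0 implies, here is a hedged back-of-the-envelope sketch using the replica counts from the output below; none of this is Argo Rollouts code, and the variable names are illustrative only:

// Hedged sketch, not Argo Rollouts code: what maxUnavailable: 0 means for the
// repro Rollout, using the replica counts shown in the output below.
package main

import "fmt"

func main() {
	desiredReplicas := 10 // "Desired: 10" in the rollout status below
	maxUnavailable := 0   // the canary strategy's maxUnavailable setting
	minAvailable := desiredReplicas - maxUnavailable // at least 10 pods must stay available

	stableAvailable := 1 // stable ReplicaSet pods available right after the abort
	canaryAvailable := 9 // canary pods that were ready:1/1 before being terminated

	fmt.Println("required available:", minAvailable)                                   // 10
	fmt.Println("available if the canary were kept:", stableAvailable+canaryAvailable) // 10
	fmt.Println("available after terminating the canary:", stableAvailable)            // 1
}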

Diagnostics

Versions of all tools:

argo-rollouts 0.9.1
helm 3.2.1
kind 0.8.1
kubectl 1.18.2

Repro yaml can be found in the gist applied by the script below.

Steps to repro:

#!/bin/bash
# Create a local test cluster and install the Argo Rollouts controller via Helm.
kind create cluster
helm repo add argo https://argoproj.github.io/argo-helm
helm upgrade --install argo-rollouts argo/argo-rollouts --version 0.3.7
sleep 10
# Apply the repro manifests from the gist.
kubectl apply -f https://gist.githubusercontent.com/abatilo/e98186261db7e555f962c9657e9abd2c/raw/b98d5e779e428901af9a708eba3b63012e869069/argo-rollouts-issue-repro.yml
sleep 60
# Trigger a new canary revision.
kubectl argo rollouts set image nginx nginx=nginx:1.19

The moment the AnalysisRun fails, you'll see:

⇒  kubectl argo rollouts get rollout nginx

Name:            nginx
Namespace:       default
Status:          ✖ Degraded
Strategy:        Canary
  Step:          0/2
  SetWeight:     0
  ActualWeight:  0
Images:          nginx:1.18 (stable)
Replicas:
  Desired:       10
  Current:       10
  Updated:       0
  Ready:         1
  Available:     1

NAME                                                                KIND         STATUS               AGE  INFO
⟳ nginx                                                             Rollout      ✖ Degraded           97s
├──# revision:2
│  ├──⧉ nginx-56696d5c7b                                            ReplicaSet   • ScaledDown         37s  canary
│  │  ├──□ nginx-56696d5c7b-88k94                                   Pod          ◌ Terminating        37s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-k757l                                   Pod          ◌ Terminating        37s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-xkpbt                                   Pod          ◌ Terminating        37s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-tjnt5                                   Pod          ◌ Terminating        29s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-nw7r6                                   Pod          ◌ Terminating        27s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-d68h9                                   Pod          ◌ Terminating        25s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-lfmb8                                   Pod          ◌ Terminating        23s  ready:1/1
│  │  ├──□ nginx-56696d5c7b-w2qb7                                   Pod          ◌ Terminating        21s  ready:1/1
│  │  └──□ nginx-56696d5c7b-6mgzb                                   Pod          ◌ Terminating        19s  ready:1/1
│  └──α nginx-56696d5c7b-2-1                                        AnalysisRun  ✖ Failed             7s   ✖ 1
│     └──⊞ 816cb747-8913-478c-a2ec-4509d505ce0b.integrationtests.1  Job          ✖ Failed             7s
└──# revision:1
   └──⧉ nginx-7b54fb9677                                            ReplicaSet   ◌ Progressing        77s  stable
      ├──□ nginx-7b54fb9677-64hxc                                   Pod          ◌ Terminating        77s  ready:0/1
      ├──□ nginx-7b54fb9677-lxbzd                                   Pod          ✔ Running            77s  ready:1/1
      ├──□ nginx-7b54fb9677-2rvtk                                   Pod          ◌ ContainerCreating  2s   ready:0/1
      ├──□ nginx-7b54fb9677-66qcn                                   Pod          ◌ ContainerCreating  2s   ready:0/1
      ├──□ nginx-7b54fb9677-6b7p9                                   Pod          ◌ ContainerCreating  2s   ready:0/1
      ├──□ nginx-7b54fb9677-bmjdw                                   Pod          ◌ Pending            2s   ready:0/1
      ├──□ nginx-7b54fb9677-frz9q                                   Pod          ◌ ContainerCreating  2s   ready:0/1
      ├──□ nginx-7b54fb9677-hks7c                                   Pod          ◌ Pending            2s   ready:0/1
      ├──□ nginx-7b54fb9677-krtbt                                   Pod          ◌ ContainerCreating  2s   ready:0/1
      ├──□ nginx-7b54fb9677-n552b                                   Pod          ◌ Pending            2s   ready:0/1
      └──□ nginx-7b54fb9677-tkpkt                                   Pod          ◌ ContainerCreating  2s   ready:0/1
⇒  kubectl get replicaset
NAME                       DESIRED   CURRENT   READY   AGE
argo-rollouts-5866b5c57c   1         1         1       98s
nginx-56696d5c7b           0         0         0       37s
nginx-7b54fb9677           10        10        1       77s
⇒  kubectl get pods
NAME                                                            READY   STATUS              RESTARTS   AGE
816cb747-8913-478c-a2ec-4509d505ce0b.integrationtests.1-h888z   0/1     Error               0          8s
argo-rollouts-5866b5c57c-lmkxc                                  1/1     Running             0          99s
nginx-56696d5c7b-6mgzb                                          1/1     Terminating         0          20s
nginx-56696d5c7b-88k94                                          1/1     Terminating         0          38s
nginx-56696d5c7b-d68h9                                          1/1     Terminating         0          26s
nginx-56696d5c7b-k757l                                          1/1     Terminating         0          38s
nginx-56696d5c7b-lfmb8                                          1/1     Terminating         0          24s
nginx-56696d5c7b-nw7r6                                          1/1     Terminating         0          28s
nginx-56696d5c7b-tjnt5                                          1/1     Terminating         0          30s
nginx-56696d5c7b-w2qb7                                          1/1     Terminating         0          22s
nginx-56696d5c7b-xkpbt                                          1/1     Terminating         0          38s
nginx-7b54fb9677-2rvtk                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-64hxc                                          0/1     Terminating         0          78s
nginx-7b54fb9677-66qcn                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-6b7p9                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-bmjdw                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-frz9q                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-hks7c                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-krtbt                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-lxbzd                                          1/1     Running             0          78s
nginx-7b54fb9677-n552b                                          0/1     ContainerCreating   0          3s
nginx-7b54fb9677-tkpkt                                          0/1     ContainerCreating   0          3s

All of the canary pods are terminated immediately, and you're left with close to zero ready pods.

Paste the logs from the rollout controller

Logs for the entire controller:

⇒  kubectl logs deployment/argo-rollouts | gist -d "Argo Rollout logs"
https://gist.github.com/a4ff10e0c07013f824594f7d84b21c4e

Logs for a specific rollout:

⇒  kubectl logs deployment/argo-rollouts | grep rollout=nginx | gist -d "Argo Rollouts specific rollout logs"
https://gist.github.com/b13fae524c2192da376643a425dadeba


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@abatilo abatilo added the bug Something isn't working label Oct 13, 2020
@jessesuen jessesuen self-assigned this Oct 13, 2020

jessesuen commented Oct 15, 2020

I've identified this as a regression between v0.9.0 and v0.9.1.

Trying to identify a change which may have broken this.

@jessesuen jessesuen added the regression Regression from previous behavior label Oct 15, 2020

jessesuen commented Oct 15, 2020

Looks like my change in GetReplicasForScaleDown is what broke it:

// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
// isStableRS indicates if the supplied ReplicaSet is the stableRS
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, isStableRS bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
	// Otherwise, the controller would use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	if isStableRS && rs.Status.AvailableReplicas < *rs.Spec.Replicas {     // this logic should not apply
		// The stable ReplicaSet might be scaled up, but its pods may be unavailable.
		// In this case we need to return the spec.Replicas so that the controller will still
		// consider scaling down this ReplicaSet. Without this, a rollout update could become stuck
		// not scaling down the stable, in order to make room for more canaries.
		return *rs.Spec.Replicas
	}
	return rs.Status.AvailableReplicas
}

The change was introduced in #739.
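To illustrate the failure mode, here is a hypothetical driver (not from the repo; the import path is an assumption) that feeds the repro's stable ReplicaSet state through the function above:

// Hedged illustration of the flagged branch: the stable ReplicaSet is scaled to 10
// but only 1 pod is actually available, yet the function reports 10 replicas for
// scale-down accounting, which lets the controller terminate every canary pod
// without thinking it violates maxUnavailable: 0.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"

	// Assumed location of the quoted helper; adjust to the actual package path.
	"github.com/argoproj/argo-rollouts/utils/replicaset"
)

func main() {
	ten := int32(10)
	stable := &appsv1.ReplicaSet{
		Spec:   appsv1.ReplicaSetSpec{Replicas: &ten},
		Status: appsv1.ReplicaSetStatus{AvailableReplicas: 1},
	}

	// Hits the isStableRS branch flagged above and returns spec.Replicas (10)
	// instead of the 1 replica that is actually available.
	fmt.Println(replicaset.GetReplicasForScaleDown(stable, true)) // 10
	fmt.Println(stable.Status.AvailableReplicas)                  // 1
}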
