
fix:scale down not available stableRS #735

Closed
wants to merge 1 commit

Conversation

wangzhipeng

When the first deployed version of a Rollout ends up in CrashLoopBackOff, its replicas are never scaled down after a runnable version is rolled out.

@codecov-commenter

Codecov Report

Merging #735 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #735   +/-   ##
=======================================
  Coverage   83.04%   83.05%           
=======================================
  Files          95       95           
  Lines        7946     7950    +4     
=======================================
+ Hits         6599     6603    +4     
  Misses        946      946           
  Partials      401      401           
Impacted Files | Coverage Δ
utils/replicaset/canary.go | 78.44% <100.00%> (+0.52%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed86374...4b05b57.

@jessesuen
Member

@wangzhipeng thanks for your contribution. Just so I can understand the bug a bit better, can you provide the steps to reproduce the issue? I think I may want to write an e2e test for this later.

@wangzhipeng
Author

Create revision1.yaml; its container command (pin) fails, so the pods crash:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ping
spec:
  replicas: 2
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: ping
  template:
    metadata:
      labels:
        app: ping
    spec:
      containers:
      - name: ping
        image: alpine
        command:
        - pin
        args:
        - baidu.com
  strategy:
    canary:
      steps:
      - setWeight: 100
kubectl apply -f revision1.yaml

Fix the command by editing revision1.yaml into revision2.yaml (pin → ping):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ping
spec:
  replicas: 2
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: ping
  template:
    metadata:
      labels:
        app: ping
    spec:
      containers:
      - name: ping
        image: alpine
        command:
        - ping
        args:
        - baidu.com
  strategy:
    canary:
      steps:
      - setWeight: 100
kubectl apply -f revision2.yaml

After this, the revision 1 ReplicaSet of ping still has 2 replicas; it is never scaled down.

@jessesuen
Member

Thanks. This is a good find. I was able to reproduce the problem and to write an e2e test to catch the issue. However, I have some questions I need to ask @dthomson25 regarding the implementation, to see if this is the right way to fix the problem.

stableRSReplicasForScaleDown := GetReplicasForScaleDown(stableRS)
scaleDownCount := GetReplicasForScaleDown(newRS) + stableRSReplicasForScaleDown + totalAvailableOlderReplicaCount - minAvailableReplicaCount

if scaleDownCount <= 0 && stableRSReplicasForScaleDown == 0 {

Although this works for the scenario you provided, I don't think this is a complete fix, because stableRSReplicasForScaleDown == 0 is just one case; it would be possible to have stableRSReplicasForScaleDown > 0 and still face the problem, e.g. if for the initial version 1 out of 2 pods was available but the other was CrashLoopBackOff-ing.
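
To make the reported scenario concrete, here is a rough walk-through of the arithmetic above as a standalone sketch; the minAvailableReplicaCount of 2 is an assumption (replicas: 2 with maxUnavailable rounding down to 0), and the variable names are illustrative only, not the controller's code:

package main

import "fmt"

func main() {
	// Assumed walk-through of the reported scenario (illustrative values only):
	// revision 1 (stable) has 2 pods, all CrashLoopBackOff-ing;
	// revision 2 (new) has 2 pods, all available.
	newRSForScaleDown := int32(2)    // GetReplicasForScaleDown(newRS): 2 available pods
	stableRSForScaleDown := int32(0) // GetReplicasForScaleDown(stableRS): 0 available pods
	olderAvailable := int32(0)       // no other old ReplicaSets
	minAvailable := int32(2)         // assumed: replicas (2) minus maxUnavailable (0)

	scaleDownCount := newRSForScaleDown + stableRSForScaleDown + olderAvailable - minAvailable
	fmt.Println(scaleDownCount)
	// Prints 0: the controller concludes nothing can be scaled down without dropping
	// below minAvailable, so the crashing stable ReplicaSet keeps its 2 pods.
	// The PR's extra "stableRSReplicasForScaleDown == 0" check is what special-cases
	// this fully-unavailable stable state.
}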

@jessesuen
Copy link
Member

@dthomson25 - the current implementation of GetReplicasForScaleDown returns either Spec.Replicas, or Status.AvailableReplicas:

// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	return rs.Status.AvailableReplicas
}

However, it's possible to have a situation where Spec.Replicas >= Status.Replicas > Status.AvailableReplicas, in which case we would choose Status.AvailableReplicas.

The real world example is a stableRS in this state:

spec:
  replicas: 2
status:
  replicas: 2
  availableReplicas: 0

Here we would say there are zero replicas available in stable to scale down, even though status.replicas: 2
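
To make that concrete, here is a small standalone sketch that mirrors the quoted logic with plain values; it is not the controller's function, and the helper name is made up for illustration:

package main

import "fmt"

// replicasForScaleDown mirrors the quoted logic with plain values, for illustration only.
func replicasForScaleDown(specReplicas, availableReplicas int32) int32 {
	if specReplicas < availableReplicas {
		// already scaling down; don't count the extra available pods
		return specReplicas
	}
	return availableReplicas
}

func main() {
	// Stable ReplicaSet from the example above: spec.replicas: 2, status.availableReplicas: 0.
	fmt.Println(replicasForScaleDown(2, 0))
	// Prints 0: the two existing-but-crashing stable pods contribute nothing to scaleDownCount.
}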

My question is: why isn't GetReplicasForScaleDown just always returning spec.replicas? I tried making the change but then unit tests started failing.

One change I made which did pass unit tests was:

// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}

But I'm not sure it was the correct solution.

@dthomson25
Copy link
Member

Based on https://github.com/argoproj/argo-rollouts/pull/141/files, I believe GetReplicasForScaleDown sometimes returns the RS's Status.AvailableReplicas to make sure it honors the Rollout's maxUnavailable value, and I think we need to make sure that the fix still upholds maxUnavailable. I think part of the issue is that ReplicaSet 1 is being set as stable without ever becoming healthy.

@jessesuen
Copy link
Member

Quoting my earlier comment, "One change I made which did pass unit tests was":

The "fix" I attempted below, while it passes unit tests, is entirely unsafe and is an incorrect solution:

	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}

It fails to honor maxSurge and maxUnavailable and undoes the fix in #141
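
For illustration, here is a standalone sketch of why counting not-yet-available pods can violate maxUnavailable; all numbers here are assumed, not taken from the PR:

package main

import "fmt"

func main() {
	// Assumed scenario, for illustration only: rollout replicas = 2, maxUnavailable = 0,
	// so minAvailableReplicaCount = 2.
	// newRS:    2 desired, 1 available (one pod still starting)
	// stableRS: 2 desired, 2 available
	minAvailable := int32(2)

	// Counting only available replicas (the behavior #141 introduced):
	scaleDownCount := int32(1) + int32(2) - minAvailable // = 1
	// Scaling the stable RS down by 1 leaves 1 + 1 = 2 available pods: minAvailable holds.

	// With the variant above, the newRS counts as Status.Replicas (2) instead of 1:
	unsafeCount := int32(2) + int32(2) - minAvailable // = 2
	// Scaling the stable RS down by 2 leaves only 1 actually-available pod,
	// which drops below minAvailable and breaks the maxUnavailable guarantee from #141.

	fmt.Println(scaleDownCount, unsafeCount)
}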

@jessesuen
Copy link
Member

OK, after looking into this very deeply for several days and after extensive testing, I can say this PR doesn't cover all scenarios:

  • it only works when the stableRS is 100% completely unavailable (availableReplicas == 0)
  • when the new logic kicks in, it scales the canary/stable counts immediately to the desired setWeight without honoring maxSurge/Unavailable, which it should still do even if availableReplicas == 0

I've filed an alternative fix here: #739, which addresses the above.

@wangzhipeng
Copy link
Author

@jessesuen Good, closing this PR.
