
fix:scale down not available stableRS #735

Closed
wants to merge 1 commit

Conversation

wangzhipeng

When the first deployed version of a Rollout ends up in CrashLoopBackOff, its replicas are never scaled down after a runnable version is rolled out.

@codecov-commenter

Codecov Report

Merging #735 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #735   +/-   ##
=======================================
  Coverage   83.04%   83.05%           
=======================================
  Files          95       95           
  Lines        7946     7950    +4     
=======================================
+ Hits         6599     6603    +4     
  Misses        946      946           
  Partials      401      401           
Impacted Files | Coverage Δ
utils/replicaset/canary.go | 78.44% <100.00%> (+0.52%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed86374...4b05b57.

@jessesuen
Member

@wangzhipeng thanks for your contribution. Just so I can understand the bug a bit better, can you provide the steps to reproduce the issue? I think I may want to write an e2e test for this later.

@wangzhipeng
Author

Create revision1.yaml; its container command (pin) fails, so the pods crash:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ping
spec:
  replicas: 2
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: ping
  template:
    metadata:
      labels:
        app: ping
    spec:
      containers:
      - name: ping
        image: alpine
        command:
        - pin
        args:
        - baidu.com
  strategy:
    canary:
      steps:
      - setWeight: 100
kubectl apply -f revision1.yaml

Fix the command by editing revision1.yaml into revision2.yaml (pin → ping):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ping
spec:
  replicas: 2
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: ping
  template:
    metadata:
      labels:
        app: ping
    spec:
      containers:
      - name: ping
        image: alpine
        command:
        - ping
        args:
        - baidu.com
  strategy:
    canary:
      steps:
      - setWeight: 100
kubectl apply -f revision2.yaml

After this, the revision 1 ReplicaSet of ping still has 2 replicas; it is never scaled down.

@jessesuen
Member

Thanks. This is a good find. I was able to reproduce the problem and to write an e2e test to catch the issue. However, I have some questions I need to ask @dthomson25 regarding the implementation, to see if this is the right way to fix the problem.

stableRSReplicasForScaleDown := GetReplicasForScaleDown(stableRS)
scaleDownCount := GetReplicasForScaleDown(newRS) + stableRSReplicasForScaleDown + totalAvailableOlderReplicaCount - minAvailableReplicaCount

if scaleDownCount <= 0 && stableRSReplicasForScaleDown == 0 {

Although this works for the scenario you provided, I don't think this is a complete fix, because stableRSReplicasForScaleDown == 0 is just one case; it would be possible to have stableRSReplicasForScaleDown > 0 and still face the problem, e.g. if for the initial version 1 out of 2 pods was available but the other was CrashLoopBackOff-ing.
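
To make the reported scenario concrete, here is a rough walk-through of the arithmetic above as a standalone sketch; the minAvailableReplicaCount of 2 is an assumption (replicas: 2 with maxUnavailable rounding down to 0), and the variable names are illustrative only, not the controller's code:

package main

import "fmt"

func main() {
	// Assumed walk-through of the reported scenario (illustrative values only):
	// revision 1 (stable) has 2 pods, all CrashLoopBackOff-ing;
	// revision 2 (new) has 2 pods, all available.
	newRSForScaleDown := int32(2)    // GetReplicasForScaleDown(newRS): 2 available pods
	stableRSForScaleDown := int32(0) // GetReplicasForScaleDown(stableRS): 0 available pods
	olderAvailable := int32(0)       // no other old ReplicaSets
	minAvailable := int32(2)         // assumed: replicas (2) minus maxUnavailable (0)

	scaleDownCount := newRSForScaleDown + stableRSForScaleDown + olderAvailable - minAvailable
	fmt.Println(scaleDownCount)
	// Prints 0: the controller concludes nothing can be scaled down without dropping
	// below minAvailable, so the crashing stable ReplicaSet keeps its 2 pods.
	// The PR's extra "stableRSReplicasForScaleDown == 0" check is what special-cases
	// this fully-unavailable stable state.
}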

@jessesuen
Copy link
Member

@dthomson25 - the current implementation of GetReplicasForScaleDown returns either Spec.Replicas, or Status.AvailableReplicas:

// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	return rs.Status.AvailableReplicas
}

However, it's possible to have a situation where Spec.Replicas >= Status.Replicas > Status.AvailableReplicas, in which case we would choose Status.AvailableReplicas.

The real world example is a stableRS in this state:

spec:
  replicas: 2
status:
  replicas: 2
  availableReplicas: 0

Here we would say there are zero replicas available in stable to scale down, even though status.replicas: 2
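
To make that concrete, here is a small standalone sketch that mirrors the quoted logic with plain values; it is not the controller's function, and the helper name is made up for illustration:

package main

import "fmt"

// replicasForScaleDown mirrors the quoted logic with plain values, for illustration only.
func replicasForScaleDown(specReplicas, availableReplicas int32) int32 {
	if specReplicas < availableReplicas {
		// already scaling down; don't count the extra available pods
		return specReplicas
	}
	return availableReplicas
}

func main() {
	// Stable ReplicaSet from the example above: spec.replicas: 2, status.availableReplicas: 0.
	fmt.Println(replicasForScaleDown(2, 0))
	// Prints 0: the two existing-but-crashing stable pods contribute nothing to scaleDownCount.
}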

My question is: why isn't GetReplicasForScaleDown just always returning spec.replicas? I tried making the change but then unit tests started failing.

One change I made which did pass unit tests was:

// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}

But I'm not sure it was the correct solution.

@dthomson25
Copy link
Member

Based on https://github.com/argoproj/argo-rollouts/pull/141/files, I believe GetReplicasForScaleDown sometimes returns the RS's Status.AvailableReplicas to make sure it honors the Rollout's maxUnavailable value, and I think we need to make sure that the fix still upholds maxUnavailable. I think part of the issue is that ReplicaSet 1 is being set as stable without ever becoming healthy.

@jessesuen
Copy link
Member

Quoting my earlier comment, "One change I made which did pass unit tests was":

The "fix" I attempted below, while it passes unit tests, is entirely unsafe and is an incorrect solution:

	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}

It fails to honor maxSurge and maxUnavailable and undoes the fix in #141
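
For illustration, here is a standalone sketch of why counting not-yet-available pods can violate maxUnavailable; all numbers here are assumed, not taken from the PR:

package main

import "fmt"

func main() {
	// Assumed scenario, for illustration only: rollout replicas = 2, maxUnavailable = 0,
	// so minAvailableReplicaCount = 2.
	// newRS:    2 desired, 1 available (one pod still starting)
	// stableRS: 2 desired, 2 available
	minAvailable := int32(2)

	// Counting only available replicas (the behavior #141 introduced):
	scaleDownCount := int32(1) + int32(2) - minAvailable // = 1
	// Scaling the stable RS down by 1 leaves 1 + 1 = 2 available pods: minAvailable holds.

	// With the variant above, the newRS counts as Status.Replicas (2) instead of 1:
	unsafeCount := int32(2) + int32(2) - minAvailable // = 2
	// Scaling the stable RS down by 2 leaves only 1 actually-available pod,
	// which drops below minAvailable and breaks the maxUnavailable guarantee from #141.

	fmt.Println(scaleDownCount, unsafeCount)
}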

@jessesuen
Copy link
Member

OK, after looking into this very deeply for several days and after extensive testing, I can say this PR doesn't cover all scenarios:

  • it only works when the stableRS is 100% completely unavailable (availableReplicas == 0)
  • when the new logic kicks in, it scales the canary/stable counts immediately to the desired setWeight without honoring maxSurge/Unavailable, which it should still do even if availableReplicas == 0

I've filed an alternative fix here: #739, which addresses the above.

@wangzhipeng
Copy link
Author

@jessesuen Good, closing this PR.
