fix:scale down not available stableRS #735
Conversation
Codecov Report
```diff
@@           Coverage Diff            @@
##           master     #735    +/-   ##
========================================
  Coverage   83.04%   83.05%
========================================
  Files          95       95
  Lines        7946     7950      +4
========================================
+ Hits         6599     6603      +4
  Misses        946      946
  Partials      401      401
```
Continue to review full report at Codecov.
@wangzhipeng thanks for your contribution. Just so I can understand the bug a bit better, can you provide the steps to reproduce the issue? I think I may want to write an e2e test for this later.
Create revision1.yaml (the pods of this first revision crash). Then fix it: edit revision1.yaml into revision2.yaml and apply it. At this point the revision1 ReplicaSet still has 2 replicas and is never scaled down.
Thanks. This is a good find. I was able to reproduce the problem and wrote an e2e test to catch the issue. However, I have some questions for @dthomson25 about the implementation, to see if this is the right way to fix the problem.
```go
stableRSReplicasForScaleDown := GetReplicasForScaleDown(stableRS)
scaleDownCount := GetReplicasForScaleDown(newRS) + stableRSReplicasForScaleDown + totalAvailableOlderReplicaCount - minAvailableReplicaCount

if scaleDownCount <= 0 && stableRSReplicasForScaleDown == 0 {
```
Although this works for the scenario you provided, I don't think this is a complete fix, because `stableRSReplicasForScaleDown == 0` is just one case, and I think it would be possible for `stableRSReplicasForScaleDown > 0` and we would still face the problem. E.g. for the initial version, 1 out of 2 pods were available, but the other was CrashLoopBackoff-ing.
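To make that case concrete, here is a rough sketch with assumed numbers (not code from this PR): a stable ReplicaSet that declares 2 replicas with only 1 available would report 1 replica for scale down, so the added `== 0` check never fires even though one pod is stuck.

```go
package main

import "fmt"

// Rough sketch with assumed numbers: the stable ReplicaSet declares 2 replicas,
// but only 1 is available because the other pod is CrashLoopBackoff-ing.
func main() {
	specReplicas := int32(2)
	availableReplicas := int32(1)

	// Mirrors the helper's behavior (full implementation shown in the next
	// comment): when spec.replicas >= status.availableReplicas it returns
	// the available count.
	stableRSReplicasForScaleDown := availableReplicas
	if specReplicas < availableReplicas {
		stableRSReplicasForScaleDown = specReplicas
	}

	fmt.Println(stableRSReplicasForScaleDown)      // 1
	fmt.Println(stableRSReplicasForScaleDown == 0) // false: the added guard does not apply,
	// yet the crashing stable pod is still never accounted for.
}
```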
@dthomson25 - the current implementation of `GetReplicasForScaleDown` returns either `Spec.Replicas` or `Status.AvailableReplicas`:

```go
// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller would use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	return rs.Status.AvailableReplicas
}
```

However, it's possible to have a situation where `Spec.Replicas >= Status.Replicas > Status.AvailableReplicas`, in which case we would choose `Status.AvailableReplicas`. The real-world example is a spec:

```yaml
spec:
  replicas: 2
status:
  replicas: 2
  availableReplicas: 0
```

Here we would say there are zero replicas available in stable to scale down, even though the ReplicaSet still has 2 replicas. My question is: why isn't `Status.Replicas` ever considered for scale down? One change I made which did pass unit tests was:

```go
// GetReplicasForScaleDown returns the number of replicas to consider for scaling down.
func GetReplicasForScaleDown(rs *appsv1.ReplicaSet, new bool) int32 {
	if rs == nil {
		return int32(0)
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		// The ReplicaSet is already going to scale down replicas since the availableReplica count is bigger
		// than the spec count. The controller uses the .Spec.Replicas to prevent the controller from
		// assuming the extra replicas (availableReplica - .Spec.Replicas) are going to remain available.
		// Otherwise, the controller would use those extra replicas to scale down more replicas and potentially
		// violate the min available.
		return *rs.Spec.Replicas
	}
	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}
```

But I'm not sure it was the correct solution.
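As a side-by-side illustration (a standalone sketch, not code from the repository), here is how the two variants behave on the real-world stable ReplicaSet above (spec.replicas=2, status.replicas=2, status.availableReplicas=0): the current implementation reports 0 replicas for scale down, while the modified one reports 2.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// currentImpl mirrors the existing GetReplicasForScaleDown logic quoted above.
func currentImpl(rs *appsv1.ReplicaSet) int32 {
	if rs == nil {
		return 0
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		return *rs.Spec.Replicas
	}
	return rs.Status.AvailableReplicas
}

// modifiedImpl mirrors the experimental change, which also considers Status.Replicas.
func modifiedImpl(rs *appsv1.ReplicaSet) int32 {
	if rs == nil {
		return 0
	}
	if *rs.Spec.Replicas < rs.Status.AvailableReplicas {
		return *rs.Spec.Replicas
	}
	if rs.Status.AvailableReplicas < rs.Status.Replicas {
		return rs.Status.Replicas
	}
	return rs.Status.AvailableReplicas
}

func main() {
	replicas := int32(2)
	stable := &appsv1.ReplicaSet{}
	stable.Spec.Replicas = &replicas
	stable.Status.Replicas = 2
	stable.Status.AvailableReplicas = 0

	fmt.Println(currentImpl(stable))  // 0: the crashing stable pods are invisible to scale down
	fmt.Println(modifiedImpl(stable)) // 2: the stuck pods are counted as scale-down candidates
}
```

That difference is exactly what would make the stuck stable pods visible to the scale-down logic, and also what raises the maxUnavailable concern discussed below.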
Based on https://github.com/argoproj/argo-rollouts/pull/141/files, I believe that GetReplicasForScaleDown sometimes returns the RS's Status.AvailableReplicas to make sure it honors the Rollout's maxUnavailable value, and I think we need to make sure that the fix still upholds that maxUnavailable. I think part of the issue is that the revision 1 ReplicaSet is being set to stable without it ever becoming healthy.
The "fix" I attempted below (while it passes unit tests), is entirely unsafe, and is an incorrect solution: if rs.Status.AvailableReplicas < rs.Status.Replicas {
return rs.Status.Replicas
}
return rs.Status.AvailableReplicas
} It fails to honor maxSurge and maxUnavailable and undoes the fix in #141 |
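A rough illustration of why (purely assumed numbers, plugged into the scaleDownCount formula from the diff above): counting declared-but-unavailable stable pods inflates the scale-down budget, so genuinely available older replicas can be removed and availability drops below minAvailableReplicaCount.

```go
package main

import "fmt"

// Purely illustrative arithmetic with assumed numbers (not controller code),
// based on the scaleDownCount formula quoted from the diff above.
func main() {
	// Assumed rollout: 3 desired replicas, maxUnavailable 0.
	minAvailableReplicaCount := int32(3)

	newRSAvailable := int32(1)              // canary: 1 pod available so far
	totalAvailableOlderReplicas := int32(2) // older ReplicaSets: 2 available pods
	// Stable RS declares 2 replicas but 0 are available (CrashLoopBackOff).
	stableCurrent := int32(0)  // current GetReplicasForScaleDown: Status.AvailableReplicas
	stableModified := int32(2) // modified version: Status.Replicas

	fmt.Println(newRSAvailable + stableCurrent + totalAvailableOlderReplicas - minAvailableReplicaCount)
	// 0: nothing is scaled down; the 3 truly available pods are preserved.

	fmt.Println(newRSAvailable + stableModified + totalAvailableOlderReplicas - minAvailableReplicaCount)
	// 2: both available older pods may be removed, leaving only 1 available
	// pod out of a required 3, i.e. the min available guarantee is violated.
}
```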
OK, after looking into this very deeply for several days and after extensive testing, I can say this PR doesn't cover all scenarios. I've filed an alternative fix here, #739, which addresses the above.
@jessesuen Good, close this PR.
After a version that goes into CrashLoopBackOff is deployed for the first time, and a runnable version is then rolled out, the replicas of the old (crashing) version are never scaled down.