We sometimes observe a problem where a Rollout becomes permanently stuck in a degraded state due to the following error:
Message: RolloutAborted: Rollout aborted update to revision 2: Unable to scale ReplicaSet for template 'canary-preview' to desired replica count '1': Operation cannot be fulfilled on replicasets.apps "podinfo-test-argo-rollouts-55848cf857-2-0-canary-preview": the object has been modified; please apply your changes to the latest version and try again
This sometimes happens when the experiment tries to scale up the preview ReplicaSet. The issue is very rare: we run a lot of tests and have seen it multiple times already, but when I intentionally tried to reproduce it to collect info, the test had to run ~90 times before it happened.
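For context, this is the apiserver's standard optimistic-concurrency conflict: an update was sent based on a stale resourceVersion. client-go ships retry.RetryOnConflict for exactly this situation; below is a minimal sketch with illustrative names, not code taken from the controller:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleWithRetry re-reads the ReplicaSet and retries the update whenever the
// apiserver reports a resourceVersion conflict.
func scaleWithRetry(ctx context.Context, cs kubernetes.Interface, ns, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rs, err := cs.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = cs.AppsV1().ReplicaSets(ns).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}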
Reproduction steps
Apply manifests (attached)
Wait for rollout to become complete
Immediately run kubectl patch rollout podinfo-test-argo-rollouts --type=json '-p=[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"stefanprodan/podinfo:3.1.1"}]'
Pray for reproduction
Diagnostics
Argo Rollouts version is v1.1.1.
k8s version 1.21 (EKS).
After reproducing the issue, I tried to collect every possible piece of debug info: object outputs, controller logs, etc.
All of it is attached.
The underlying issue is that resource conflict errors are a fact of life in Kubernetes, and the experiment controller needs to accommodate them by retrying/re-reconciling the experiment. I think the fix for this should be easy:
func (ec *experimentContext) scaleTemplateRS(rs *appsv1.ReplicaSet, template v1alpha1.TemplateSpec, templateStatus *v1alpha1.TemplateStatus, desiredReplicaCount int32, experimentReplicas int32) {
	...
	_, _, err := ec.scaleReplicaSetAndRecordEvent(rs, desiredReplicaCount)
	if err != nil {
		// check if this is a resource conflict error and don't fail
		templateStatus.Status = v1alpha1.TemplateStatusError
		templateStatus.Message = fmt.Sprintf("Unable to scale ReplicaSet for template '%s' to desired replica count '%v': %v", templateStatus.Name, desiredReplicaCount, err)
	} else {
		...
	}
}
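A minimal sketch of that check, assuming k8s.io/apimachinery/pkg/api/errors is available to the experiment controller; the helper name isRetryableScaleError is illustrative, not existing Argo Rollouts code:

import (
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetryableScaleError reports whether a failed scale should be retried on
// the next reconciliation instead of marking the template as Error.
// IsConflict matches exactly the "the object has been modified; please apply
// your changes to the latest version and try again" failure above.
func isRetryableScaleError(err error) bool {
	return k8serrors.IsConflict(err)
}

In the err != nil branch above, the controller could then return without setting TemplateStatusError whenever isRetryableScaleError(err) is true, letting the next reconciliation retry the scale against the latest ReplicaSet resourceVersion.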
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.