Canary Rollout with errors when there's an scale event during the experiment #1596
Comments
Hi, @flaviolemos78 I'm trying to understand the issue/reproduce it. Could you help me?
Consider a Rollout with the following spec:
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: stable
  strategy:
    canary:
      maxSurge: 25
      maxUnavailable: 0
      steps:
      - experiment:
          duration: 30m
          templates:
          - metadata:
              labels:
                app.kubernetes.io/instance: experiment
            name: experiment
            replicas: 1
            selector:
              matchLabels:
                app.kubernetes.io/instance: experiment
            specRef: canary
During the rollout, the experiment step starts just fine. If you then edit the Rollout and change the desired replicas to 4 (the same thing the HPA does on a scaling event), the reconciler applies the desired state successfully, but in later reconciliations it enters an error loop when trying to reconcile the experiment on step 0.
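For anyone reproducing this without an HPA, the replica change described above boils down to setting spec.replicas to 4. A minimal sketch in Go that builds such a JSON merge patch body is shown here; the helper name scalePatch is hypothetical and not part of argo-rollouts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scalePatch (hypothetical helper) builds a JSON merge patch body that
// only touches spec.replicas, mimicking what a scale event changes.
func scalePatch(replicas int32) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{"replicas": replicas},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := scalePatch(4)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // prints {"spec":{"replicas":4}}
}
```

The resulting body could be applied with any client that supports merge patches, for example kubectl patch with --type merge against the Rollout resource.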
Fixes #1596 (#1597)
* fix: rollout experiment template changing reference rs template labels
  Signed-off-by: Flavio Lemos <[email protected]>
* docs: Add Farfetch to USERS.md
  Signed-off-by: Flavio Lemos <[email protected]>
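The fix commit title says the experiment template was "changing reference rs template labels". One common way this class of bug shows up in Go is map aliasing: copying a struct by value still shares its label map, so adding experiment labels silently mutates the referenced ReplicaSet template. Whether this is exactly what #1597 changed is an assumption; the sketch below uses a hypothetical, stripped-down PodTemplate type and hypothetical function names, not the controller's real code.

```go
package main

import "fmt"

// PodTemplate is a hypothetical stand-in for a pod template with labels.
type PodTemplate struct {
	Labels map[string]string
}

// experimentTemplateShared copies the struct by value. The Labels map is
// still shared, so writing to it also changes the referenced template.
func experimentTemplateShared(rs PodTemplate, extra map[string]string) PodTemplate {
	tmpl := rs // copies the struct header, not the map contents
	for k, v := range extra {
		tmpl.Labels[k] = v
	}
	return tmpl
}

// experimentTemplateCopied builds a fresh label map, leaving the
// referenced template untouched.
func experimentTemplateCopied(rs PodTemplate, extra map[string]string) PodTemplate {
	labels := make(map[string]string, len(rs.Labels)+len(extra))
	for k, v := range rs.Labels {
		labels[k] = v
	}
	for k, v := range extra {
		labels[k] = v
	}
	return PodTemplate{Labels: labels}
}

func main() {
	const key = "app.kubernetes.io/instance"
	extra := map[string]string{key: "experiment"}

	rsA := PodTemplate{Labels: map[string]string{key: "stable"}}
	experimentTemplateShared(rsA, extra)
	fmt.Println(rsA.Labels[key]) // prints experiment: the reference was mutated

	rsB := PodTemplate{Labels: map[string]string{key: "stable"}}
	experimentTemplateCopied(rsB, extra)
	fmt.Println(rsB.Labels[key]) // prints stable: the reference is intact
}
```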
Summary
When using a canary rollout with experiments, if a scale event occurs during the experiment, the rollout's status.canary.experiment metadata is deleted. On the next reconcile cycle, the canary reconciler tries to (re)create/rescue the experiment, which results in an error. The error recurs on every subsequent reconcile until the canary step is incremented. Because every failed reconcile job is requeued with exponential backoff, in the worst case any rollout change (e.g. an auto-scale event or promoting a canary) can take up to 5 minutes to take effect.
Diagnostics
The errors were observed on argo-rollouts 0.10.2 and on 1.1.0.
rollout.status.canary.experiment is deleted after a scale event. Then, on the next reconcile, the controller tries to "rescue" the canary experiment, but it fails when trying to apply the canary ReplicaSet (rs) changes.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.