`progressDeadlineAbort` causes degraded Rollout outside of a deployment #1624
Comments
@bpoland can you also post the output of the rollout and its ReplicaSets?
Sorry, I didn't get the exact one matching the above, but the only difference will be the image used (which is redacted anyway :)). The rollout/RS below were in the same state when I grabbed them. Rollout YAML: https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05

I've been investigating more and noticed that our containers are occasionally going not ready and moving around between k8s nodes, which I think causes the rollout to go back into a degraded state.
Are the pods going from ready to not ready? One question: does the rollout become healthy again after the pods become ready, or does it stay degraded after that?
Yes, some pods seem to be going ready, then not ready, then ready again. It seems like that might be causing the auto-abort. Once the rollout gets into a degraded state because of the auto-abort, it stays degraded even after all pods are ready again. Clicking the "retry" button in the dashboard (or, I assume, doing the same with the CLI) immediately makes the rollout healthy again, but that shouldn't be necessary -- the auto-abort should do nothing except when a canary is actually in progress.
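For reference, a degraded Rollout in this state typically carries a `Progressing` condition along these lines; this is a sketch, not output from this cluster (the ReplicaSet name and message wording are illustrative):

```yaml
status:
  conditions:
    - type: Progressing
      status: "False"
      # the timed-out reason that the auto-abort keys off
      reason: ProgressDeadlineExceeded
      message: ReplicaSet "my-app-6f79b855d8" has timed out progressing.
```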
Yes, I think that must be the case. /cc @huikang
Maybe the Rollout should not be marked as degraded in this case.
I may be noticing similar behavior with `progressDeadlineAbort` enabled. In my case, sometimes when I do a rollout restart (to recycle pods) and the pods become unhealthy (because probes fail while the app is not fully up and running yet), the restart goes through successfully but the rollout is marked degraded.

Another thing I noticed after the restart was done and the rollout was marked degraded: the rollouts controller kept adding and removing the scale-down deadline annotation on the RS, and was stuck in that loop until I deleted the entire rollout object, which is not great. The logs that continued after the pods had successfully restarted show the rollout marked degraded while the stable RS keeps getting updated to add/remove the scale-down deadline annotation.
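For context, the annotation the controller keeps adding and removing looks roughly like this on the ReplicaSet (a sketch; the ReplicaSet name and timestamp are illustrative):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-6f79b855d8   # illustrative name
  annotations:
    # added with a future timestamp, then removed again by the
    # controller in the loop described above
    argo-rollouts.argoproj.io/scale-down-deadline: "2021-11-04T15:20:30Z"
```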
@bpoland, are you using `workloadRef` in your rollout? UPDATE: I think @bpoland uses `workloadRef`, judging by the spec: https://gist.github.com/bpoland/d97a019c2b92a9495cbe81036bbaee05#file-rollout-yaml-L151

@agill17, are you facing the issue with `workloadRef` too?
The logic to abort a timed-out rollout is at lines 655 to 667 in dc1c11b. Looking into how to reproduce the error.
I am not using a Deployment; I am using the Rollout object directly.
And here is a view from the Argo Rollouts CLI (e.g. `kubectl argo rollouts get rollout <name>`).
Sorry for the delayed response, but yes, I am using `workloadRef`. Do you think that's related to the issue? Do you think the above PR will fix the issue when pods go not ready while no rollout is in progress?
No problem. Whether you use `workloadRef` or not, the behavior is the same. Yes, this PR should fix the issue. Please let me know what you find. Thanks.
Summary
It seems that with `progressDeadlineAbort: true`, some Rollouts randomly get marked as degraded when no deployment is in progress. This has been happening repeatedly since I turned on `progressDeadlineAbort` (and stops happening when I turn it back off). When the issue happens, the Rollout is left aborted with `SetWeight: 0`.
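For reference, a minimal sketch of the configuration under discussion (the Rollout name and canary steps are illustrative; the two `progressDeadline*` fields are the point here):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                    # illustrative
spec:
  progressDeadlineSeconds: 600    # deadline for making progress (default)
  progressDeadlineAbort: true     # the setting that triggers the auto-abort
  strategy:
    canary:
      steps:
        - setWeight: 50
        - pause: {}
```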
Luckily, with the changes in 1.1, having `SetWeight: 0` does not actually disrupt traffic, but it is still concerning. I would expect `progressDeadlineAbort` not to abort after the deployment is complete.

Diagnostics
Controller 1.1
Here are the redacted logs for a rollout that had the issue. You can see that the rollout gets aborted due to the progress deadline at 2021-11-04T14:46:54Z, nearly half an hour after the rollout completed at 2021-11-04T14:18:15Z:
https://gist.github.com/bpoland/537129feb598ae208a01463aeb347464
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.