Argo rollouts often hangs with long running deploys #1193

MarkSRobinson · 2021-05-17T18:07:28Z

Summary

When deploying roll outs with a long step time (>10m), Argo rollouts will often hang on the last step.

AR will get to 100% deployed, with all traffic on the new pods but will get stuck tearing down the old pods. Using kubectl plugin, the message Message: old replicas are pending termination will be displayed for a considerable amount of time (10-30 minutes). Eventually Argo will be able to terminate the pods.

Ideally Argo should be able to kill these pods in less time.

Diagnostics

Argo: 0.10.2
K8s: 1.17

There's a lot of log data
https://gist.github.com/MarkSRobinson/e44ce06689aa02dd2d7886482c152fe2

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

The text was updated successfully, but these errors were encountered:

jessesuen · 2021-05-17T19:36:00Z

Are you using scaleDownDelaySeconds?
Which version of the plugin? Is that also 0.10.2 as well?

MarkSRobinson · 2021-05-17T23:26:50Z

I'm not using scaleDownDelaySeconds and the plugin version is v0.10.0+3837eab

MarkSRobinson · 2021-05-17T23:27:19Z

My setup is using canary.

jessesuen · 2021-05-17T23:31:32Z

Are you using canary + traffic routing? There is a fix in v1.0 plugin which may address this:

31afa28#diff-e6177f3a3015ec10cd7a2cfcb6293603125bbfb53bfdbd51f342507d084e6d2e

You can download v1.0-rc1 plugin here:
https://github.com/argoproj/argo-rollouts/releases/tag/v1.0.0-rc1

MarkSRobinson · 2021-05-17T23:40:15Z

Yes, I'm using canary with traffic routing (linkerD).

The problem really isn't with the plugin, it's that Argo can't scale down old pods for a considerable time.

jessesuen · 2021-05-18T00:58:28Z

Argo can't scale down old pods for a considerable time.

Argo should set the spec.replicas of the old ReplicaSet stack to 0 shortly after the promotion of the canary. After that step, it's really up to the Pods to shut themselves down since they will get SIGTERMed by kubelet while honoring the pod's grace period seconds before getting SIGKILLed by kubelet.

If you see that the old ReplicaSet has spec.replicas: 0, then Argo Rollouts has done its job and scaledown time is affected by other factors not controllable by Argo Rollouts. However, if spec.replicas of the old ReplicaSet is non-zero for a long time after the canary promotion, then there may be some type of issue.

FYI, we should emit Kubernetes events for these events so that will aide in the debugging.

MarkSRobinson · 2021-05-18T02:34:17Z

The replica set remains set to non-zero for a considerable amount of time. In fact, I see an error about being unable to update the replicaset at the same time Argo moves to the "waiting for pods to terminate state". I suspect that Argo is trying to update the RS, failing, and then not retrying.

What events do you want me to extract from the event history?

jessesuen · 2021-05-18T09:38:56Z

I suspect that Argo is trying to update the RS, failing, and then not retrying.

This is a good theory, especially if you notice the error in logs. Rollout has a default resync period of 15m, which is actually somewhat excessive and I think the default can be reduced to 5m. To test your theory, a workaround could be to decrease this to a lower value through the CLI option to the controller. e.g. 5 minutes:

rollouts-controller --rollout-resync 300

The above setting will configure it such that if an error updating the ReplicaSet occurs, and for some reason we don't retry immediately (which is the expected behavior), then it will take at most 5m before it attempts it again.

MarkSRobinson · 2021-05-18T16:25:18Z

Would it cause problem to drop the resync down to like 60s?

jessesuen · 2021-05-18T17:45:59Z

Would it cause problem to drop the resync down to like 60s?

The lower the number, the more frequently there are reconciliations. More frequent reconciliations means:

higher CPU for the rollout controller
highter memory requirements (if you increase workers)
potentially more API calls. though, we try to reduce this as much as possible using local cache (i.e. kubernetes informers)

So I would say it depends somewhat on the number of rollouts in the system.

At the end of the day, 60s probably will not cause problems other than additional CPU resources, but I would keep an eye out for potentially more API calls. We have a prometheus metric to track this if you need to measure.

jessesuen · 2021-05-18T17:57:24Z

Let us know if reducing the resync period helped the problem. It will help pinpoint the cause of this bug (e.g. it implies we are not requeuing rollouts on errors properly)

jessesuen · 2021-05-18T18:02:11Z

I actually think this is no longer a problem in v1.0. In v1.0, we introduced a scaleDownDelay for canary+trafficRouting. In v1.0 there is now a (configurable) default scaleDownDelay of 30s before the rollout scales down the old stack.

The reason for leaving the old stack running for 30s, is to give a chance for service meshes & ingress controllers, to adjust/propagate the traffic weight changes which the rollout had made to the underlying network objects. Before scaleDownDelay, we were scaling down the old stack immediately after promoting the canary, which could cause brief 500 errors if the mesh provider hadn't yet fully made the weight changes and the rollout pods of the old stack started shutting down.

In other words, the the whole process of scaling down the old replicaset has changed in v1.0 and so this bug is probably not applicable anymore.

MarkSRobinson · 2021-05-19T17:56:38Z

@jessesuen I've tried setting the rollout-resync to 60s and it hasn't improved the termination latency.

I can investigate upgrading the argo-controller to 1.0-rc. Is there a timeline for a 1.0 release?

jessesuen · 2021-05-21T08:48:08Z

I can investigate upgrading the argo-controller to 1.0-rc. Is there a timeline for a 1.0 release?

That would be helpful! v1.0 is now released!

MarkSRobinson · 2021-06-01T22:35:39Z

Ok, I've tried with the new 1.0.1 release and I'm still seeing the behaviour described above.

MarkSRobinson · 2021-06-02T02:09:50Z

It turns out there is a bug in the use of the workqueue for rollouts. Something is adding the rollout object back to the queue dozens of times. This causes the exponential back-off queue to basically immediately hit the 16 minute limit (!).

I added some logging in utils/controller/controller.go:processNextWorkItem() and got

time="2021-06-02T01:58:47Z" level=error msg="rollout syncHandler obj retries: 55 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

&nbsp; | time="2021-06-02T01:58:41Z" level=error msg="rollout syncHandler obj retries: 28 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

&nbsp; | time="2021-06-02T01:57:43Z" level=error msg="rollout syncHandler obj retries: 3 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

Code I used to check

logCtx.Errorf("%s syncHandler obj retries: %v : key \"%#v\"", objType, workqueue.NumRequeues(key), key)

Something is calling workqueue.AddRateLimited(key) every other second. The result is that the back-off hits the ceiling almost instantly - and the limit is 16 minutes.

MarkSRobinson · 2021-06-02T16:04:11Z

I've got a mitigation for the issue - #1243

…queue. #1193 (#1243) This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout. Signed-off-by: Mark Robinson <[email protected]>

…queue. argoproj#1193 (argoproj#1243) This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout. Signed-off-by: Mark Robinson <[email protected]> Signed-off-by: caoyang001 <[email protected]>

MarkSRobinson · 2021-06-15T23:11:54Z

Fixed in #1243

…queue. argoproj#1193 (argoproj#1243) This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout. Signed-off-by: Mark Robinson <[email protected]> Signed-off-by: caoyang001 <[email protected]>

MarkSRobinson added the bug Something isn't working label May 17, 2021

MarkSRobinson changed the title ~~Argo rollouts often hangs will long running deploys~~ Argo rollouts often hangs with long running deploys May 17, 2021

jessesuen added the more-information-needed More information is needed to make progress label May 18, 2021

no-response bot removed the more-information-needed More information is needed to make progress label May 19, 2021

MarkSRobinson closed this as completed Jun 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Argo rollouts often hangs with long running deploys #1193

Argo rollouts often hangs with long running deploys #1193

MarkSRobinson commented May 17, 2021

jessesuen commented May 17, 2021

MarkSRobinson commented May 17, 2021

MarkSRobinson commented May 17, 2021

jessesuen commented May 17, 2021

MarkSRobinson commented May 17, 2021

jessesuen commented May 18, 2021

MarkSRobinson commented May 18, 2021

jessesuen commented May 18, 2021

MarkSRobinson commented May 18, 2021

jessesuen commented May 18, 2021

jessesuen commented May 18, 2021

jessesuen commented May 18, 2021 •

edited

Loading

MarkSRobinson commented May 19, 2021

jessesuen commented May 21, 2021

MarkSRobinson commented Jun 1, 2021

MarkSRobinson commented Jun 2, 2021

MarkSRobinson commented Jun 2, 2021 •

edited

Loading

MarkSRobinson commented Jun 15, 2021

Argo rollouts often hangs with long running deploys #1193

Argo rollouts often hangs with long running deploys #1193

Comments

MarkSRobinson commented May 17, 2021

Summary

Diagnostics

jessesuen commented May 17, 2021

MarkSRobinson commented May 17, 2021

MarkSRobinson commented May 17, 2021

jessesuen commented May 17, 2021

MarkSRobinson commented May 17, 2021

jessesuen commented May 18, 2021

MarkSRobinson commented May 18, 2021

jessesuen commented May 18, 2021

MarkSRobinson commented May 18, 2021

jessesuen commented May 18, 2021

jessesuen commented May 18, 2021

jessesuen commented May 18, 2021 • edited Loading

MarkSRobinson commented May 19, 2021

jessesuen commented May 21, 2021

MarkSRobinson commented Jun 1, 2021

MarkSRobinson commented Jun 2, 2021

MarkSRobinson commented Jun 2, 2021 • edited Loading

MarkSRobinson commented Jun 15, 2021

jessesuen commented May 18, 2021 •

edited

Loading

MarkSRobinson commented Jun 2, 2021 •

edited

Loading