Promote-full with traffic splitting doesn't wait for new pods to be ready #1580
Comments
This would be a bug. It is intended to behave the way you expected it to.
Hi @khhirani, is there any other information I can provide to help narrow down this issue and get a fix in place? It's a pretty serious issue and constrains the ability to promote. A possible workaround would be to sequentially promote quickly through all the canary steps, but #1581 prevents doing that easily in the dashboard, which makes it harder for end users. Thanks!
@bpoland Sorry for the delay! Could you please attach your Rollout spec as well as the controller logs? I'm not able to recreate this issue locally, so I'd like to try it with your rollout. |
@khhirani I am able to reproduce the issue consistently on my side; here are the logs from just now when I did it. Controller log: https://gist.github.com/bpoland/e63bfa02f2c7118e7060b56f2f74e14d

The deployment was initiated, and at that point, even though only 1/5 pods were ready, the controller immediately shifted 100% of traffic to the new RS and scaled down the old RS, as you can see in the controller log. Please let me know if I can provide any other information to help reproduce. We are using linkerd for traffic splitting, if that makes a difference. Thanks!
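For context, here is a minimal sketch of the kind of Rollout spec involved here: a canary strategy with SMI traffic routing (which is what linkerd consumes). All names, the replica count, and the step weights are illustrative placeholders, not the reporter's actual spec:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                       # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2             # the "new" version being rolled out
  strategy:
    canary:
      canaryService: my-app-canary   # assumed Service names
      stableService: my-app-stable
      trafficRouting:
        smi: {}                      # Argo Rollouts manages an SMI TrafficSplit; linkerd enforces it
      steps:
      - setWeight: 20
      - pause: {}
      - setWeight: 40
      - pause: {}
```

Running a full promotion from partway through these steps (e.g. `kubectl argo rollouts promote my-app --full`) is what triggers the behavior described above.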
@jessesuen We are able to reproduce this, but the behavior is the same with previous versions as well. I wanted to understand what the intended behavior would be.
And in the code, see trafficrouting.go L110.
When not using dynamicStableScale, does the same thing happen? It seems like the important part is that the weight gets updated immediately.
Even without dynamicStableScale, the weight is updated when doing promote-full, so the load balancer will send 100% of traffic to the canary.
Ahh sorry, I was confusing things in my head; yes, I see what you mean now. So without dynamicStableScale, does the "old" ReplicaSet also get scaled down immediately then? If so, this seems like a pretty serious issue with promote-full...
Without dynamicStableScale, the old ReplicaSet isn't scaled down immediately. That is the expected behavior.
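For anyone following along, dynamicStableScale is the 1.1 canary-strategy flag being discussed here. A minimal excerpt of where it sits in the spec (surrounding Rollout fields omitted, as in the sketch above):

```yaml
strategy:
  canary:
    dynamicStableScale: true   # 1.1 feature: scale the stable RS down as canary traffic weight increases
    trafficRouting:
      smi: {}
```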
Discussed with @jessesuen. The fix will be to wait for all pods to come up and then update the weight.
Hey folks, I've tested the fix and it does prevent the old pods from being scaled down early, which is great -- thanks! Unfortunately, it looks like there's still an issue with the traffic weighting: as soon as I promote-full, the traffic split is updated to show both set and actual weight as 100, and the linkerd traffic split shows that it's sending 100% of traffic to the stable version. This means that all traffic gets temporarily sent back to the old/broken version until all the new canary pods become ready, and then all traffic is switched to the new version at once when the canary version gets promoted to stable.

Instead, I would expect the "actual" weight sent to the canary to keep ramping up towards 100% as the canary pods become ready, then promote the canary RS to stable and update the traffic split so that the new pods continue to get 100% of the traffic.

Please let me know if it makes more sense to reopen this issue or whether I should submit a new issue for this. Thanks!
Oh, and this also means that even with dynamicStableScale enabled, all of the old/stable version pods remain up until the final switch happens. If the "actual" weight were being properly adjusted as the new canary pods came up, I think the old pods would start to scale down as more new pods became ready, right?
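To illustrate the expectation in the two comments above: as canary pods become ready after a promote-full, the SMI TrafficSplit would gradually shift weight toward the canary (and, with dynamicStableScale, the stable RS would shrink in step). A sketch of such an intermediate state, with hypothetical names and weights (here roughly 2/5 canary pods ready):

```yaml
apiVersion: split.smi-spec.io/v1alpha1   # SMI apiVersion may differ per cluster
kind: TrafficSplit
metadata:
  name: my-app
spec:
  service: my-app               # root (apex) service
  backends:
  - service: my-app-stable
    weight: 60                  # still taking most traffic
  - service: my-app-canary
    weight: 40                  # ramping up as pods become ready
```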
@bpoland When you do promote-full, the traffic weight won't get updated at that time. That is a limitation of promote-full: weights won't get automatically adjusted while canary pods are coming up.
And for dynamicStableScale, yes: when canary pods are coming up, the stable pods get adjusted.
Hmm, that's not what I'm seeing: as soon as I promote-full, the linkerd traffic split gets updated to 100 stable, 0 canary. I see that in the linkerd console, and it matches what the rollout is showing (set and actual weight are both 100 right after the promote). I thought this type of situation was what the set/actual weight difference was for: I would have expected the set weight to be 100 right after the promote-full, and then the actual weight to continue to increase as the new canary pods become ready. What is the difference between set and actual weight, then?
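For contrast with the expected state sketched earlier, the state being reported here right after promote-full (again with hypothetical names): the TrafficSplit is pinned back to the stable service even though the canary is mid-rollout:

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-app
spec:
  service: my-app
  backends:
  - service: my-app-stable
    weight: 100                 # all traffic back on the old version
  - service: my-app-canary
    weight: 0                   # even while canary pods are still coming up
```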
Created a follow-up since I think things are still broken after this fix.
Summary
When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.
Instead, it should immediately scale up the new RS to full size, but only adjust the traffic split as the new pods become ready. I had thought this was the difference between set weight and actual weight, but that doesn't seem to be the case (maybe someone could help me understand what the difference is, then?).
As a side effect when using the new dynamicStableScale feature in 1.1, this means that the old RS gets immediately scaled down and we could be left with very few running/ready pods. I think this is a symptom of the above root cause though, and I guess the traffic split is set to send all traffic to the new RS anyway.

Diagnostics
Rollouts 1.1
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.