
Promote-full with traffic splitting doesn't wait for new pods to be ready #1580

Closed · bpoland opened this issue Oct 14, 2021 · 16 comments · Fixed by #1663
Labels: bug (Something isn't working), cherry-pick/release-1.1

bpoland (Contributor) commented Oct 14, 2021

Summary

When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.

Instead, it should immediately scale up the new RS to full size, but only adjust the traffic split as the new pods become ready. I had thought this was the difference between set weight and actual weight, but that doesn't seem to be the case (maybe someone could help me understand what the difference is, then?).

As a side effect when using the new dynamicStableScale feature in 1.1, this means that the old RS gets immediately scaled down and we could be left with very few running/ready pods. I think this is a symptom of the above root cause though, and I guess the traffic split is set to send all traffic to the new RS anyway.
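
To make the expectation concrete, here is a minimal sketch (illustrative Go only; the helper name and types are made up and are not the Argo Rollouts API) of the relationship I would expect between ready canary pods and the traffic weight after a promote-full:

package main

import "fmt"

// expectedCanaryWeight is a hypothetical helper (not controller code). After a
// promote-full, the new RS should be scaled to the full replica count right
// away, but the weight handed to the traffic router should only follow the
// fraction of canary pods that are actually ready.
func expectedCanaryWeight(readyCanaryPods, totalReplicas int32) int32 {
	if totalReplicas == 0 {
		return 0
	}
	w := readyCanaryPods * 100 / totalReplicas
	if w > 100 {
		w = 100
	}
	return w
}

func main() {
	// 1 of 5 canary pods ready: only ~20% of traffic should shift, not 100%.
	fmt.Println(expectedCanaryWeight(1, 5)) // 20
	fmt.Println(expectedCanaryWeight(5, 5)) // 100
}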

Diagnostics

Rollouts 1.1


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

bpoland added the bug label on Oct 14, 2021
harikrongali modified the milestone: v1.2 on Oct 15, 2021
jessesuen (Member) commented

> When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.

This would be a bug. It is intended to behave the way you expected it to.

bpoland (Contributor, Author) commented Nov 5, 2021

Hi @khhirani, is there any other information I can provide to help narrow down this issue and get a fix in place? It's a pretty serious issue and constrains the ability to promote. A possible workaround would be to quickly promote sequentially through all the canary steps, but #1581 prevents doing that easily in the dashboard, which makes it harder for end users.

Thanks!

khhirani (Contributor) commented Nov 8, 2021

@bpoland Sorry for the delay! Could you please attach your Rollout spec as well as the controller logs? I'm not able to recreate this issue locally, so I'd like to try it with your rollout.

bpoland (Contributor, Author) commented Nov 9, 2021

@khhirani I am able to reproduce the issue consistently on my side; here are the logs from just now when I did it:

Controller log: https://gist.github.com/bpoland/e63bfa02f2c7118e7060b56f2f74e14d
Rollout yaml: https://gist.github.com/bpoland/f4628a6039a150e00bbfd2b042381248
Replicaset yaml (from after issue): https://gist.github.com/bpoland/4a1d951d8a9687cd0fa121ac8621b8d5

The deployment was initiated at 2021-11-09T00:00:07Z by changing the image, and a new pod started coming up. I waited until the first pod was ready and then initiated a promote-full around 2021-11-09T00:02:56Z.

At that point even though only 1/5 pods were ready, the controller immediately shifted 100% of traffic to the new RS and scaled down the old RS, as you can see in the controller log.

Please let me know if I can provide any other information to help reproduce. We are using linkerd for traffic splitting, if that makes a difference. Thanks!

harikrongali (Contributor) commented Nov 16, 2021

@jessesuen We are able to reproduce this, but the behavior is the same with previous versions as well. I wanted to understand what the intended behavior should be.
When promote-full happens, the current step in the status is set to the last step and the step weight to 100%, so when setting the desired weight we set the weight of that step while we are still bringing up the replicas.

Name:            alb-rollout
Namespace:       smi-bug
Status:          Progressing
Strategy:        Canary
  Step:          14/14
  SetWeight:     100
  ActualWeight:  100

And in the code, trafficrouting.go L110:

if !atDesiredReplicaCount {
	// Use the previous weight since the new RS is not ready for a new weight
	for i := *index - 1; i >= 0; i-- {
		step := c.rollout.Spec.Strategy.Canary.Steps[i]
		if step.SetWeight != nil {
			desiredWeight = *step.SetWeight
			break
		}
	}
}
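
For context, atDesiredReplicaCount in that block is a boolean computed earlier in the controller before the weight is chosen. A rough, hypothetical stand-in for that gate (not the real helper, which likely also considers the stable and older ReplicaSets) would be:

// Hypothetical stand-in for the readiness gate used above; illustrative only.
func atDesiredReplicaCount(desiredReplicas, availableReplicas int32) bool {
	return availableReplicas >= desiredReplicas
}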

bpoland (Contributor, Author) commented Nov 16, 2021

When not using dynamicStableScale, I don't think this is an issue, right? All the "old" pods should still be up and running, so they should be able to immediately handle all traffic.

It seems like the important part is that ActualWeight shouldn't go higher than n, where n is floor(100 * ready canary pods / total pods) (for example, with 2 of 5 pods ready, n would be 40). I think that's what the scale-down should be based on, right?
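
To spell that out, here is a hypothetical sketch (not anything in the codebase) of the cap and the stable scale-down it would drive under dynamicStableScale:

// actualCanaryWeight caps the weight reported to the traffic router by the
// fraction of canary pods that are ready. Hypothetical helper, illustrative only.
func actualCanaryWeight(setWeight, readyCanary, totalPods int32) int32 {
	if totalPods == 0 {
		return 0
	}
	maxByReadiness := readyCanary * 100 / totalPods
	if setWeight > maxByReadiness {
		return maxByReadiness
	}
	return setWeight
}

// minStableReplicas keeps enough stable pods to serve the share of traffic
// not yet shifted to the canary (rounded up to be safe).
func minStableReplicas(totalPods, canaryWeight int32) int32 {
	return (totalPods*(100-canaryWeight) + 99) / 100
}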

harikrongali (Contributor) commented

Even without dynamicStableScale, the weight is updated when doing promote-full, so the load balancer will send 100% of traffic to the canary.

bpoland (Contributor, Author) commented Nov 16, 2021

Ahh sorry, I was confusing things in my head; yes, I see what you mean now. So without dynamicStableScale, does the "old" ReplicaSet also get scaled down immediately? If so, this seems like a pretty serious issue with promote-full...

harikrongali (Contributor) commented

Without dynamicStableScale, the old ReplicaSet isn't scaled down immediately. That is the expected behavior.

harikrongali (Contributor) commented Nov 17, 2021

Discussed with @jessesuen. The fix will be to wait for all pods to come up and then update the weight. In the following block, we need another check for promote-full:

if !atDesiredReplicaCount && !promoteFull {
	// Use the previous weight since the new RS is not ready for a new weight
	for i := *index - 1; i >= 0; i-- {
		step := c.rollout.Spec.Strategy.Canary.Steps[i]
		if step.SetWeight != nil {
			desiredWeight = *step.SetWeight
			break
		}
	}
}
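
For reference, one way the promoteFull flag could be wired in (a hedged guess only; the eventual fix in #1663 may plumb it differently) is to read the pending promote-full state off the rollout status:

// Hedged sketch: assumes a boolean on the rollout status that is set while a
// promote-full is pending (the exact field and plumbing are assumptions here).
promoteFull := c.rollout.Status.PromoteFull

if !atDesiredReplicaCount && !promoteFull {
	// keep the previous step's SetWeight while the new RS scales up
}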

bpoland (Contributor, Author) commented Nov 21, 2021

Hey folks, I've tested the fix and it does prevent the old pods from being scaled down early which is great -- thanks!

Unfortunately it looks like there's still an issue with the traffic weighting -- as soon as I promote-full, the traffic split is updated to show both set and actual weight as 100, and the linkerd traffic split shows that it's sending 100% of traffic to the stable version. This means that all traffic gets temporarily sent back to the old/broken version until all the new canary pods become ready, and then all traffic is switched to the new version at once when the canary version gets promoted to stable.

Instead, I would expect the "actual" weight sent to the canary to keep ramping up towards 100% as the canary pods become ready, then promote the canary RS to stable and update the traffic split so that the new pods continue to get 100%.

Please let me know if it makes more sense to reopen this issue or whether I should submit a new issue for this. Thanks!

bpoland (Contributor, Author) commented Nov 21, 2021

Oh and this also means that even with dynamicStableScale enabled, all of the old/stable version pods remain up until the final switch happens. If the "actual" weight were being properly adjusted as the new canary pods came up, I think the old pods would start to scale down as more new pods became ready, right?

harikrongali (Contributor) commented

@bpoland When you do promote-full, the traffic weight won't get updated at that time.
For example: stable 90, canary 10.
Promote-full is executed.
At this moment it stays at stable 90, canary 10, and these values won't be updated until all canary pods are ready.
After all pods are ready, it becomes canary 100, stable 0.

That will be a limitation of promote-full: weights won't get automatically adjusted while canary pods are coming up.
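
Roughly, the sequencing described above looks like this (illustrative sketch only, not controller code):

// weightsDuringPromoteFull returns the (stable, canary) weights while a
// promote-full is in progress: the last step's split is held until every
// canary pod is ready, then traffic flips to the canary all at once.
func weightsDuringPromoteFull(lastStable, lastCanary, readyCanary, desiredCanary int32) (stable, canary int32) {
	if readyCanary < desiredCanary {
		return lastStable, lastCanary // e.g. 90/10 is held
	}
	return 0, 100
}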

harikrongali (Contributor) commented

And for dynamicStableScale, yes: when canary pods are coming up, the stable pods get adjusted accordingly.

bpoland (Contributor, Author) commented Nov 21, 2021

Hmm, that's not what I'm seeing -- as soon as I promote-full, the linkerd traffic split gets updated to 100 stable, 0 canary. I see that in the linkerd console, and it matches what the rollout is showing (set and actual weight are both 100 right after the promote).

I thought this type of situation was what the set/actual weight differences were for -- I would have expected the set weight to be 100 right after the promote-full, and then the actual weight to continue to increase as the new canary pods become ready. What is the difference between set and actual weight then?

bpoland (Contributor, Author) commented Dec 3, 2021

Created a follow-up issue since I think things are still broken after this fix.
