
Promote-full with traffic splitting doesn't wait for new pods to be ready #1580

Closed · bpoland opened this issue Oct 14, 2021 · 16 comments · Fixed by #1663
Labels: bug (Something isn't working), cherry-pick/release-1.1

bpoland (Contributor) commented Oct 14, 2021

Summary

When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.

Instead, it should immediately scale up the new RS to full size, but only adjust the traffic split as the new pods become ready. I had thought this was the difference between set weight and actual weight, but that doesn't seem to be the case (maybe someone could help me understand what the difference is, then?).

As a side effect when using the new dynamicStableScale feature in 1.1, this means that the old RS gets immediately scaled down and we could be left with very few running/ready pods. I think this is a symptom of the above root cause though, and I guess the traffic split is set to send all traffic to the new RS anyway.
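
To make the expectation concrete, here is a minimal sketch (illustrative Go only; the helper name and types are made up and are not the Argo Rollouts API) of the relationship I would expect between ready canary pods and the traffic weight after a promote-full:

package main

import "fmt"

// expectedCanaryWeight is a hypothetical helper (not controller code). After a
// promote-full, the new RS should be scaled to the full replica count right
// away, but the weight handed to the traffic router should only follow the
// fraction of canary pods that are actually ready.
func expectedCanaryWeight(readyCanaryPods, totalReplicas int32) int32 {
	if totalReplicas == 0 {
		return 0
	}
	w := readyCanaryPods * 100 / totalReplicas
	if w > 100 {
		w = 100
	}
	return w
}

func main() {
	// 1 of 5 canary pods ready: only ~20% of traffic should shift, not 100%.
	fmt.Println(expectedCanaryWeight(1, 5)) // 20
	fmt.Println(expectedCanaryWeight(5, 5)) // 100
}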

Diagnostics

Rollouts 1.1


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

bpoland added the bug label on Oct 14, 2021
harikrongali modified the milestone: v1.2 on Oct 15, 2021
jessesuen (Member) commented

> When using the promote-full option with traffic splitting, 100% of traffic is immediately sent to the new ReplicaSet, even if it is not scaled up enough to handle the traffic.

This would be a bug. It is intended to behave the way you expected it to.

bpoland (Contributor, Author) commented Nov 5, 2021

Hi @khhirani, is there any other information I can provide to help narrow down this issue and get a fix in place? It's a pretty serious issue and constrains the ability to promote. A possible workaround would be to quickly promote sequentially through all the canary steps, but #1581 prevents doing that easily in the dashboard, which makes it harder for end users.

Thanks!

khhirani (Contributor) commented Nov 8, 2021

@bpoland Sorry for the delay! Could you please attach your Rollout spec as well as the controller logs? I'm not able to recreate this issue locally, so I'd like to try it with your rollout.

bpoland (Contributor, Author) commented Nov 9, 2021

@khhirani I am able to reproduce the issue consistently on my side; here are the logs from just now when I did it:

Controller log: https://gist.github.com/bpoland/e63bfa02f2c7118e7060b56f2f74e14d
Rollout yaml: https://gist.github.com/bpoland/f4628a6039a150e00bbfd2b042381248
Replicaset yaml (from after issue): https://gist.github.com/bpoland/4a1d951d8a9687cd0fa121ac8621b8d5

The deployment was initiated at 2021-11-09T00:00:07Z by changing the image, and a new pod started coming up. I waited until the first pod was ready and then initiated a promote-full around 2021-11-09T00:02:56Z.

At that point even though only 1/5 pods were ready, the controller immediately shifted 100% of traffic to the new RS and scaled down the old RS, as you can see in the controller log.

Please let me know if I can provide any other information to help reproduce. We are using linkerd for traffic splitting, if that makes a difference. Thanks!

harikrongali (Contributor) commented Nov 16, 2021

@jessesuen We are able to reproduce this, but the behavior is the same with previous versions as well. I wanted to understand what the intended behavior should be.
When promote-full happens, the current step in the status is set to the last step and the step weight to 100%, so when setting the desired weight we set the weight of that step while we are still bringing up the replicas.

Name:            alb-rollout
Namespace:       smi-bug
Status:          Progressing
Strategy:        Canary
  Step:          14/14
  SetWeight:     100
  ActualWeight:  100

And in the code, trafficrouting.go L110:

if !atDesiredReplicaCount {
	// Use the previous weight since the new RS is not ready for a new weight
	for i := *index - 1; i >= 0; i-- {
		step := c.rollout.Spec.Strategy.Canary.Steps[i]
		if step.SetWeight != nil {
			desiredWeight = *step.SetWeight
			break
		}
	}
}
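
For context, atDesiredReplicaCount in that block is a boolean computed earlier in the controller before the weight is chosen. A rough, hypothetical stand-in for that gate (not the real helper, which likely also considers the stable and older ReplicaSets) would be:

// Hypothetical stand-in for the readiness gate used above; illustrative only.
func atDesiredReplicaCount(desiredReplicas, availableReplicas int32) bool {
	return availableReplicas >= desiredReplicas
}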

bpoland (Contributor, Author) commented Nov 16, 2021

When not using dynamicStableScale, I don't think this is an issue, right? All the "old" pods should still be up and running, so they should be able to immediately handle all traffic.

It seems like the important part is that ActualWeight shouldn't go higher than n, where n is floor(100 * ready canary pods / total pods) (for example, with 2 of 5 pods ready, n would be 40). I think that's what the scale-down should be based on, right?
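
To spell that out, here is a hypothetical sketch (not anything in the codebase) of the cap and the stable scale-down it would drive under dynamicStableScale:

// actualCanaryWeight caps the weight reported to the traffic router by the
// fraction of canary pods that are ready. Hypothetical helper, illustrative only.
func actualCanaryWeight(setWeight, readyCanary, totalPods int32) int32 {
	if totalPods == 0 {
		return 0
	}
	maxByReadiness := readyCanary * 100 / totalPods
	if setWeight > maxByReadiness {
		return maxByReadiness
	}
	return setWeight
}

// minStableReplicas keeps enough stable pods to serve the share of traffic
// not yet shifted to the canary (rounded up to be safe).
func minStableReplicas(totalPods, canaryWeight int32) int32 {
	return (totalPods*(100-canaryWeight) + 99) / 100
}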

harikrongali (Contributor) commented

Even without dynamicStableScale, the weight is updated when doing promote-full, so the load balancer will send 100% of traffic to the canary.

bpoland (Contributor, Author) commented Nov 16, 2021

Ahh sorry, I was confusing things in my head; yes, I see what you mean now. So without dynamicStableScale, does the "old" ReplicaSet also get scaled down immediately? If so, this seems like a pretty serious issue with promote-full...

harikrongali (Contributor) commented

Without dynamicStableScale, the old ReplicaSet isn't scaled down immediately. That is the expected behavior.

harikrongali (Contributor) commented Nov 17, 2021

Discussed with @jessesuen. The fix will be to wait for all pods to come up and then update the weight. In the following block, we need another check for promote-full:

if !atDesiredReplicaCount && !promoteFull {
	// Use the previous weight since the new RS is not ready for a new weight
	for i := *index - 1; i >= 0; i-- {
		step := c.rollout.Spec.Strategy.Canary.Steps[i]
		if step.SetWeight != nil {
			desiredWeight = *step.SetWeight
			break
		}
	}
}
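
For reference, one way the promoteFull flag could be wired in (a hedged guess only; the eventual fix in #1663 may plumb it differently) is to read the pending promote-full state off the rollout status:

// Hedged sketch: assumes a boolean on the rollout status that is set while a
// promote-full is pending (the exact field and plumbing are assumptions here).
promoteFull := c.rollout.Status.PromoteFull

if !atDesiredReplicaCount && !promoteFull {
	// keep the previous step's SetWeight while the new RS scales up
}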

bpoland (Contributor, Author) commented Nov 21, 2021

Hey folks, I've tested the fix and it does prevent the old pods from being scaled down early which is great -- thanks!

Unfortunately it looks like there's still an issue with the traffic weighting -- as soon as I promote-full, the traffic split is updated to show both set and actual weight as 100, and the linkerd traffic split shows that it's sending 100% of traffic to the stable version. This means that all traffic gets temporarily sent back to the old/broken version until all the new canary pods become ready, and then all traffic is switched to the new version at once when the canary version gets promoted to stable.

Instead, I would expect the "actual" weight sent to the canary to keep ramping up towards 100% as the canary pods become ready, then promote the canary RS to stable and update the traffic split so that the new pods continue to get 100%.

Please let me know if it makes more sense to reopen this issue or whether I should submit a new issue for this. Thanks!

bpoland (Contributor, Author) commented Nov 21, 2021

Oh and this also means that even with dynamicStableScale enabled, all of the old/stable version pods remain up until the final switch happens. If the "actual" weight were being properly adjusted as the new canary pods came up, I think the old pods would start to scale down as more new pods became ready, right?

harikrongali (Contributor) commented

@bpoland When you do promote-full, the traffic weight won't get updated at that time.
For example: stable 90, canary 10.
Promote-full is executed.
At this moment it stays at stable 90, canary 10, and these values won't be updated until all canary pods are ready.
After all pods are ready, it becomes canary 100, stable 0.

That will be a limitation of promote-full: weights won't get automatically adjusted while canary pods are coming up.
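
Roughly, the sequencing described above looks like this (illustrative sketch only, not controller code):

// weightsDuringPromoteFull returns the (stable, canary) weights while a
// promote-full is in progress: the last step's split is held until every
// canary pod is ready, then traffic flips to the canary all at once.
func weightsDuringPromoteFull(lastStable, lastCanary, readyCanary, desiredCanary int32) (stable, canary int32) {
	if readyCanary < desiredCanary {
		return lastStable, lastCanary // e.g. 90/10 is held
	}
	return 0, 100
}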

harikrongali (Contributor) commented

And for dynamicStableScale, yes: when canary pods are coming up, the stable pods get adjusted accordingly.

bpoland (Contributor, Author) commented Nov 21, 2021

Hmm, that's not what I'm seeing -- as soon as I promote-full, the linkerd traffic split gets updated to 100 stable, 0 canary. I see that in the linkerd console, and it matches what the rollout is showing (set and actual weight are both 100 right after the promote).

I thought this type of situation was what the set/actual weight differences were for -- I would have expected the set weight to be 100 right after the promote-full, and then the actual weight to continue to increase as the new canary pods become ready. What is the difference between set and actual weight then?

bpoland (Contributor, Author) commented Dec 3, 2021

Created a follow-up issue since I think things are still broken after this fix.
