Promote-full with Traffic Splitting sends traffic to non-ready ReplicaSet (still) #1681

Closed
bpoland opened this issue Dec 3, 2021 · 1 comment · Fixed by #1683
Labels: bug (Something isn't working)


bpoland commented Dec 3, 2021

Summary

This is a follow-up to #1580 and the discussion at the end of that issue. I am still seeing issues with promote-full, particularly when it is triggered near the middle or end of the canary steps.

Here's a timeline of a test that I just performed. Note that 7:00 here corresponds with 12:00 in the controller log below.

7:13 - deployment initiated

  • one new pod starts up with 1% of traffic
  • over the next 12 minutes, we progress through the defined canary steps

7:25 - 60% of traffic is being sent to canary, corresponding with the defined canary steps

7:26 - I trigger a promote-full

  • The traffic split is immediately updated to send 100% of traffic to the stable revision (which is still the old ReplicaSet)
  • All of the old/stable RS pods start terminating and 4 new pods are started (I haven't seen this happen before 😟 )
  • This means there are no ready pods to serve traffic, and almost nothing gets through until the replacement pods for the old RS start

7:28 - replacement pods in the old RS become ready

  • they start serving traffic again, but still nearly 100% of traffic is going to the old version

7:31 - all pods in the new canary RS become ready

  • canary RS is promoted to stable
  • all traffic is flipped immediately from old to new

Here is a graph of traffic during this test:
[image: traffic graph during the first test]

So there are 2 problems here:

  1. When using promote-full, all traffic is shifted back to the old stable RS until all the new canary RS pods start and the canary is promoted to be the new stable RS. Based on the discussion in the previous issue, the traffic split should instead be frozen where it is until the promotion happens (in this case, 60% of traffic continuing to the new RS, then jumping to 100% once the rest of the pods are ready). As I said in the other issue, it would be even better if, instead of freezing the traffic split, the weight progressively increased based on the percentage of pods available in the new RS, but I understand that's more work.
  2. The old/stable RS pods should not have been restarted when I used promote-full. I have not seen this happen before, and it's pretty scary that traffic was dropped entirely from 7:26 to 7:28 while those pods came back up. It's a good thing this was just a test environment.

I did another test afterwards where I started a deployment and then immediately used promote-full. This behaved pretty much as expected: all traffic stayed on the old/stable version and then flipped to the new version as soon as all new pods were ready, and the stable pods were not restarted. So issue 2 doesn't seem to happen every time. Here's the traffic during the second test:
[image: traffic graph during the second test]

Diagnostics

What version of Argo Rollouts are you running?
v1.1.1 with linkerd traffic split

Controller log: https://gist.github.com/bpoland/c9a0053832700b27e1525c14c4e81035


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


nissy34 commented Dec 6, 2021

Hi,
I was facing the same issue.

In my opinion, the solution is to calculate the weight based on the canary's available replicas:

		} else if c.rollout.Status.PromoteFull {
			// On a promote-full, the desired stable weight should be 0 (100% to canary),
			// but we can only increase the canary weight according to the canary's
			// available replica count, so desiredWeight must be 0 while the newRS
			// has no available replicas.
			newRSAvailableReplicas := int32(0)
			if c.newRS != nil {
				newRSAvailableReplicas = c.newRS.Status.AvailableReplicas
			}
			desiredWeight = (100 * newRSAvailableReplicas) / *c.rollout.Spec.Replicas
		}

I opened a PR #1683 with this fix.
In my tests locally it fixed the issue.
