Promote-full with Traffic Splitting sends traffic to non-ready ReplicaSet (still) #1681

Closed
bpoland opened this issue Dec 3, 2021 · 1 comment · Fixed by #1683
Labels: bug (Something isn't working)


bpoland commented Dec 3, 2021

Summary

This is a follow-up to #1580 and the discussion at the end of that issue. I am still seeing issues with promote-full, particularly when it is triggered near the middle or end of the canary steps.

Here's a timeline of a test that I just performed. Note that 7:00 here corresponds with 12:00 in the controller log below.

7:13 - deployment initiated

  • one new pod starts up with 1% of traffic
  • over the next 12 minutes, we progress through the defined canary steps

7:25 - 60% of traffic is being sent to canary, corresponding with the defined canary steps

7:26 - I trigger a promote-full

  • The traffic split is immediately updated to send 100% of traffic to the stable revision (which is still the old ReplicaSet)
  • All of the old/stable RS pods start terminating and 4 new pods are started (I haven't seen this happen before 😟 )
  • This means there are no ready pods to serve traffic, and almost nothing gets through until the replacement pods for the old RS start

7:28 - replacement pods in the old RS become ready

  • they start serving traffic again, but still nearly 100% of traffic is going to the old version

7:31 - all pods in the new canary RS become ready

  • canary RS is promoted to stable
  • all traffic is flipped immediately from old to new

Here is a graph of traffic during this test:
[image: traffic graph during the first test]

So there are 2 problems here:

  1. When using promote-full, all traffic is shifted back to the old stable RS until all the new canary RS pods start and the canary is promoted to be the new stable RS. Based on the discussion in the previous issue, the traffic split should instead be frozen where it is until the promotion happens (in this case, 60% of traffic continuing to the new RS, then jumping to 100% once the rest of the pods are ready). As I said in the other issue, it would be even better if, instead of freezing the traffic split, the weight progressively increased based on the percentage of pods available in the new RS, but I understand that's more work.
  2. The old/stable RS pods should not have been restarted when I used promote-full. I have not seen this happen before, and it's pretty scary that traffic was dropped entirely from 7:26 to 7:28 while those pods came back up. It's a good thing this was just a test environment.

I did another test afterwards where I started a deployment and then immediately used promote-full. This behaved pretty much as expected: all traffic stayed on the old/stable version and then flipped to the new version as soon as all new pods were ready, and the stable pods were not restarted. So issue 2 doesn't seem to happen every time. Here's the traffic during the second test:
[image: traffic graph during the second test]

Diagnostics

What version of Argo Rollouts are you running?
v1.1.1 with linkerd traffic split

Controller log: https://gist.github.com/bpoland/c9a0053832700b27e1525c14c4e81035


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


nissy34 commented Dec 6, 2021

Hi,
I was facing the same issue.

In my opinion, the solution is to calculate the weight based on the canary's available replicas:

		} else if c.rollout.Status.PromoteFull {
			// On a promote-full, the desired stable weight should be 0 (100% to canary),
			// but we can only increase the canary weight according to the canary's
			// available replica count, so desiredWeight must be 0 while the newRS
			// has no available replicas.
			newRSAvailableReplicas := int32(0)
			if c.newRS != nil {
				newRSAvailableReplicas = c.newRS.Status.AvailableReplicas
			}
			desiredWeight = (100 * newRSAvailableReplicas) / *c.rollout.Spec.Replicas
		}

I opened a PR #1683 with this fix.
In my tests locally it fixed the issue.
