This is a follow-up to #1580 and the discussion at the end. I am still seeing issues with promote-full, particularly when it's triggered near the middle or end of the canary steps.
Here's a timeline of a test that I just performed. Note that 7:00 here corresponds with 12:00 in the controller log below.
7:13 - deployment initiated
  one new pod starts up with 1% of traffic
  over the next 12 minutes, we progress through the defined canary steps
7:25 - 60% of traffic is being sent to the canary, corresponding with the defined canary steps
7:26 - I trigger a promote-full
  the traffic split is immediately updated to send 100% of traffic to the stable revision (which is still the old ReplicaSet)
  all of the old/stable RS pods start terminating and 4 new pods are started (I haven't seen this happen before 😟)
  this means there are no ready pods to serve traffic, and you can see that nearly nothing gets through until the replacement pods for the old RS start
7:28 - replacement pods in the old RS become ready
  they start serving traffic again, but still nearly 100% of traffic is going to the old version
7:31 - all pods in the new canary RS become ready
  the canary RS is promoted to stable
  all traffic is flipped immediately from old to new
Here is a graph of traffic during this test:
So there are two problems here:
1. When using promote-full, all traffic is shifted back to the old stable RS until all of the new canary RS pods start and it gets promoted to be the new stable RS. Based on the discussion in the previous issue, the traffic split should be frozen where it is until the promotion happens (e.g. in this case, 60% of traffic keeps going to the new RS, then jumps to 100% when the rest of the pods are ready). As I said in the other issue, it would be awesome if, instead of freezing the traffic split, the weight progressively increased based on the percentage of pods available in the new RS (sketched right after this list), but I understand that's more work.
2. The old/stable RS pods should not have been restarted when I used promote-full. I have not seen this happen before, but it's pretty scary that traffic was totally dropped from 7:26 to 7:28 while the pods came back up. It's a good thing this was just a test environment.
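To make that concrete, here is a minimal sketch (a hypothetical helper, not the actual controller code) combining the two ideas from problem 1: never drop the canary weight below the step already reached, and let it climb as canary pods become available.

package main

import "fmt"

// canaryWeightOnPromoteFull is a hypothetical helper sketching the behaviour
// described in problem 1: on promote-full, hold the weight already reached in
// the canary steps and only move it up as canary pods become available.
func canaryWeightOnPromoteFull(currentWeight, availableCanaryReplicas, desiredReplicas int32) int32 {
    if desiredReplicas == 0 {
        return currentWeight
    }
    // weight proportional to how much of the canary is actually able to serve traffic
    proportional := (100 * availableCanaryReplicas) / desiredReplicas
    if proportional < currentWeight {
        // never fall below the weight already reached in the canary steps
        return currentWeight
    }
    if proportional > 100 {
        return 100
    }
    return proportional
}

func main() {
    // roughly the numbers from the timeline above: 60% reached, canary scaling from 1 to 4 pods
    for _, available := range []int32{1, 2, 4} {
        fmt.Println(canaryWeightOnPromoteFull(60, available, 4)) // prints 60, 60, 100
    }
}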
I did another test afterwards where I started a deployment and then immediately used promote-full. This behaved pretty much as expected: all traffic stayed on the old/stable version and then flipped to the new version as soon as all of the new pods were ready, and the stable pods didn't get restarted. So issue 2 doesn't seem to happen all the time. Here's the traffic during the second test:
Diagnostics
What version of Argo Rollouts are you running?
v1.1.1 with linkerd traffic split
Controller log: https://gist.github.com/bpoland/c9a0053832700b27e1525c14c4e81035
The solution, in my opinion, is to calculate the weight based on the canary's available replicas:
} else if c.rollout.Status.PromoteFull {
    // On a promote-full, the desired stable weight should be 0 (100% to canary),
    // but we can only increase the canary weight according to the available replica count of the canary,
    // so we need to set the desiredWeight to 0 when the newRS is not available.
    newRSAvailableReplicas := int32(0)
    if c.newRS != nil {
        newRSAvailableReplicas = c.newRS.Status.AvailableReplicas
    }
    desiredWeight = (100 * newRSAvailableReplicas) / *c.rollout.Spec.Replicas
}
I opened a PR #1683 with this fix.
In my local tests it fixed the issue.
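For illustration (assuming spec.replicas is 4, matching the four replacement pods started in the timeline above), this works out to a desiredWeight of 25 with one canary pod available, 50 with two, and 100 once all four are ready, so traffic ramps up with canary capacity instead of snapping back to the old stable RS as in the first test.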
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.