Ability to control stable scale when using traffic routing #1029

Closed
jessesuen opened this issue Mar 10, 2021 · 25 comments · Fixed by #1430
Labels: enhancement (New feature or request), traffic-routing

@jessesuen
Member

Summary

When using traffic routing with canary, the stable stack is never scaled down until the rollout is fully promoted. This means we eventually end up with double the total number of replicas in the environment (when the traffic weight reaches 100). The reasons for doing so are explained here: #430 (comment)

The reason for this is that when a mesh or ingress like Istio is being used, users will now be able to shift traffic much more rapidly/sporadically (e.g. going from 1% to 99% in an extreme example).

During design, it was an important requirement to be able to abort a rollout and immediately go from XX% traffic back down to 0%, without it being delayed/prevented by external factors such as ReplicaSet and pod orchestration. So the decision was made to keep the canary stack the same size as the stable stack (i.e. simply keep replica counts in sync) for the duration of the update.

However, some users have requested that the stable stack be scaled inversely to the canary weight, in order to save on resources (e.g. bare-metal clusters where resource provisioning cannot be doubled). For these users, it is acceptable to roll back slowly during an abort, and they are willing to sacrifice instantaneous rollbacks.

This proposal is to offer a setStableScale option, which would set the number of stable replicas to be the inverse of the canary weight.

steps:
- setStableScale:
    matchTrafficWeight: true  # it would be inverse of canary weight (default when using basic canary)
- setStableScale:
    weight: 100   # default behavior of canary using trafficRouting
- setStableScale:
    replicas: 4     # not sure anyone would need this option, but can have it to be consistent with setCanaryScale

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@jessesuen added the enhancement (New feature or request) label on Mar 10, 2021
@nestorsokil

Hey, came in to create a ticket for this and then saw it's already here :)
I think that's a very important feature to have. For my use case, the bloated pod count during canary is probably the only problem with using Rollouts.

@jessesuen
Member Author

jessesuen commented Jul 15, 2021

Thinking about this some more, we probably don't need the flexibility of implementing this as a new canary step type; it can simply be a boolean or constant value to indicate the behavior of the stable scale, e.g.:

spec:
  strategy:
    canary:
      dynamicStableScale: true

or

spec:
  strategy:
    canary:
      stableScale: Full | Dynamic  # Names are TBD (Full == existing behavior of leaving stable scaled to 100%, Dynamic == inverse of canary weight)

@jessesuen
Member Author

jessesuen commented Jul 15, 2021

The first change is that we recalculate the stable replica count returned during an update. Currently we return 100%.

func CalculateReplicaCountsForCanary(rollout *v1alpha1.Rollout, newRS *appsv1.ReplicaSet, stableRS *appsv1.ReplicaSet, oldRSs []*appsv1.ReplicaSet) (int32, int32) {
	rolloutSpecReplica := defaults.GetReplicasOrDefault(rollout.Spec.Replicas)
	replicas, weight := GetCanaryReplicasOrWeight(rollout)
	if replicas != nil {
		return *replicas, rolloutSpecReplica
	}

	desiredStableRSReplicaCount := int32(math.Ceil(float64(rolloutSpecReplica) * (1 - (float64(weight) / 100))))
	desiredNewRSReplicaCount := int32(math.Ceil(float64(rolloutSpecReplica) * (float64(weight) / 100)))

	if rollout.Spec.Strategy.Canary.TrafficRouting != nil {
		if !rollout.Spec.Strategy.Canary.DynamicStableScale {
			// dynamic stable scaling disabled: keep the stable ReplicaSet fully scaled
			return desiredNewRSReplicaCount, rolloutSpecReplica
		}
		// dynamic stable scaling enabled: fall through to the basic canary calculation below
	}
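
For illustration only, a standalone sketch of the ceiling math above (the helper name and example numbers are mine, not the controller's API):

package main

import (
	"fmt"
	"math"
)

// desiredCounts mirrors the two ceil() expressions above: the canary count follows
// the weight, and the stable count is its inverse.
func desiredCounts(specReplicas, weight int32) (canary, stable int32) {
	stable = int32(math.Ceil(float64(specReplicas) * (1 - float64(weight)/100)))
	canary = int32(math.Ceil(float64(specReplicas) * float64(weight) / 100))
	return canary, stable
}

func main() {
	// e.g. spec.replicas=10 and a 20% weight yields canary=2, stable=8
	canary, stable := desiredCounts(10, 20)
	fmt.Println(canary, stable) // 2 8
}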

@jessesuen
Member Author

jessesuen commented Jul 15, 2021

The second change which is necessary is that we cannot immediately setWeight to 0 upon abort; we need to wait until the stable is scaled back up.

func (c *rolloutContext) reconcileTrafficRouting() error {
	reconciler, err := c.newTrafficRoutingReconciler(c)
	if err != nil {
		return err
	}
	if reconciler == nil {
		return nil
	}
	c.log.Infof("Reconciling TrafficRouting with type '%s'", reconciler.Type())

	var canaryHash, stableHash string
	if c.stableRS != nil {
		stableHash = c.stableRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
	}
	if c.newRS != nil {
		canaryHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
	}
	err = reconciler.UpdateHash(canaryHash, stableHash)
	if err != nil {
		return err
	}

	currentStep, index := replicasetutil.GetCurrentCanaryStep(c.rollout)
	desiredWeight := int32(0)
	if rolloututil.IsFullyPromoted(c.rollout) {
		// when we are fully promoted, the desired canary weight should be 0
	} else if c.pauseContext.IsAborted() {
		// set canary weight to: 100 - (100 * <available stable replicas> / spec.replicas)
		// when promote is aborted, desired canary weight should be 0  <<<<<<<<<<<< this will no longer be true
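
For illustration, a minimal sketch of the abort-weight calculation described in the comment above (a hypothetical helper, not the actual controller code):

package sketch

// desiredAbortWeight returns the canary traffic weight to use while an abort is in
// progress: traffic is only shifted back to the stable as fast as stable pods
// become available.
func desiredAbortWeight(availableStableReplicas, specReplicas int32) int32 {
	if specReplicas == 0 {
		return 0
	}
	return 100 - (100*availableStableReplicas)/specReplicas
}

// e.g. with spec.replicas=4: 2/4 stable available => weight 50; 4/4 available => weight 0.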

@perenesenko
Member

@jessesuen I can work on this issue

@perenesenko
Member

PR: #1382

@bpoland
Contributor

bpoland commented Aug 9, 2021

Cross-post in case anyone here has thoughts :) #1382 (comment)

@perenesenko
Member

After discussing this at a standup meeting, it was decided not to implement this feature with a dynamicStableScale flag. There are a lot of open questions around scaling up and abort.
It was decided to implement it the way described in the first comment of this issue, using the setStableScale configuration. This gives more control and confidence in what's going on during the rollout promotion.

- setStableScale:
    matchTrafficWeight: true  # it would be inverse of canary weight (default when using basic canary)
- setStableScale:
    weight: 100   # default behavior of canary using trafficRouting
- setStableScale:
    replicas: 4     # not sure anyone would need this option, but can have it to be consistent with setCanaryScale

In a second iteration, we may implement it in a dynamic way.

So I'm going to close the current PR.

@bpoland
Contributor

bpoland commented Aug 16, 2021

Thanks for the update -- makes sense to me. Appreciate the consideration here, and the transparency! I think this will definitely address my concerns :)

@jessesuen
Member Author

It was decided to implement it the way described in the first comment of this issue, using the setStableScale configuration. This gives more control and confidence in what's going on during the rollout promotion.

@perenesenko - I'd like to understand the desire for setStableScale steps vs. a dynamicStableScale: true flag. I actually considered the former more confusing, which is why I favored the idea of a true/false switch.

@bpoland
Contributor

bpoland commented Aug 17, 2021

@perenesenko - I'd like to understand the desire for setStableScale steps vs. a dynamicStableScale: true flag. I actually considered the former more confusing, which is why I favored the idea of a true/false switch.

@jessesuen I assume it was related to the discussion starting here: #1382 (comment)

@jessesuen
Member Author

jessesuen commented Aug 17, 2021

I saw that, but I don't think changing the feature to use canary step syntax would alleviate any confusion.

The discussion in #1382 (comment) revealed a flaw in the original approach. The original approach derived the stable scale from the inverse of the canary scale. So in the following example:

- setWeight: 0
- setCanaryScale:
    weight: 100
- pause: {}

When we were at the pause step, the implementation had scaled down the stable because canary scale was set to 100.

However, it turns out that the stable scale should be calculated differently. Using this feature (dynamic stable scaling), the stable scale should be the inverse of the traffic weight, not the canary scale.

So using the same example above, upon reaching the pause step, we would have both the stable scaled to 100% and the canary scaled to 100%. This is because the traffic weight was still at 0%.

I still think we should have a single flag to control this feature, but change the implementation to derive the stable scale from the inverse of the current traffic weight rather than the current canary scale.

@bpoland
Contributor

bpoland commented Aug 17, 2021

Yep that's essentially what I was thinking too. That would certainly make for a cleaner set of canary steps, although it's slightly less configurable.

The only thing I'd add is to make sure that maxUnavailable is respected throughout the rollout. In an extreme example, going from setWeight: 0 straight to setWeight: 99 should wait for the canary pods to start and be accepting traffic before scaling down the stable pods.

@perenesenko
Member

Hi @jessesuen,

Sorry for the long text. Anyway, we can discuss this over a video chat; we have a meeting with Alex about this today at 8 AM that you can join, or we can schedule another one.
Here are my thoughts on the implementation options.

dynamicStableScale

With the dynamicStableScale flag we will have a couple of complicated places in our implementation.

  1. When we set the weight to 80, we first reconcile the number of pods; only once that is established do we change the traffic. So there is a window where we send 100% of the traffic to the scaled-down stable pods.

Of course, we could shift the traffic gradually during the scaling process, but that is additional logic in the reconcileTrafficRouting code.

  2. The second case happens with Nginx. When we scale the canary to 100%, the number of stable pods automatically goes to 0, and in that case Nginx stops serving traffic (it's a documented issue).

In this case we could also provide some tricky implementation to keep at least 1 pod alive for some time, or even change the order and switch to the new stable before scaling down to 0, but again the logic would be specific to this case.

  3. One more thing with this approach is that there is less control over the canary process and a lack of understanding of what's going on under the hood.

  4. Another thing: the scaleDownDelaySeconds parameter does not work in this case, because on full promotion the number of stable pods is set to 0, so a delay makes no sense when there are 0 stable pods.

setStableScale

As for the approach with setStableScale, the pros are:

  1. It's clear to understand: if you set the stable replicas to 5, you get exactly that.
  2. The problem with Nginx can be solved by setting scaleDownDelaySeconds, and it will work.
  3. We can still get a scenario similar to dynamicStableScale by setting setStableScale.matchTrafficWeight=true first.

e.g. a scenario like dynamicStableScale=true:

- setStableScale:
    matchTrafficWeight: true
- setWeight: 20 # this will scale the stable down to 80%
- pause: {}
- setWeight: 40 # this will scale the stable down to 60%
- pause: {}

So, to summarize:

The setStableScale approach looks a little easier to implement, as it largely repeats the logic of setCanaryScale. It's easier to understand and gives a lot of control to customers. In contrast, dynamicStableScale would require more "specific" logic at different levels.

We can implement this approach in the nearest release and see how the feature works. We can then gather feedback and, if still needed, implement the dynamicStableScale approach later.

@jessesuen
Member Author

jessesuen commented Aug 18, 2021

I don't agree with the argument that using a setStableScale step is easier to understand than having a dynamicStableScale flag. In fact, I think it is harder to understand because you have to take into consideration which step you are on to know how the stable will be scaled. A single dynamicStableScale flag is easy to explain -- the stable will always be scaled to the inverse of the traffic weight.

Second, having setStableScale exposes users to the possibility of shooting themselves in the foot if they do not use it correctly. For example, this will cause an outage:

- setStableScale:
    weight: 0
- pause: {}

With a dynamicStableScale boolean, we would never allow this to happen since we base the stable scale off the inverse of the traffic weight.

We can implement this approach in the nearest release and see how the feature is working. We can get feedback then and if still needed we can implement dynamicStableScale approach later.

I also do not think it is a good idea to implement two approaches to accomplish similar features (i.e. implement setStableScale now, and dynamicStableScale later). This will only add to the confusion.

Finally, the implementation for dynamic stable scaling is in many ways simpler than the original approach. It comes down to the following calculations:

  1. During an update, the stable scale should always follow the desired canary traffic weight (and not the canary replica scale):
<spec.replicas> * (100 - <desired canary traffic weight>) / 100

Caveat: setting the stable scale should always wait until the traffic weight is set before ever reducing the number of stable pods, so that we never scale down prematurely. One way to accomplish this is to have the stable scale follow the traffic weight (e.g. we set the traffic weight first, then scale down the stable in a subsequent reconciliation when we notice the stable scale is too high). A sketch of this is below.

  2. During an abort we only need to do these calculations:
  • the stable scale should be 100% (i.e. <spec.replicas>)
  • the traffic weight should be calculated based on stable pod availability:
100 - ( (100 * <available stable pods>) / <spec.replicas> )
  • the canary scale calculation does not change drastically from the current implementation. It will use our current logic to honor abortScaleDownDelay, but it will need to wait to do this until traffic is back to 100% stable, or else we may scale down the canary prematurely.

NOTE that in the case of abort, users may want canary to scale down dynamically (similar to the stable), so in the future, we may need a flag like abortDynamicCanaryScale to allow the canary to dynamically scale down during an abort.
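
To make the ordering caveat above concrete, here is a rough sketch of one reconciliation pass (the function and parameter names are hypothetical, not the controller's API):

package sketch

import "math"

// reconcileDynamicStableScale shifts traffic only once enough canary pods are
// available, and has the stable follow the weight that was actually applied, so
// the stable is never scaled down prematurely.
func reconcileDynamicStableScale(specReplicas, desiredWeight, actualWeight, availableCanary int32) (stableReplicas, newActualWeight int32) {
	desiredCanary := int32(math.Ceil(float64(specReplicas) * float64(desiredWeight) / 100))
	newActualWeight = actualWeight
	if availableCanary >= desiredCanary {
		// enough canary pods are available: the traffic object can be updated
		newActualWeight = desiredWeight
	}
	// the stable only ever tracks the applied (actual) weight
	stableReplicas = int32(math.Ceil(float64(specReplicas) * float64(100-newActualWeight) / 100))
	return stableReplicas, newActualWeight
}

Because the stable count is derived from the applied weight rather than the desired one, a setWeight step can never shrink the stable before traffic has actually shifted.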

@jessesuen
Member Author

jessesuen commented Aug 19, 2021

@perenesenko - I met with @alexmt earlier today to discuss. I think there was a misunderstanding between you and him that the traffic weight shift needs to be gradual whenever there is a weight change. Given the following steps:

- setWeight: 0
- pause: {duration: 1h}
- setWeight: 100

it is not required to gradually increase from 0 to 100 based on available canary pods. In fact, that is the wrong behavior because it prevents the ability to perform blue-green using the canary strategy. If the user desires a gradual weight shift, they need to be explicit about it with multiple setWeight steps, e.g.:

- setWeight: 10
- pause: {duration: 1h}
- setWeight: 20
- pause: {duration: 1h}
- setWeight: 30
- pause: {duration: 1h}
...
- setWeight: 100

Given this requirement, maxSurge and maxUnavailable do not matter for this feature. We will not honor them; we will scale the canary or stable ReplicaSets as necessary before shifting weights.

I plan on doing a quick PoC to provide a skeleton framework implementation of what I described in my previous comment, but it may not cover corner cases.

@jessesuen
Member Author

Here is a working PoC:

#1430

@bpoland
Contributor

bpoland commented Aug 19, 2021

Given this requirement, maxSurge and maxUnavailable do not matter for this feature. We will not honor them; we will scale the canary or stable ReplicaSets as necessary before shifting weights.

Does this mean it will wait for the newly-scaled pods to be marked as ready before shifting traffic (is that the difference between setWeight and actualWeight)? That was my main concern with my maxUnavailable comment earlier.

@jessesuen
Member Author

Does this mean it will wait for the newly-scaled pods to be marked as ready before shifting traffic (is that the difference between setWeight and actualWeight)? That was my main concern with my maxUnavailable comment earlier.

Correct. Consider the following Rollout and steps:

spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}

The sequence of events would be:

  1. At beginning of update (steady state) we are at:
    stable: 4, canary: 0, desiredWeight: 0, actualWeight: 0
  2. When processing the first step, we scale to:
    stable: 4, canary 3, desiredWeight: 75, actualWeight: 0
  3. As soon as the canary reaches an availability of 3, we update the traffic object (e.g. VirtualService) weight:
    stable: 4, canary: 3, desiredWeight: 75, actualWeight: 75.
    Now that the actual weight has been set to 75, we have completed the first step.
  4. On the next reconciliation, the controller notices that the stable is scaled too high (because dynamicStableScale is enabled). The stable is scaled down to the inverse of actualWeight:
    stable: 1, canary: 3, desiredWeight: 75, actualWeight: 75

@bpoland
Contributor

bpoland commented Aug 19, 2021

Perfect. This will be a great feature to have!

@jessesuen
Member Author

jessesuen commented Aug 20, 2021

For posterity, I'd like to document the abort behavior. Given the same Rollout as previously:

spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}

Consider the scenario when we abort the Rollout when it is at the pause step. The sequence of events would be:

  1. Before the abort, at the pause step, we started at:
    stable: 1, canary: 3, desiredWeight: 75, actualWeight: 75
  2. Upon abort, we immediately scale up stable to 100%. The desired weight will become 0, but actual weight remains untouched initially:
    stable: 4, canary: 3, desiredWeight: 0, actualWeight: 75
  3. As the stable pods become available, we adjust traffic weight according to stable pod availability. Let's say 2/4 stable pods are available. The actual weight would be set to 50.
    stable: 4, canary: 3, desiredWeight: 0, actualWeight: 50
  4. If 3/4 stable pods become available, the canary weight will continue to reduce.
    stable: 4, canary: 3, desiredWeight: 0, actualWeight: 25
  5. Finally, once all 4 stable pods become available, the canary weight will go back to 0.
    stable: 4, canary: 3, desiredWeight: 0, actualWeight: 0
  6. Now that the traffic weight is back to 0, the Rollout will honor abortScaleDownDelay and scale down the canary after some time.
    stable: 4, canary: 0, desiredWeight: 0, actualWeight: 0

In #1029 (comment), I also suggested that users may want to scale the canary dynamically as traffic shifts away from it during the abort. In the steps described above, notice that the canary remained at size 3 until abortScaleDownDelay. However, environments which do not have the capacity to run both the canary and the stable at high scale will need a flag like abortDynamicCanaryScale to scale down the canary as traffic shifts back to the stable.

@jessesuen
Member Author

jessesuen commented Aug 26, 2021

NOTE that in the case of abort, users may want canary to scale down dynamically (similar to the stable), so in the future, we may need a flag like abortDynamicCanaryScale to allow the canary to dynamically scale down during an abort.

However, environments which do not have the capacity to run both the canary and the stable at high scale will need a flag like abortDynamicCanaryScale to scale down the canary as traffic shifts back to the stable.

Regarding my comments above, instead of introducing yet another knob (abortDynamicCanaryScale), I think it is better to make that behavior the default. I think most users who want the stable ReplicaSet to dynamically scale down as the traffic weight increases to the canary also want the canary to scale down dynamically as traffic increases to the stable. For that reason we should combine the two requests into one.

However, if users want the canary to remain scaled up, they can explicitly set a value for abortScaleDownDelaySeconds, which we can honor.

spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}

So the updated sequence of events during an abort using dynamic scaling is:

  1. Before the abort, at the pause step, we started at:
    stable: 1/1, canary: 3/3, desiredWeight: 75, actualWeight: 75

  2. Upon abort, we immediately scale up stable to 100%. The desired weight will become 0, but actual weight remains untouched initially:
    stable: 1/4, canary: 3/3, desiredWeight: 0, actualWeight: 75

  3. As the stable pods become available, we adjust traffic weight according to stable pod availability. Let's say 2/4 stable pods are available. The actual weight would be set to 50.
    stable: 2/4, canary: 3/3, desiredWeight: 0, actualWeight: 50

  4. Now that actual weight decreased to 50, the canary scale is reduced to 2
    stable: 2/4, canary: 2/2, desiredWeight: 0, actualWeight: 50

  5. When 3/4 stable pods become available, the canary weight will continue to reduce:
    stable: 3/4, canary: 2/2, desiredWeight: 0, actualWeight: 25

  6. And the canary scale keeps following actualWeight:
    stable: 3/4, canary: 1/1, desiredWeight: 0, actualWeight: 25

  7. Finally, once all 4 stable pods become available, the canary weight will go back to 0.
    stable: 4/4, canary: 1/1, desiredWeight: 0, actualWeight: 0

  8. Once actual weight reaches 0, canary scale is reduced to 0
    stable: 4/4, canary: 0/0, desiredWeight: 0, actualWeight: 0

@jessesuen jessesuen self-assigned this Aug 27, 2021
@bpoland
Contributor

bpoland commented Aug 31, 2021

I've been thinking a bit more about our earlier discussion and commented on another ticket: #430 (comment)

I wonder whether, in the context of this change, there could be an option to respect maxSurge/maxUnavailable across both the stable and canary ReplicaSets, which would help prevent a thundering herd of pods. So during the initial rollout, only up to maxSurge new canary pods could be started. Then once those become ready, the stable RS would scale down that number of pods, and maxSurge more canary pods could be started, etc.

@jessesuen
Member Author

This won't help during a promote-full scenario or during an initial deploy, but during subsequent updates, it's possible to gradually bring up the canary before the 100% traffic shift. Example:

steps:
- setCanaryScale:
    weight: 10
- setCanaryScale:
    weight: 20
...
- setCanaryScale:
    weight: 100
- setWeight: 100 # now that we've brought up canary weight gradually to prevent thundering herd, shift weight.
...

I wonder whether, in the context of this change, there could be an option to respect maxSurge/maxUnavailable across both the stable and canary ReplicaSets, which would help prevent a thundering herd of pods

I understand the request, but it would need to come in as a separate enhancement since the current PR for this feature is not aimed at that problem, and we are no worse off than the current situation. Also, maxSurge would be the wrong concept/terminology to reuse because surge is defined as the amount over the desired replica count that the controller is allowed to create. I don't think we want two definitions of maxSurge and have it mean something different in the case of an initial deploy / full promotion.

That said, I also feel it would be desirable to mitigate the thundering-herd startup problem outside of Rollouts. For example, the same problem happens with Deployments, which would not benefit from an improvement here. There may also be tricks that can be done with init containers and semaphores. For example, a redis server could be used as a distributed semaphore to prevent more than N concurrent accesses to the database during startup.
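
For example, a rough sketch of such a startup semaphore in an init container, using a redis counter via the go-redis client (the key name, limit, address, and backoff are all assumptions for illustration; a production version would also need a TTL/lease so a crashed pod cannot leak its slot):

package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	semaphoreKey  = "startup-semaphore" // hypothetical key shared by all pods
	maxConcurrent = 5                   // allow at most N pods to do startup work at once
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"}) // hypothetical address

	// Acquire a slot: atomically increment the counter and back off while over the limit.
	for {
		n, err := rdb.Incr(ctx, semaphoreKey).Result()
		if err != nil {
			log.Fatal(err)
		}
		if n <= maxConcurrent {
			break // slot acquired
		}
		rdb.Decr(ctx, semaphoreKey) // over the limit: give the slot back and retry
		time.Sleep(2 * time.Second)
	}
	defer rdb.Decr(ctx, semaphoreKey) // release the slot when startup work finishes

	// ... expensive startup work goes here (e.g. warming caches, initial DB queries) ...
}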

@bpoland
Contributor

bpoland commented Sep 1, 2021

Fair enough, that makes sense. Appreciate the detailed response!
