Ability to control stable scale when using traffic routing #1029
Hey, came in to create a ticket for this and then saw it's already here :)
Thinking about this some more, we probably don't need the flexibility for this to be implemented as a new canary step type and can simply have a boolean or constant values to indicate the behavior of the stable scale. e.g.:

```yaml
spec:
  strategy:
    canary:
      dynamicStableScale: true
```

or

```yaml
spec:
  strategy:
    canary:
      stableScale: Full | Dynamic # Names are TBD (Full == existing behavior of leaving stable scaled to 100%, Dynamic == inverse of canary weight)
```
The first change is that we recalculate the stable value returned during an update. Currently we return 100%.

```go
func CalculateReplicaCountsForCanary(rollout *v1alpha1.Rollout, newRS *appsv1.ReplicaSet, stableRS *appsv1.ReplicaSet, oldRSs []*appsv1.ReplicaSet) (int32, int32) {
	rolloutSpecReplica := defaults.GetReplicasOrDefault(rollout.Spec.Replicas)
	replicas, weight := GetCanaryReplicasOrWeight(rollout)
	if replicas != nil {
		return *replicas, rolloutSpecReplica
	}
	desiredStableRSReplicaCount := int32(math.Ceil(float64(rolloutSpecReplica) * (1 - (float64(weight) / 100))))
	desiredNewRSReplicaCount := int32(math.Ceil(float64(rolloutSpecReplica) * (float64(weight) / 100)))
	if rollout.Spec.Strategy.Canary.TrafficRouting != nil {
		if !rollout.Spec.Strategy.Canary.DynamicStableScale {
			// not using dynamic stable scaling: keep the stable fully scaled (existing behavior)
			return desiredNewRSReplicaCount, rolloutSpecReplica
		}
		// using dynamic stable scaling: fall through to the basic canary calculation below
	}
```
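To make the effect of the flag concrete, here is a minimal, self-contained sketch (the `desiredCounts` helper is hypothetical, not part of the rollout codebase) comparing what the two modes would return for a Rollout with 10 replicas at a 20% canary weight:

```go
package main

import (
	"fmt"
	"math"
)

// desiredCounts mimics the calculation above: given the spec replica count,
// the current canary weight, and whether dynamic stable scaling is enabled,
// it returns (canary replicas, stable replicas).
func desiredCounts(specReplicas int32, weight int32, dynamicStableScale bool) (int32, int32) {
	canary := int32(math.Ceil(float64(specReplicas) * float64(weight) / 100))
	if !dynamicStableScale {
		// existing traffic-routing behavior: stable stays fully scaled
		return canary, specReplicas
	}
	// dynamic behavior: stable is the inverse of the weight
	stable := int32(math.Ceil(float64(specReplicas) * (1 - float64(weight)/100)))
	return canary, stable
}

func main() {
	for _, dynamic := range []bool{false, true} {
		canary, stable := desiredCounts(10, 20, dynamic)
		fmt.Printf("dynamicStableScale=%v: canary=%d stable=%d\n", dynamic, canary, stable)
	}
	// Output:
	// dynamicStableScale=false: canary=2 stable=10
	// dynamicStableScale=true: canary=2 stable=8
}
```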
The second change which is necessary is that we cannot immediately setWeight to 0 upon abort; we need to wait until the stable is scaled back up.

```go
func (c *rolloutContext) reconcileTrafficRouting() error {
	reconciler, err := c.newTrafficRoutingReconciler(c)
	if err != nil {
		return err
	}
	if reconciler == nil {
		return nil
	}
	c.log.Infof("Reconciling TrafficRouting with type '%s'", reconciler.Type())

	var canaryHash, stableHash string
	if c.stableRS != nil {
		stableHash = c.stableRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
	}
	if c.newRS != nil {
		canaryHash = c.newRS.Labels[v1alpha1.DefaultRolloutUniqueLabelKey]
	}
	err = reconciler.UpdateHash(canaryHash, stableHash)
	if err != nil {
		return err
	}

	currentStep, index := replicasetutil.GetCurrentCanaryStep(c.rollout)
	desiredWeight := int32(0)
	if rolloututil.IsFullyPromoted(c.rollout) {
		// when we are fully promoted, desired canary weight should be 0
	} else if c.pauseContext.IsAborted() {
		// when promote is aborted, desired canary weight used to be 0
		// <<<<<<<<<<<< this will no longer be true with dynamic stable scaling.
		// Instead, set the canary weight to be:
		//   1 - (total number of available stable replicas / spec.replicas)
	}
```
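As a rough sketch of that abort-time calculation (the `abortDesiredCanaryWeight` helper and its inputs are assumptions for illustration, not the actual controller code), the desired canary weight could be derived from how much of the stable is available again:

```go
// abortDesiredCanaryWeight returns the traffic weight left on the canary while the
// stable ReplicaSet is being scaled back up during an abort. Traffic is shifted back
// to the stable only in proportion to how many stable pods are actually available,
// so we never route traffic to stable pods that do not exist yet.
func abortDesiredCanaryWeight(availableStableReplicas, specReplicas int32) int32 {
	if specReplicas == 0 {
		return 0
	}
	weight := 100 - (availableStableReplicas*100)/specReplicas
	if weight < 0 {
		weight = 0
	}
	return weight
}
```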
@jessesuen I can work on this issue
PR: #1382
Cross-post in case anyone here has thoughts :) #1382 (comment)
After discussing on a standup meeting, it was decided not to implement this feature with a
In a second iteration we may implement it in a dynamic way. So I'm going to cancel the current PR.
Thanks for the update -- makes sense to me. Appreciate the consideration here, and the transparency! I think this will definitely address my concerns :)
@perenesenko - I'd like to understand the desire of having
@jessesuen I assume it was related to the discussion starting here: #1382 (comment)
I saw that, but I don't think changing the feature to use canary step syntax would alleviate any confusion. The discussion in #1382 (comment) revealed a flaw in the original approach. The original approach decided the scale of the stable based on the inverse of the canary scale. So in the following example:

```yaml
- setWeight: 0
- setCanaryScale:
    weight: 100
- pause: {}
```

When we were at the pause step, the implementation had scaled down the stable because canary scale was set to 100. However, it turns out that the stable scale should be calculated differently. Using this feature (dynamic stable scaling), the stable scale should be the inverse of the traffic weight, not the canary scale. So using the same example above, upon reaching the pause step, we would have both the stable scaled to 100% and the canary scaled to 100%. This is because the traffic weight was still at 0%. I still think we should have a single flag to control this feature, but change the implementation to derive the stable scale based on the inverse of the current traffic weight rather than the current canary scale.
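To spell out the difference (a sketch only; the helper and the 10-replica figure are illustrative, not controller code): with `setWeight: 0` followed by `setCanaryScale: {weight: 100}`, deriving the stable scale from the inverse of the canary scale drains the stable while it is still serving all traffic, whereas deriving it from the inverse of the traffic weight keeps it fully scaled:

```go
package main

import "fmt"

// ceilPct returns the integer ceiling of replicas * pct%.
func ceilPct(replicas, pct int32) int32 {
	return (replicas*pct + 99) / 100
}

func main() {
	specReplicas := int32(10)
	canaryScalePct := int32(100) // from setCanaryScale
	trafficWeight := int32(0)    // from setWeight

	// flawed original approach: stable = inverse of canary scale
	stableFromCanaryScale := ceilPct(specReplicas, 100-canaryScalePct)
	// corrected approach: stable = inverse of traffic weight
	stableFromTrafficWeight := ceilPct(specReplicas, 100-trafficWeight)

	fmt.Println(stableFromCanaryScale)   // 0  -> outage risk: all traffic still points at the stable
	fmt.Println(stableFromTrafficWeight) // 10 -> stable stays fully scaled until weight shifts
}
```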
Yep that's essentially what I was thinking too. That would certainly make for a cleaner set of canary steps, although it's slightly less configurable. The only thing I'd add is to make sure that
Hi @jessesuen, sorry for the long text I put here. Anyway, we can have a meeting to discuss that by video chat. We have a meeting with Alex about this today at 8 AM; you can join or we can have another one.

dynamicStableScale

With the
Of course, we can match the traffic gradually during the scaling process, but it's additional logic to the
In this case we also can provide some tricky implementation to have at least 1 pod alive for some time, or even change the order of switching to the new stable before scaling down to 0, but again the logic will be specifically for this case.

setStableScale

What about the approach with
e.g. Scenario like a

So to summarize

The
We can implement this approach in the nearest release and see how the feature is working. We can get feedback then and if still needed we can implement
I don't agree with the argument that using a

Second, having a setStableScale exposes users to the possibility of shooting themselves in the foot if they do not use it correctly. For example, this will cause an outage:

```yaml
- setStableScale:
    weight: 0
- pause: {}
```

With a dynamicStableScale boolean, we would never allow this to happen since we will base it off the inverse of the traffic weight.

I also do not think it is a good idea to implement two approaches to accomplish similar features (i.e. implement setStableScale now, and dynamicStableScale later). This will only add to the confusion.

Finally, the implementation for dynamicScaling is in many ways simpler than the original approach. It comes down to the following calculations:

Caveat: setting the stable scale should always wait until the traffic weight is set before ever reducing the amount of stable pods, so that we never scale down prematurely. One way to accomplish this is to have the stable scale follow the traffic weight (e.g. we set the traffic weight first, then scale down the stable in a subsequent reconciliation when we notice the stable scale is too high). A sketch of this ordering follows below.

NOTE that in the case of abort, users may want the canary to scale down dynamically (similar to the stable), so in the future, we may need a flag like
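A minimal sketch of that "stable scale follows the traffic weight" ordering (hypothetical helper names and callbacks, not the actual reconciler code): traffic is shifted first, and the stable is only scaled down on a later reconciliation, once the already-applied weight no longer needs those pods:

```go
// reconcileDynamicStableScale illustrates the ordering constraint: the traffic weight is
// updated first, and the stable is only scaled down on a subsequent reconciliation, after
// we observe that the currently-set weight no longer requires the extra stable pods.
func reconcileDynamicStableScale(specReplicas, currentTrafficWeight, desiredTrafficWeight, stableReplicas int32,
	setWeight func(int32), scaleStable func(int32)) {

	// 1. shift traffic toward the canary first
	if currentTrafficWeight != desiredTrafficWeight {
		setWeight(desiredTrafficWeight)
		return // scale-down happens on a later reconciliation
	}

	// 2. only then reduce the stable to the inverse of the weight already in effect
	minStable := specReplicas - specReplicas*currentTrafficWeight/100
	if stableReplicas > minStable {
		scaleStable(minStable)
	}
}
```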
@perenesenko - I met with @alexmt earlier today to discuss. I think there was a misunderstanding that you and he had that the traffic weight shift needs to be gradual whenever there is a weight change. Given the following steps:

```yaml
- setWeight: 0
- pause: {duration: 1h}
- setWeight: 100
```

it is not required to gradually increase from 0 to 100 based on available canary pods. In fact, that is the wrong behavior because it prevents the ability to perform blue-green using the canary strategy. If the user desires a gradual weight shift, they need to be explicit about this with multiple setWeight steps, e.g.:

```yaml
- setWeight: 10
- pause: {duration: 1h}
- setWeight: 20
- pause: {duration: 1h}
- setWeight: 30
- pause: {duration: 1h}
...
- setWeight: 100
```

Given this requirement, maxSurge and maxUnavailable do not matter in this feature. We will not honor them. We will scale the canary or stable ReplicaSets as necessary before shifting weights.

I plan on doing a quick PoC to provide a skeleton framework implementation of what I described in my previous comment, but it may not cover corner cases.
Here is a working PoC: |
Does this mean it will wait for the newly-scaled pods to be marked as |
Correct. Consider the following Rollout and steps:

```yaml
spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}
```

The sequence of events would be:
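For a rough sense of the numbers involved (assuming the ceiling-based calculation shown earlier in this thread): with 4 replicas and a 75% weight, the canary would be scaled to ceil(4 × 0.75) = 3, and once the traffic has actually shifted, the stable could be scaled down to ceil(4 × 0.25) = 1.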
Perfect. This will be a great feature to have!
For posterity, I'd like to document the abort behavior. Given the same Rollout as previously:

```yaml
spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}
```

Consider the scenario where we abort the Rollout when it is at the pause step. The sequence of events would be:

In #1029 (comment), I also suggested that the user may need to scale the canary dynamically as traffic shifts away from it during the abort. In the steps described above, notice that the canary remained at size 3 until abortScaleDownDelay. However, in environments which do not have the capacity to run both the canary and stable at high scales, they will need a flag like
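As a hedged illustration of that abort behavior (using the `1 - availableStable / spec.replicas` rule quoted earlier; the exact progression is assumed, not quoted from the thread): as the stable scales back up from 1 to 4 available replicas, the canary weight would step down roughly 75% → 50% → 25% → 0%, with the canary pods themselves only removed after abortScaleDownDelay.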
Regarding my comments above, instead of introducing yet another knob

However, if the user wants the canary to remain scaled up, they can explicitly set a value for abortScaleDownDelaySeconds, which we can honor:

```yaml
spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true
      steps:
      - setWeight: 75
      - pause: {}
```

So the updated sequence of events during abort using dynamic scaling is:
I've been thinking a bit more about our earlier discussion and commented on another ticket: #430 (comment). I wonder whether, in the context of this change, there could be an option to respect maxSurge/maxUnavailable across both the stable and canary ReplicaSets, which would help prevent a thundering herd of pods. So during the initial rollout, only up to
This won't help during a promote-full scenario or during an initial deploy, but during subsequent updates, it's possible to gradually bring up the canary before the 100% traffic shift. Example:

```yaml
steps:
- setCanaryScale:
    weight: 10
- setCanaryScale:
    weight: 20
...
- setCanaryScale:
    weight: 100
- setWeight: 100 # now that we've brought up canary weight gradually to prevent thundering herd, shift weight.
...
```

I understand the request. But it would need to come in as a separate enhancement, since the current PR for this feature is not aimed at addressing that problem, and we are no worse off than the current situation. Also,

That said, I also feel it would be desirable to mitigate the thundering herd startup problem outside of Rollouts. For example, the same problem happens with Deployments, which would not benefit from an improvement here. Also there may be tricks that can be done with init containers and semaphores. For example, a redis server could be used as a distributed semaphore to prevent more than N concurrent accesses to the database during startup.
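As a rough sketch of that init-container-plus-semaphore idea (assuming the go-redis client and a hypothetical `startup-semaphore` key; a production version would need TTLs and crash cleanup), the gate could look like:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/redis/go-redis/v9"
)

// A naive counting semaphore on top of redis, as an init-container sketch:
// at most maxConcurrent pods proceed with startup work against the database at any one time.
func main() {
	const (
		semaphoreKey  = "startup-semaphore" // hypothetical key name
		maxConcurrent = 5
	)
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("REDIS_ADDR")})

	for {
		n, err := rdb.Incr(ctx, semaphoreKey).Result()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if n <= maxConcurrent {
			break // acquired a slot; init container exits 0 and the app starts
		}
		// over the limit: give the slot back and retry after a short wait
		rdb.Decr(ctx, semaphoreKey)
		time.Sleep(5 * time.Second)
	}
	// NOTE: a real implementation also needs to release the slot (Decr) once the app is
	// warmed up, plus a TTL or cleanup job to tolerate pods that crash while holding a slot.
}
```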
Fair enough, that makes sense. Appreciate the detailed response!
Summary
When using traffic routing with canary, the stable stack is never scaled down until the rollout is fully promoted. This means that we eventually have double the number of total replicas in the environment (when the traffic weight reaches 100). The reasons for doing so are explained here: #430 (comment)
However, some users have requested that the stable stack be scaled inversely to the canary weight, in order to save on resources (e.g. a baremetal cluster where resource provisioning cannot be doubled). For these users, it is acceptable to roll back slowly during an abort, and they would be willing to sacrifice instantaneous rollbacks.
This proposal is to offer a setStableScale option, which would set the number of stable replicas to be the inverse of the canary weight.

Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.