
feat(controller): Capping the replica count during a canary deployment #1743

Closed

Conversation

johnniee

@johnniee johnniee commented Jan 4, 2022

This PR is following an issue we encountered during our adoption of ArgoRollouts.
There is an open issue I created here:
#1707

@sonarqubecloud

sonarqubecloud bot commented Jan 4, 2022

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 0 (rating A)
- Coverage: no coverage information
- Duplication: 0.0%

@jessesuen
Member

The behavior of rounding up was actually by design and we have tests that explicitly test for this (which are now failing in this PR):

TestCalculateReplicaCountsForCanary/Add_an_extra_replica_to_surge_when_the_setWeight_rounding_adds_another_instance

So this is a breaking change, and we will need to carefully consider whether it can be accepted, weighing how different the new behavior is against the actual improvements (e.g. whether the new calculations approximate the desired canary weights better).

Could you provide some example numbers on the differences in the calculation from before and after to help decide?

Also, as you may or may not be aware, there is a workaround: it's possible to avoid the rounding behavior, but only if you use canary weights that divide evenly into your replica count. e.g. if you have a replica count of 4, you can use canary weights of 25, 50, and 75. Similarly, a replica count of 5 could use canary weights of 20, 40, 60, and 80. Unfortunately, this does not work for replica counts that do not divide 100 evenly, e.g. you could not use a replica count of 3, because the rollout would choose to round up.
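To make the rounding behavior concrete, here is a minimal Python sketch (not the actual controller code, which is Go) of a weight-based canary calculation that rounds up, illustrating why 4 replicas with weights 25/50/75 need no surge while 3 replicas with weight 50 do:

```python
import math

def canary_replicas_round_up(spec_replicas: int, weight: int) -> int:
    # Simplified illustration of round-up weight calculation:
    # the canary count is ceil(replicas * weight / 100).
    return math.ceil(spec_replicas * weight / 100)

# With 4 replicas, weights 25/50/75 divide evenly, so no surge is needed:
for w in (25, 50, 75):
    print(w, canary_replicas_round_up(4, w))  # 1, 2, 3 canary pods

# With 3 replicas and a weight of 50, 1.5 rounds up to 2 canary pods;
# with 2 stable pods still running, the total briefly exceeds spec.replicas:
print(canary_replicas_round_up(3, 50))  # 2
```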

@liorfranko

Hi,

The use-case we're trying to solve is performing a Canary rollout to kafka consumers.
And during a Canary, the total number of pods must remain the same.

In Kafka, the number of consumer pods should correlate with the number of partitions configured on the Kafka topic, and the number of partitions on a topic should in turn correlate with the number of Kafka brokers in the cluster. So changing the number of consumer replicas has many dependencies and is not an easy thing to do.

I guess this change in behavior will pop up again for other companies as well.

@jessesuen
Member

The use-case we're trying to solve is performing a Canary rollout to kafka consumers.
And during a Canary, the total number of pods must remain the same.

I do understand the use case with respect to kafka consumers and ensuring the rollout does not exceed the replica count. This is normally achieved by setting maxSurge to 0. However, I think the issue is that Argo Rollouts does not respect maxSurge when it rounds up to achieve a certain canary weight. I think there is room to improve this: we can round up, but only if doing so does not exceed maxSurge.
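The suggested improvement can be sketched as follows. This is a hypothetical Python illustration (the real controller is Go, and these function names are invented): round the canary count up, but fall back to rounding down when the combined stable + canary total would exceed spec.replicas plus maxSurge.

```python
import math

def canary_replicas(spec_replicas: int, weight: int, max_surge: int) -> int:
    # Hypothetical sketch: round the canary count up, but cap the total so
    # stable + canary never exceeds spec.replicas + max_surge.
    desired = math.ceil(spec_replicas * weight / 100)
    stable = math.ceil(spec_replicas * (100 - weight) / 100)
    allowed = spec_replicas + max_surge
    if desired + stable > allowed:
        # Rounding up would surge past the budget; round down instead.
        desired = math.floor(spec_replicas * weight / 100)
    return desired

print(canary_replicas(3, 50, 0))  # 1: total stays at 3, maxSurge honored
print(canary_replicas(3, 50, 1))  # 2: one surge pod is permitted
```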

@jessesuen
Member

I have created PR #1759 which reworks the basic canary calculation such that maxSurge: 0 will now be honored. With the fix, if a user sets maxSurge: 0 it will ensure that stable + canary replica counts do not exceed spec.replicas. My PR should supersede this one.
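For reference, an illustrative Rollout manifest (values are examples, not taken from this PR) showing the `maxSurge: 0` setting the fix honors; with `maxSurge: 0`, `maxUnavailable` must be at least 1 so pods can be replaced:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: kafka-consumer   # example name
spec:
  replicas: 3
  strategy:
    canary:
      maxSurge: 0        # stable + canary will not exceed spec.replicas
      maxUnavailable: 1  # required so a pod can be replaced when maxSurge is 0
      steps:
        - setWeight: 33
        - pause: {}
```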

@johnniee johnniee closed this Jan 13, 2022
@johnniee johnniee deleted the issue-1707-replicas-capping branch January 13, 2022 07:40