
MaxUnavailable works as a batch size, not as a rate #141

Open
uthark opened this issue Nov 14, 2020 · 2 comments

uthark (Contributor) commented Nov 14, 2020

Is this a BUG REPORT or FEATURE REQUEST?:
Bug
What happened:
If a user configures MaxUnavailable, Keiko uses it as the batch size for the rollout. For example, if it is set to 5, Keiko creates batches of 5 nodes and rolls them one after another.
What you expected to happen:
Keiko should always try to keep MaxUnavailable nodes rolling, i.e. treat MaxUnavailable as the number of nodes being replaced at any point in time.

How to reproduce it (as minimally and precisely as possible):

The downside of this implementation is that a single slow node can delay the rollout by up to 1h.

Relevant code: https://github.com/keikoproj/upgrade-manager/blob/master/controllers/rollingupgrade_controller.go#L477-L481
New instances are selected only after the current batch finishes.
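
To make the difference concrete, here is a minimal Go sketch (purely illustrative, not the actual controller code; `rotateNode` is a hypothetical blocking helper that replaces a single node) contrasting the current batch semantics with the desired rate semantics:

```go
package sketch

import "sync"

// Current behavior: fixed batches of size maxUnavailable; the next batch only
// starts when every node in the previous batch has finished, so one slow node
// leaves up to maxUnavailable-1 slots idle.
func rollInBatches(nodes []string, maxUnavailable int, rotateNode func(string)) {
	for i := 0; i < len(nodes); i += maxUnavailable {
		end := i + maxUnavailable
		if end > len(nodes) {
			end = len(nodes)
		}
		var wg sync.WaitGroup
		for _, n := range nodes[i:end] {
			wg.Add(1)
			go func(n string) { defer wg.Done(); rotateNode(n) }(n)
		}
		wg.Wait() // nothing new is selected until the whole batch completes
	}
}

// Desired behavior: keep up to maxUnavailable replacements in flight at all
// times; as soon as one node finishes, the next one starts.
func rollAtRate(nodes []string, maxUnavailable int, rotateNode func(string)) {
	sem := make(chan struct{}, maxUnavailable) // at most maxUnavailable in flight
	var wg sync.WaitGroup
	for _, n := range nodes {
		sem <- struct{}{} // blocks while maxUnavailable rotations are running
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			defer func() { <-sem }()
			rotateNode(n)
		}(n)
	}
	wg.Wait()
}
```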

Other info: we have some groups with >150 nodes; rollouts take a very long time because of this issue and #140.
Also, I am happy to submit a PR once we agree on a proper fix.

uthark changed the title from "MaxUnavailable works as a batch, not as a percentage." to "MaxUnavailable works as a batch size, not as a rate" on Nov 14, 2020
uthark (Contributor, Author) commented Nov 17, 2020

The proposed solution is the following:

  • Rolling upgrades should continue to replace nodes at a MaxUnavailable rate.
  • In a loop, do the following (see the sketch after this list):
  1. If the number of instances in standby mode != maxUnavailable, put x instances into standby without decrementing the ASG size.
  2. AWS starts new instances as replacements.
  3. Wait for the new nodes to be registered in the cluster and pass all checks.
  4. Drain and terminate a node from standby (use a node selector to determine the AZ, if a uniform AZ node selector is in use) and wait for its termination.
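
A rough Go sketch of that loop (hypothetical helper hooks only, none of these are existing upgrade-manager functions; the real implementation would call the AWS EnterStandby API with ShouldDecrementDesiredCapacity=false and reuse the existing drain logic):

```go
package sketch

import "context"

// Hypothetical hooks standing in for the real logic.
type standbyHooks struct {
	countStandby        func() int                                 // old instances currently in standby
	enterStandby        func(ctx context.Context, id string) error // EnterStandby without decrementing desired capacity
	waitForNewReadyNode func(ctx context.Context) error            // new node registered in the cluster and passing all checks
	drainAndTerminate   func(ctx context.Context) error            // drain one standby node, terminate it, wait for termination
}

// rollWithStandby keeps up to maxUnavailable old instances in standby at all
// times, so the ASG is always launching replacements while old nodes drain.
func rollWithStandby(ctx context.Context, oldInstances []string, maxUnavailable int, h standbyHooks) error {
	remaining := oldInstances
	for len(remaining) > 0 {
		// 1. Top up standby to maxUnavailable without decrementing the ASG size.
		for h.countStandby() < maxUnavailable && len(remaining) > 0 {
			id := remaining[0]
			remaining = remaining[1:]
			if err := h.enterStandby(ctx, id); err != nil {
				return err
			}
		}
		// 2./3. AWS launches replacement instances; wait for a new node to be
		//       registered in the cluster and pass all checks.
		if err := h.waitForNewReadyNode(ctx); err != nil {
			return err
		}
		// 4. Drain and terminate one node from standby, then loop so the next
		//    old instance can be put into standby immediately.
		if err := h.drainAndTerminate(ctx); err != nil {
			return err
		}
	}
	return nil
}
```

The key difference from the current code is that draining one node and launching its replacement overlap, so the replacement rate stays at maxUnavailable even when individual nodes are slow to drain.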

This is a big change from the current eager implementation, so it might be implemented as a new strategy.

Also, this works in a similar way to how Kubernetes upgrades pods with a pod disruption budget.

@shrinandj / @eytan-avisror what do you think about it? Would you accept such a PR?

shrinandj (Collaborator) commented

This approach looks great to me.
