Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance Refresh for EC2 Auto Scaling #929

Closed
1 of 4 tasks
stevehipwell opened this issue Jun 24, 2020 · 17 comments · Fixed by #1224
Closed
1 of 4 tasks

Instance Refresh for EC2 Auto Scaling #929

stevehipwell opened this issue Jun 24, 2020 · 17 comments · Fixed by #1224

Comments

@stevehipwell
Copy link
Contributor

I'm submitting a...

  • bug report
  • feature request
  • support request - read the FAQ first!
  • kudos, thank you, warm fuzzy

What is the current behavior?

When this module makes a change to the worker group launch template an external action is required to update the workers to use the new launch template.

If this is a bug, how to reproduce? Please include a code sample if relevant.

n/a

What's the expected behavior?

I'm interested to see if the new Instance Refresh for EC2 Auto Scaling would be something that could be leveraged by this module to automate worker upgrades?

Currently the best solution I've found for this is the HelloFresh EKS Rolling Update but due to the fact that this is synchronous and time consuming it doesn't seem like a good fit for a Terraform module. I'd be interested to see if Instance Refresh for EC2 Auto Scaling is an alternative that can be embedded in a Terraform module.

Are you able to fix this problem and submit a PR? Link here if you have already.

n/a

Environment details

n/a

Any other relevant info

n/a

@dpiddockcmp
Copy link
Contributor

Blocked first on support being available in the AWS provider. Looks like there's a WIP PR over there: hashicorp/terraform-provider-aws#13791

The wiring for this would be complicated if you want to avoid service interruption. You need to inform kubernetes that a node is about to go away via a drain call. Ideally you need spare capacity or the replacement node already running and fully joined to the cluster to avoid downtime. How would the notification for new-node-ready be linked to something as primitive as this instance refresh? A rolling refresh of this type would potentially cause a lot of pod interruption unless you cordon the whole ASG first. But then what happens if the rollout failed and all your nodes are cordoned?

I think this is outside of the scope of this module. Look at the complexity involved in #937 for setting up node draining.

Maybe some day the managed node groups feature will be useful enough for everybody? Although you still need a system outside of terraform to trigger a rollout of them.

@stevehipwell
Copy link
Contributor Author

@dpiddockcmp I think if the AWS node termination handler added support (aws/aws-node-termination-handler#181) then this module could provide the trigger to start the process as a user option.

@stale
Copy link

stale bot commented Oct 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 1, 2020
@barryib barryib removed the stale label Oct 4, 2020
@barryib
Copy link
Member

barryib commented Nov 12, 2020

For the records:

@stevehipwell
Copy link
Contributor Author

@barryib thanks for the update.

@gdavison
Copy link

Hi all. We've just merged hashicorp/terraform-provider-aws#16678. Instance Refresh support will be shipping in AWS Provider v3.22.0 today

@brennerm
Copy link

brennerm commented Jan 29, 2021

Would be willing to take care of integrating the new instance refresh support into this module.

I'd introduce a flag to allow to enable/disable the instance refresh. What should be the default?

What would be sane defaults for warmup time and min healthy percentage?

@ffjia
Copy link
Contributor

ffjia commented Feb 4, 2021

@brennerm guess the default should be enable. I have another question, how did you test it? did you add other trigger for instance refresh?

@bashims
Copy link
Contributor

bashims commented Feb 4, 2021

@brennerm Is this work in progress?

@brennerm
Copy link

brennerm commented Feb 4, 2021

@ffjia Not sure about that. Think of people that are running stateful applications that can't handle being shutdown at any time. Additionally it would introduce a change to the current behavior of this module that users may not expect. Not tested it yet but I would simply make a change to the launch template/configuration and see if the instances refresh

@bashims Not yet, but I'd start this weekend.

FYI would go with the default values of AWS regarding warmup time and min healthy percentage.

@bashims
Copy link
Contributor

bashims commented Feb 4, 2021

@brennerm thanks for the quick reply! I am mostly done with the updates since we need it asap. Mind if I lend a hand here?

I agree that it should not be enabled by default.

@brennerm
Copy link

brennerm commented Feb 4, 2021

Sure go ahead. No need to make the work twice.

@ffjia
Copy link
Contributor

ffjia commented Feb 7, 2021

@ffjia Not sure about that. Think of people that are running stateful applications that can't handle being shutdown at any time. Additionally it would introduce a change to the current behavior of this module that users may not expect. Not tested it yet but I would simply make a change to the launch template/configuration and see if the instances refresh

According to the Terraform document, instance refresh will be triggered by launch_configuration, launch_template, andmixed_instances_policy, I don't know if there are scenarios that someone would update those resources and won't want to replace EC2 instances. For sure it'll introduce new behavior that user may not expected, but in a good way, I think it's better than bringing up a new ASG, and destroying the old ASG.

@bashims
Copy link
Contributor

bashims commented Feb 7, 2021

I've tested the instance_refresh feature and it definitely resolves our issue with respect to managing a consistent set of nodes. My current PR has left the feature off by default so it should not impact those who are not ready to move forward with automatic instance refresh.

@ffjia I agree with you. ASG recreation was always a bit of a hack to work around a missing feature in AWS. Unfortunately we have incurred downtime when when transitioning between ASGs and will greatly benefit from the instance refresh feature.

@stale
Copy link

stale bot commented May 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 8, 2021
@stevehipwell
Copy link
Contributor Author

This should be left open as PR #1224 will close.

@stale stale bot removed the stale label May 10, 2021
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
7 participants