Instance Refresh for EC2 Auto Scaling #929

stevehipwell · 2020-06-24T09:32:27Z

I'm submitting a...

bug report
feature request
support request - read the FAQ first!
kudos, thank you, warm fuzzy

What is the current behavior?

When this module changes the worker group launch template, an external action is required to update the workers to use the new launch template.

If this is a bug, how to reproduce? Please include a code sample if relevant.

n/a

What's the expected behavior?

I'm interested to see if the new Instance Refresh for EC2 Auto Scaling would be something that could be leveraged by this module to automate worker upgrades?

Currently the best solution I've found for this is the HelloFresh EKS Rolling Update, but because it is synchronous and time-consuming it doesn't seem like a good fit for a Terraform module. I'd be interested to see if Instance Refresh for EC2 Auto Scaling is an alternative that can be embedded in a Terraform module.

Are you able to fix this problem and submit a PR? Link here if you have already.

n/a

Environment details

n/a

Any other relevant info

n/a

dpiddockcmp · 2020-07-03T15:09:43Z

Blocked first on support being available in the AWS provider. Looks like there's a WIP PR over there: hashicorp/terraform-provider-aws#13791

The wiring for this would be complicated if you want to avoid service interruption. You need to inform Kubernetes that a node is about to go away via a drain call. Ideally you need spare capacity, or the replacement node already running and fully joined to the cluster, to avoid downtime. How would the notification for new-node-ready be linked to something as primitive as this instance refresh? A rolling refresh of this type would potentially cause a lot of pod interruption unless you cordon the whole ASG first. But then what happens if the rollout fails and all your nodes are cordoned?

I think this is outside of the scope of this module. Look at the complexity involved in #937 for setting up node draining.

Maybe some day the managed node groups feature will be useful enough for everybody? Although you still need a system outside of terraform to trigger a rollout of them.
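The ordering concern in the comment above (replacement node ready first, then cordon, drain, and only then terminate) can be made concrete with a small in-memory model. This is a sketch only: `Node`, `replace_node`, and the dictionary-based cluster are hypothetical names for illustration, not part of this module, Kubernetes, or the AWS API. A real rollout would call the Kubernetes and Auto Scaling APIs at each step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a worker node and the pods it runs."""
    name: str
    ready: bool = True
    schedulable: bool = True
    pods: list = field(default_factory=list)


def replace_node(cluster: dict, old_name: str, new_name: str, log: list) -> None:
    """Replace one node without dropping pods.

    Order matters: join the replacement, cordon the old node,
    drain it, and terminate it last.
    """
    old = cluster[old_name]

    # 1. Bring up the replacement and treat it as joined and ready.
    cluster[new_name] = Node(new_name)
    log.append(f"ready {new_name}")

    # 2. Cordon the old node so nothing new is scheduled onto it.
    old.schedulable = False
    log.append(f"cordon {old_name}")

    # 3. Drain: spread the old node's pods over the remaining
    #    schedulable, ready nodes (round-robin for simplicity).
    targets = [n for n in cluster.values() if n.schedulable and n.ready]
    for i, pod in enumerate(list(old.pods)):
        targets[i % len(targets)].pods.append(pod)
    old.pods.clear()
    log.append(f"drain {old_name}")

    # 4. Only now remove the old instance.
    del cluster[old_name]
    log.append(f"terminate {old_name}")
```

Because the replacement is added before the drain, there is always at least one schedulable target, which is the "spare capacity" condition described above. An instance refresh on its own gives no such guarantee, which is why a separate drain mechanism is discussed in this thread.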

stevehipwell · 2020-07-03T16:59:20Z

@dpiddockcmp I think if the AWS node termination handler added support (aws/aws-node-termination-handler#181) then this module could provide the trigger to start the process as a user option.

stale · 2020-10-01T17:57:32Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

barryib · 2020-11-12T22:17:49Z

For the record:

The AWS node termination handler now supports non-Spot instances. See the Node Termination Handler Queue Processor mode.
Once the instance refresh resource is ready, we can definitely use them together for smooth rolling upgrades.

stevehipwell · 2020-11-12T22:31:34Z

@barryib thanks for the update.

gdavison · 2020-12-18T21:39:07Z

Hi all. We've just merged hashicorp/terraform-provider-aws#16678. Instance Refresh support will be shipping in AWS Provider v3.22.0 today

brennerm · 2021-01-29T11:15:01Z

Would be willing to take care of integrating the new instance refresh support into this module.

I'd introduce a flag to enable or disable instance refresh. What should be the default?

What would be sane defaults for warmup time and min healthy percentage?

ffjia · 2021-02-04T01:13:05Z

@brennerm I guess the default should be enabled. I have another question: how did you test it? Did you add any other triggers for instance refresh?

bashims · 2021-02-04T19:38:32Z

@brennerm Is this work in progress?

brennerm · 2021-02-04T20:03:37Z

@ffjia Not sure about that. Think of people running stateful applications that can't handle being shut down at any time. Additionally, it would introduce a change to the current behavior of this module that users may not expect. I haven't tested it yet, but I would simply make a change to the launch template/configuration and see if the instances refresh.

@bashims Not yet, but I'd start this weekend.

FYI, I would go with the AWS default values for warmup time and min healthy percentage.
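To get a feel for what the min healthy percentage means for rollout speed, here is a rough sketch of how it bounds the number of instances replaced at once. The function names are hypothetical, the rounding rule is an assumption rather than documented AWS behavior, and the default of 90 is used on the assumption that it matches the AWS default at the time of this thread. Check the current AWS documentation before relying on any of these numbers.

```python
def max_replaced_at_once(desired_capacity: int, min_healthy_percentage: int = 90) -> int:
    """Upper bound on instances taken out of service in one batch.

    Assumption: the healthy floor is rounded up, and at least one
    instance is always replaced so the refresh can make progress.
    """
    if desired_capacity <= 0:
        raise ValueError("desired_capacity must be positive")
    if not 0 <= min_healthy_percentage <= 100:
        raise ValueError("min_healthy_percentage must be between 0 and 100")

    # Integer ceiling of desired_capacity * pct / 100 (avoids float rounding).
    must_stay_healthy = -(-desired_capacity * min_healthy_percentage // 100)
    return max(1, desired_capacity - must_stay_healthy)


def refresh_rounds(desired_capacity: int, min_healthy_percentage: int = 90) -> int:
    """Number of batches needed to replace every instance once."""
    batch = max_replaced_at_once(desired_capacity, min_healthy_percentage)
    return -(-desired_capacity // batch)
```

Under these assumptions a 10-node group at 90% is replaced one node at a time, so total rollout time scales with the warmup time multiplied by the node count. That trade-off is worth weighing when picking defaults for a worker group.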

bashims · 2021-02-04T20:29:18Z

@brennerm thanks for the quick reply! I am mostly done with the updates since we need it asap. Mind if I lend a hand here?

I agree that it should not be enabled by default.

brennerm · 2021-02-04T20:32:24Z

Sure, go ahead. No need to do the work twice.

ffjia · 2021-02-07T08:02:02Z

> @ffjia Not sure about that. Think of people running stateful applications that can't handle being shut down at any time. Additionally, it would introduce a change to the current behavior of this module that users may not expect. I haven't tested it yet, but I would simply make a change to the launch template/configuration and see if the instances refresh.

According to the Terraform documentation, an instance refresh is triggered by changes to launch_configuration, launch_template, and mixed_instances_policy. I don't know of any scenario where someone would update those resources and not want to replace the EC2 instances. It will certainly introduce new behavior that users may not expect, but in a good way: I think it's better than bringing up a new ASG and destroying the old one.
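The trigger rule described above can be modelled as a simple comparison of the old and new ASG settings. This is an illustrative sketch, not the provider's implementation: `refresh_needed` and the plain dictionaries are hypothetical, and `extra_triggers` only stands in for any additional attributes a user might opt into.

```python
# Attributes that, per the comment above, start a refresh when changed.
ALWAYS_TRIGGER = frozenset(
    {"launch_configuration", "launch_template", "mixed_instances_policy"}
)


def refresh_needed(old: dict, new: dict, extra_triggers=()) -> bool:
    """Return True if any watched attribute differs between old and new."""
    watched = ALWAYS_TRIGGER | set(extra_triggers)
    return any(old.get(key) != new.get(key) for key in watched)
```

For example, bumping a launch template version would report a refresh, while changing only a tag would not unless `"tags"` were passed in `extra_triggers`.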

bashims · 2021-02-07T15:42:32Z

I've tested the instance_refresh feature and it definitely resolves our issue with respect to managing a consistent set of nodes. My current PR has left the feature off by default so it should not impact those who are not ready to move forward with automatic instance refresh.

@ffjia I agree with you. ASG recreation was always a bit of a hack to work around a missing feature in AWS. Unfortunately we have incurred downtime when transitioning between ASGs and will greatly benefit from the instance refresh feature.

stale · 2021-05-08T21:15:20Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stevehipwell · 2021-05-10T05:57:29Z

This should be left open, as PR #1224 will close it.

github-actions · 2022-11-21T02:28:48Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

stale bot added the stale label Oct 1, 2020

barryib removed the stale label Oct 4, 2020

bashims mentioned this issue Feb 5, 2021

feat: Add support for Auto Scaling Group Instance Refresh for self-managed worker groups #1224

Merged

stale bot added the stale label May 8, 2021

stale bot removed the stale label May 10, 2021

barryib closed this as completed in #1224 May 17, 2021

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022
