Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for Auto Scaling Group Instance Refresh for self-managed worker groups #1224

Merged

Conversation

bashims
Copy link
Contributor

@bashims bashims commented Feb 5, 2021

PR o'clock

Description

Add support for the instance_refresh feature of the AWS provisioner added in hashicorp/terraform-provider-aws#16678

This PR resolves #929

See:
https://aws.amazon.com/blogs/compute/introducing-instance-refresh-for-ec2-auto-scaling/

Checklist

@bashims bashims force-pushed the 929-support-instance-refresh branch from 0f8b9da to 94dedbd Compare February 5, 2021 00:07
@bashims bashims marked this pull request as ready for review February 5, 2021 00:10
@stevehipwell
Copy link
Contributor

@bashims does the phased deployments anouncement make any difference to this PR?

It looks like you haven't configured the ASGs to work with the aws-node-termination-handler (queue processor mode, chart)? I would have expected this to be a requirement as you wouldn't want a node to suddenly terminate with no warning.

@bashims
Copy link
Contributor Author

bashims commented Mar 26, 2021

@bashims does the phased deployments anouncement make any difference to this PR?

Good question! I am not too sure if the checkpoint feature is available in the AWS provider. For now it seems to support the original instance refresh feature.

It looks like you haven't configured the ASGs to work with the aws-node-termination-handler (queue processor mode, chart)? I would have expected this to be a requirement as you wouldn't want a node to suddenly terminate with no warning.

Good point. I can add something to the docs to mention that one should deploy the aws-node-termination-handler chart when enabling instance refresh. The eks TF module does not provide support for deploying charts etc, so I doubt it would be a good fit here.

@stevehipwell
Copy link
Contributor

@bashims I wasn't refering to actually installing the chart but you'll need to set up a SQS queue for the events and then provide the chart installation instructions to work with instance refresh (specifically controller not daemonset mode).

@bashims
Copy link
Contributor Author

bashims commented Mar 29, 2021

@bashims I wasn't refering to actually installing the chart but you'll need to set up a SQS queue for the events and then provide the chart installation instructions to work with instance refresh (specifically controller not daemonset mode).

@stevehipwell Awesome feedback. I was not aware. I will update this PR to include this feature.

@Erokos
Copy link

Erokos commented Apr 22, 2021

Hi, what's the status with this PR?

@Erokos
Copy link

Erokos commented May 6, 2021

Hi,
one other question regarding the work in this PR is, when this gets merges and used, will the "asg_recreate_on_change" param still be used/necessary?

@barryib barryib self-assigned this May 6, 2021
@barryib
Copy link
Member

barryib commented May 6, 2021

Hi,
one other question regarding the work in this PR is, when this gets merges and used, will the "asg_recreate_on_change" param still be used/necessary?

Good question. I don't think so. I think with this addition we can drop the asg_recreate_on_change and random pet names.

@bashims @stevehipwell I think we should add in docs that using instance refresh will recreate your nodes, but it doesn't drains them. To do that, users should install https://github.com/aws/aws-node-termination-handler

Related to:

@stevehipwell
Copy link
Contributor

@bashims @barryib I assume that this PR will add the SQS queue that aws-node-termination-handler needs. So if instance refresh is turned on you need to install aws-node-termination-handler for draining but everything else works?

@Erokos
Copy link

Erokos commented May 6, 2021

I wholeheartedly agree. From reading the issues, I've come across references on random_pet usage and from the docs it isn't highlighted that using asg_recreate_on_change doesn't drain nodes and in my opinion it should say so. To some it may be obvious but a warning never hurt nobody 🙂. I could be wrong but without having to manage the draining yourself, by having a shell script within a null resource (with a remote-exec) or something similar, I don't see how else it could be done.

@barryib
Copy link
Member

barryib commented May 6, 2021

@bashims @barryib I assume that this PR will add the SQS queue that aws-node-termination-handler needs. So if instance refresh is turned on you need to install aws-node-termination-handler for draining but everything else works?

Humm... But users should handle aws-node-terminination-handler installation with all requirements themselves. In this module, we'll only support changes in worker ASG like:

  • instance refresh configuration
  • lifecycle hooks configuration

Everything else should be handled outside of this module. We can write docs for that. If someone willing to write a module to address aws-node-termination-handler installation, we can probably include that link in docs.

@stevehipwell
Copy link
Contributor

@barryib I think the SQS linking is part of the ASG configuration, so at the minimum that would need to be passed into the module so the aws-node-termination-handler can be linked up correctly.

@barryib
Copy link
Member

barryib commented May 6, 2021

@barryib I think the SQS linking is part of the ASG configuration, so at the minimum that would need to be passed into the module so the aws-node-termination-handler can be linked up correctly.

Yep. It'll be part of the lifecycle hook configuration https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L37

@bashims
Copy link
Contributor Author

bashims commented May 6, 2021

Hi,
one other question regarding the work in this PR is, when this gets merges and used, will the "asg_recreate_on_change" param still be used/necessary?

Good question. I don't think so. I think with this addition we can drop the asg_recreate_on_change and random pet names.

@bashims @stevehipwell I think we should add in docs that using instance refresh will recreate your nodes, but it doesn't drains them. To do that, users should install https://github.com/aws/aws-node-termination-handler

Related to:

* #462

I can take care of that in this PR.

@barryib
Copy link
Member

barryib commented May 6, 2021

I can take care of that in this PR.

Thanks @bashims. Can you please:

  • Update docs to highlight the need of aws-node-termination handler for draining
  • Open a new PR (if you have time) to drop asg_recreate_on_change and random_pets in for LT and LC

variables.tf Outdated Show resolved Hide resolved
workers.tf Outdated Show resolved Hide resolved
@bashims
Copy link
Contributor Author

bashims commented May 7, 2021

I can take care of that in this PR.

Thanks @bashims. Can you please:

* Update docs to highlight the need of aws-node-termination handler for draining

* Open a new PR (if you have time) to drop `asg_recreate_on_change` and random_pets in for LT and LC

Absolutely.

@bashims bashims force-pushed the 929-support-instance-refresh branch from 94dedbd to 6b3b135 Compare May 12, 2021 17:25
@bashims
Copy link
Contributor Author

bashims commented May 12, 2021

@bashims @barryib I assume that this PR will add the SQS queue that aws-node-termination-handler needs. So if instance refresh is turned on you need to install aws-node-termination-handler for draining but everything else works?

Humm... But users should handle aws-node-terminination-handler installation with all requirements themselves. In this module, we'll only support changes in worker ASG like:

* instance refresh configuration

* lifecycle hooks configuration

Everything else should be handled outside of this module. We can write docs for that. If someone willing to write a module to address aws-node-termination-handler installation, we can probably include that link in docs.

@bashims bashims closed this May 12, 2021
@bashims bashims reopened this May 12, 2021
@bashims
Copy link
Contributor Author

bashims commented May 13, 2021

@barryib I think this PR is nearly ready. I am working on complete example that includes necessary setup for graceful node draining with instance refresh. I should have this done soon.

local.tf Outdated Show resolved Hide resolved
local.tf Outdated Show resolved Hide resolved
@bashims bashims force-pushed the 929-support-instance-refresh branch from 36285a0 to e15be3d Compare May 15, 2021 10:24
@bashims bashims force-pushed the 929-support-instance-refresh branch from e15be3d to e40fe2b Compare May 15, 2021 11:17
@bashims
Copy link
Contributor Author

bashims commented May 15, 2021

I think that this PR is ready to go. The complete example works perfectly.

Just waiting on workflow approval, and final review. Once that is done I will open a PR for dropping asg_recreate_on_change and random pet names.

examples/instance_refresh/main.tf Outdated Show resolved Hide resolved
examples/instance_refresh/main.tf Outdated Show resolved Hide resolved
@barryib
Copy link
Member

barryib commented May 17, 2021

@bashims I just opened #1360 to address the random pets removal.

@barryib
Copy link
Member

barryib commented May 17, 2021

@bashims you need to update your branch and resolve conflicts.

@bashims
Copy link
Contributor Author

bashims commented May 17, 2021

@barryib are you okay if I squash and rebase off of master?

@barryib
Copy link
Member

barryib commented May 17, 2021

@barryib are you okay if I squash and rebase off of master?

Yep.

@bashims bashims force-pushed the 929-support-instance-refresh branch from 2a71d44 to 05835f8 Compare May 17, 2021 16:49
examples/instance_refresh/main.tf Outdated Show resolved Hide resolved
examples/instance_refresh/main.tf Outdated Show resolved Hide resolved
@barryib barryib changed the title feat: add support for ASG instance refresh for workers feat: Add support for Auto Scaling Group Instance Refresh for self-managed worker groups May 17, 2021
@barryib barryib merged commit 68e9df9 into terraform-aws-modules:master May 17, 2021
@barryib
Copy link
Member

barryib commented May 17, 2021

Thanks @bashims for your contributions.

@barryib
Copy link
Member

barryib commented May 17, 2021

This now shipped in v16.0.0

@bashims bashims deleted the 929-support-instance-refresh branch May 19, 2021 14:44
barryib added a commit to barryib/terraform-aws-eks that referenced this pull request May 20, 2021
ArchiFleKs pushed a commit to ArchiFleKs/terraform-aws-eks that referenced this pull request Jun 1, 2021
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instance Refresh for EC2 Auto Scaling
7 participants