
How to handle worker upgrade automatically? #462

Closed
2 of 4 tasks
barryib opened this issue Aug 5, 2019 · 21 comments

@barryib
Member

barryib commented Aug 5, 2019

I have issues

This module is great for deploying EKS clusters, but it deliberately leaves worker upgrades out of its scope. That is fine for some users, but for us, dealing with worker upgrades manually is a painful and repetitive task, especially when you have a lot of workers.

This issue is more of a discussion to decide whether we want to implement this and how we should handle it.

I'm submitting a...

  • bug report
  • feature request
  • support request
  • kudos, thank you, warm fuzzy

Are you able to fix this problem and submit a PR? Link here if you have already.

Yes, I'll be happy to submit PRs for this. But before that, I want to know what direction we should take.

To handle this, I would like to use autoscaling group lifecycle hooks to drain nodes during scale-in. I want to use a Lambda function that subscribes to autoscaling:EC2_INSTANCE_TERMINATING events and drains nodes before the ASG terminates the EC2 instances.
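
For illustration, the event wiring could look something like this in Terraform (a minimal sketch: the rule and variable names are made up, and the draining Lambda itself is out of scope here):

```hcl
# Wiring only; the draining Lambda (drainer_lambda_arn / drainer_lambda_name)
# is assumed to exist elsewhere.
variable "drainer_lambda_arn" {}
variable "drainer_lambda_name" {}

resource "aws_cloudwatch_event_rule" "node_terminating" {
  name        = "eks-node-terminating"
  description = "Fires when the ASG puts a worker into the Terminating:Wait state"

  event_pattern = jsonencode({
    source        = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance-terminate Lifecycle Action"]
  })
}

resource "aws_cloudwatch_event_target" "drainer" {
  rule = aws_cloudwatch_event_rule.node_terminating.name
  arn  = var.drainer_lambda_arn
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = var.drainer_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.node_terminating.arn
}
```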

There is already a good proof of concept in aws-samples, called amazon-k8s-node-drainer.

By using ASG lifecycle hooks, we can achieve what @max-rocket-internet proposed in #412 (comment).

And by using both hooks and CloudFormation, we can tackle #333 (comment).

So, my point is NOT to handle all of this within this module, but I think the module should allow users to decide whether or not to scale in nodes after an LT change and let them handle node draining themselves.

My questions here are:

  1. How to handle node scale-in? Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes? This would only prefix the ASG name with the LT name (see the sketch below).
  2. For users who want more control over upgrades, are CloudFormation templates welcome? They can add rolling upgrades (node by node) and fit nicely with autoscaling group lifecycle hooks.
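
To illustrate question 1, a minimal sketch of the name-prefix idea (resource names and sizes are made up; it assumes an aws_launch_template.workers and var.subnets defined elsewhere):

```hcl
resource "aws_autoscaling_group" "workers" {
  # Embedding the LT name and version in the ASG name means any LT change
  # changes the name, which forces Terraform to replace the whole ASG.
  name_prefix = "${aws_launch_template.workers.name}-${aws_launch_template.workers.latest_version}-"

  min_size            = 1
  max_size            = 3
  desired_capacity    = 2
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }

  lifecycle {
    # Bring the replacement ASG up before tearing the old one down.
    create_before_destroy = true
  }
}
```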

Any other relevant info

Here are additional links for ASG lifecycle hooks:

@max-rocket-internet
Contributor

max-rocket-internet commented Aug 7, 2019

Thanks for the detailed issue @barryib!

So, my point is to handle all these with this module,

I'm skeptical about including a Lambda function and all the other bits like SNS/SQS in this repo. It won't be simple, and I'm not sure it belongs in the scope of this TF module. But we could just include an optional aws_autoscaling_lifecycle_hook resource, with its settings and notification_target_arn supplied from outside.
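
Something like this, for example (a rough sketch; the hook name, timeout, and variables are made up):

```hcl
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "drain-on-scale-in"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300

  # Both of these are created and owned outside this module.
  notification_target_arn = var.notification_target_arn
  role_arn                = var.lifecycle_hook_role_arn
}
```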

Questions:

I would like to use autoscaling group lifecycle hooks to drain nodes during scale-in.

At what point is the ASG ever choosing to terminate instances? In my experience the cluster-autoscaler tells the ASG which instance to remove after it drains the node.

and CloudFormation

Where and why is CFN involved here?

Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes?

But as soon as the ASG is deleted, all instances are terminated?

are CloudFormation templates welcome?

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

@RothAndrew
Contributor

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

Rock hard

@max-rocket-internet
Contributor

OK, to answer my own questions...

At what point is the ASG ever choosing to terminate instances?

This can only be achieved with RollingUpdate from CFN. As I understand it, this is not achievable in TF.

Where and why is CFN involved here?

See above

Overall I like the idea. I love AWS Lambda. I would like to have automation around this process. But IIRC, this is what you are proposing:

  • ASG and LC/LTs must be controlled by CFN
  • Include AWS Lambda function
  • Include associated Lambda trigger resources (SNS or SQS, IAM policy etc)

I'm always open to other opinions, so add yours if it's missing, but I think most people won't be happy with this direction.

@max-rocket-internet
Contributor

Also, the node update process really isn't that difficult? I mean you could script what I wrote here in about 10 lines of shell, right?

@mlafeldt

mlafeldt commented Aug 7, 2019

Where and why is CFN involved here?

Unlike Terraform, CloudFormation allows you to replace nodes in batches of N instances (plus you have resource signaling to indicate that an instance is actually ready). When N is 1 and you have some mechanism like the mentioned node drainer, you can safely update all worker nodes with minimum disruption.

I recommend reading https://medium.com/@endofcake/using-terraform-for-zero-downtime-updates-of-an-auto-scaling-group-in-aws-60faca582664 on the subject.
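
For reference, the hybrid pattern from that article is roughly: Terraform manages everything except the ASG itself, which is delegated to a small CloudFormation stack so that CFN's UpdatePolicy can do the batched replacement. A rough sketch (names and sizes are made up; it assumes a launch template and var.subnets defined elsewhere):

```hcl
resource "aws_cloudformation_stack" "workers_asg" {
  name = "eks-workers-asg"

  # The only CFN piece: an ASG whose rolling update replaces one node at a time.
  template_body = <<-EOT
    Resources:
      WorkersASG:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          MinSize: "1"
          MaxSize: "4"
          DesiredCapacity: "2"
          VPCZoneIdentifier: ${jsonencode(var.subnets)}
          LaunchTemplate:
            LaunchTemplateId: ${aws_launch_template.workers.id}
            Version: "${aws_launch_template.workers.latest_version}"
        UpdatePolicy:
          AutoScalingRollingUpdate:
            MaxBatchSize: "1"
            MinInstancesInService: "2"
            PauseTime: PT5M
  EOT
}
```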

@RothAndrew
Contributor

RothAndrew commented Aug 7, 2019

It sounds interesting, but yeah, that's a no from me dawg (RandyJacksonMemeHere)

@barryib
Member Author

barryib commented Aug 7, 2019

@max-rocket-internet

So, my point is to handle all these with this module,

Sorry for the typo. My point is not to handle all of this with this module. In fact, this module should only provide something to trigger the change, and perhaps an option to create the initial lifecycle hook.

But as soon as the ASG is deleted, all instances are terminated?

Yes, this is what I want. When coupled with a lifecycle hook, the ASG doesn't terminate instances directly; it only puts the EC2 instance into the Pending:Wait or Terminating:Wait state. From there, you can run a custom action, with a Lambda for example.

@RothAndrew For the CloudFormation discussion, this is quite a long debate. I don't like it either, but it gives us more flexibility on EC2 upgrades. This is a fact! In addition to @mlafeldt's #462 (comment), I'll add that upgrading node by node can also prevent you from hitting EC2 resource limits.

@max-rocket-internet So my proposal here so far is to only add:

  • an option to recreate ASG when LT or LC change => PR #465
  • an option to create initial life cycle hook

@RothAndrew
Contributor

RothAndrew commented Aug 7, 2019

I was opposed to adding CloudFormation to this Terraform module. This sounds more reasonable. I think it still needs some more discussion, but I'm less disgusted by the idea now. I share Max's hatred for all things CloudFormation.

@max-rocket-internet
Contributor

max-rocket-internet commented Aug 7, 2019

an option to recreate ASG when LT or LC change => PR #465

Looks OK to me. Even outside of k8s node updates this would be useful. Still keen to see what others think, though.

an option to create initial life cycle hook

As I understand, this is pointless without the CFN part?

I don't like it either, but it gives us more flexibility on EC2 upgrades

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

@barryib
Member Author

barryib commented Aug 7, 2019

As I understand, this is pointless without the CFN part?

This is not related to CloudFormation at all. A lifecycle hook is an Auto Scaling group feature. It can be added during ASG creation: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#initial_lifecycle_hook
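
For example, something like this (a sketch; the hook settings and variables are made up, and the notification target is owned by the user):

```hcl
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }

  # Attached at ASG creation time, so no instance can terminate without
  # first passing through Terminating:Wait.
  initial_lifecycle_hook {
    name                    = "drain-on-scale-in"
    default_result          = "CONTINUE"
    heartbeat_timeout       = 300
    lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
    notification_target_arn = var.notification_target_arn
    role_arn                = var.lifecycle_hook_role_arn
  }
}
```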

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

I totally disagree with the term "overkill" here. We are just offering users the ability to use AWS functionality (like Lambda, which is quite standard today) to achieve something on EKS worker nodes.

Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.

As you noticed, I have added #465 and #466 to give users the ability to handle this themselves.

@mlafeldt

FWIW, I actually tested the K8s node drainer with Terraform/CloudFormation: #333 (comment)

@max-rocket-internet
Contributor

Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.

OK cool!

As you noticed, I have added #465 and #466 to give users the ability to handle this themselves.

Great. Thanks for the effort, let's merge these 😃

@stale

stale bot commented Jan 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 3, 2020
@barryib
Member Author

barryib commented Jan 4, 2020

In the next couple of days, I'll add a small doc with links to some projects and issues to track in order to achieve this.

@stale stale bot removed the stale label Jan 4, 2020
@manfredlift

@barryib, I'm interested in learning about your findings and the best and cleanest way to achieve worker upgrades.

@stale

stale bot commented Apr 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 22, 2020
@barryib barryib removed the stale label Apr 22, 2020
@barryib
Member Author

barryib commented Apr 23, 2020

/remove stale

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 22, 2020
@stale

stale bot commented Aug 21, 2020

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Aug 21, 2020
@barryib
Member Author

barryib commented May 28, 2021

Closing, since instance_refresh is the recommended way to do this for self-managed worker groups.
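
For anyone landing here, enabling it on a self-managed group looks roughly like this (a sketch; names, sizes, and the threshold are just examples):

```hcl
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }

  # Roll instances automatically whenever the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90 # arbitrary example threshold
    }
  }
}
```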

@barryib barryib closed this as completed May 28, 2021
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022