
How to handle worker upgrade automatically? #462

Closed
2 of 4 tasks
barryib opened this issue Aug 5, 2019 · 21 comments

@barryib
Member

barryib commented Aug 5, 2019

I have issues

This module is great for deploying EKS clusters, but it deliberately leaves worker upgrades out of its scope. That is fine for some users, but for us, dealing with worker upgrades manually is a painful and repetitive task, especially when you have a lot of workers.

This issue is more of a discussion to decide whether we want to implement this and how we should handle it.

I'm submitting a...

  • bug report
  • feature request
  • support request
  • kudos, thank you, warm fuzzy

Are you able to fix this problem and submit a PR? Link here if you have already.

Yes, I'll be happy to submit PRs for this. But before that, I want to know what direction we should take.

To handle this, I would like to use autoscaling group lifecycle hooks to drain nodes during scale-in. I want to use a Lambda function that subscribes to autoscaling:EC2_INSTANCE_TERMINATING events and drains nodes before the ASG terminates the EC2 instances.
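
For illustration, the event wiring could look something like this in Terraform (a minimal sketch: the rule and variable names are made up, and the draining Lambda itself is out of scope here):

```hcl
# Wiring only; the draining Lambda (drainer_lambda_arn / drainer_lambda_name)
# is assumed to exist elsewhere.
variable "drainer_lambda_arn" {}
variable "drainer_lambda_name" {}

resource "aws_cloudwatch_event_rule" "node_terminating" {
  name        = "eks-node-terminating"
  description = "Fires when the ASG puts a worker into the Terminating:Wait state"

  event_pattern = jsonencode({
    source        = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance-terminate Lifecycle Action"]
  })
}

resource "aws_cloudwatch_event_target" "drainer" {
  rule = aws_cloudwatch_event_rule.node_terminating.name
  arn  = var.drainer_lambda_arn
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = var.drainer_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.node_terminating.arn
}
```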

There is already a good proof of concept in aws-samples, called amazon-k8s-node-drainer.

By using ASG lifecycle hooks, we can achieve what @max-rocket-internet proposed in #412 (comment).

And by using both hooks and CloudFormation, we can tackle #333 (comment).

So, my point is NOT to handle all of this within this module, but I think the module should allow users to decide whether or not to scale in nodes after an LT change and let them handle node draining themselves.

My questions here are:

  1. How to handle node scale-in? Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes? This would only prefix the ASG name with the LT name (see the sketch below).
  2. For users who want more control over upgrades, are CloudFormation templates welcome? They can add rolling upgrades (node by node) and fit nicely with autoscaling group lifecycle hooks.
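
To illustrate question 1, a minimal sketch of the name-prefix idea (resource names and sizes are made up; it assumes an aws_launch_template.workers and var.subnets defined elsewhere):

```hcl
resource "aws_autoscaling_group" "workers" {
  # Embedding the LT name and version in the ASG name means any LT change
  # changes the name, which forces Terraform to replace the whole ASG.
  name_prefix = "${aws_launch_template.workers.name}-${aws_launch_template.workers.latest_version}-"

  min_size            = 1
  max_size            = 3
  desired_capacity    = 2
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }

  lifecycle {
    # Bring the replacement ASG up before tearing the old one down.
    create_before_destroy = true
  }
}
```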

Any other relevant info

Here are additional links for ASG lifecycle hooks:

@max-rocket-internet
Contributor

max-rocket-internet commented Aug 7, 2019

Thanks for the detailed issue @barryib!

So, my point is to handle all these with this module,

I'm skeptical about including a Lambda function and all the other bits like SNS/SQS in this repo. It won't be simple, and I'm not sure it belongs in the scope of this TF module. But we could just include an optional aws_autoscaling_lifecycle_hook resource, with its settings and notification_target_arn supplied from outside.
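
Something like this, for example (a rough sketch; the hook name, timeout, and variables are made up):

```hcl
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "drain-on-scale-in"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300

  # Both of these are created and owned outside this module.
  notification_target_arn = var.notification_target_arn
  role_arn                = var.lifecycle_hook_role_arn
}
```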

Questions:

I would like to use autoscaling group lifecycle hooks to drain nodes during scale-in.

At what point is the ASG ever choosing to terminate instances? In my experience the cluster-autoscaler tells the ASG which instance to remove after it drains the node.

and CloudFormation

Where and why is CFN involved here?

Can I add a variable like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes?

But as soon as the ASG is deleted, all instances are terminated?

are CloudFormation templates welcome?

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

@RothAndrew
Contributor

I would say hard no as I hate CFN 😆 but open to hear other people's opinions.

Rock hard

@max-rocket-internet
Contributor

OK, to answer my own questions...

At what point is the ASG ever choosing to terminate instances?

This can only be achieved with RollingUpdate from CFN. As I understand it, this is not achievable in TF.

Where and why is CFN involved here?

See above

Overall I like the idea. I love AWS Lambda. I would like to have automation around this process. But IIRC, this is what you are proposing:

  • ASG and LC/LTs must be controlled by CFN
  • Include AWS Lambda function
  • Include associated Lambda trigger resources (SNS or SQS, IAM policy etc)

I'm always open to other opinions, so add yours if it's missing, but I think most people won't be happy with this direction.

@max-rocket-internet
Contributor

Also, the node update process really isn't that difficult? I mean you could script what I wrote here in about 10 lines of shell, right?

@mlafeldt

mlafeldt commented Aug 7, 2019

Where and why is CFN involved here?

Unlike Terraform, CloudFormation allows you to replace nodes in batches of N instances (plus you have resource signaling to indicate that an instance is actually ready). When N is 1 and you have some mechanism like the mentioned node drainer, you can safely update all worker nodes with minimum disruption.

I recommend reading https://medium.com/@endofcake/using-terraform-for-zero-downtime-updates-of-an-auto-scaling-group-in-aws-60faca582664 on the subject.
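
For reference, the hybrid pattern from that article is roughly: Terraform manages everything except the ASG itself, which is delegated to a small CloudFormation stack so that CFN's UpdatePolicy can do the batched replacement. A rough sketch (names and sizes are made up; it assumes a launch template and var.subnets defined elsewhere):

```hcl
resource "aws_cloudformation_stack" "workers_asg" {
  name = "eks-workers-asg"

  # The only CFN piece: an ASG whose rolling update replaces one node at a time.
  template_body = <<-EOT
    Resources:
      WorkersASG:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          MinSize: "1"
          MaxSize: "4"
          DesiredCapacity: "2"
          VPCZoneIdentifier: ${jsonencode(var.subnets)}
          LaunchTemplate:
            LaunchTemplateId: ${aws_launch_template.workers.id}
            Version: "${aws_launch_template.workers.latest_version}"
        UpdatePolicy:
          AutoScalingRollingUpdate:
            MaxBatchSize: "1"
            MinInstancesInService: "2"
            PauseTime: PT5M
  EOT
}
```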

@RothAndrew
Contributor

RothAndrew commented Aug 7, 2019

It sounds interesting, but yeah, that's a no from me dawg (RandyJacksonMemeHere)

@barryib
Member Author

barryib commented Aug 7, 2019

@max-rocket-internet

So, my point is to handle all these with this module,

Sorry for the typo. My point is not to handle all of this with this module. In fact, this module should only provide something to trigger the change, and perhaps an option to create the initial lifecycle hook.

But as soon as the ASG is deleted, all instances are terminated?

Yes, this is what I want. When coupled with a lifecycle hook, the ASG doesn't terminate instances directly; it only puts the EC2 instance into the Pending:Wait or Terminating:Wait state. From there, you can run a custom action, with a Lambda for example.

@RothAndrew For the CloudFormation discussion, this is quite a long debate. I don't like it either, but it gives us more flexibility on EC2 upgrades. This is a fact! In addition to @mlafeldt's #462 (comment), I'll add that upgrading node by node can also prevent you from hitting EC2 resource limits.

@max-rocket-internet So my proposal here so far is to only add:

  • an option to recreate ASG when LT or LC change => PR #465
  • an option to create initial life cycle hook

@RothAndrew
Contributor

RothAndrew commented Aug 7, 2019

I was opposed to adding CloudFormation to this Terraform module. This sounds more reasonable. I think it still needs some more discussion, but I'm less disgusted by the idea now. I share Max's hatred for all things CloudFormation.

@max-rocket-internet
Contributor

max-rocket-internet commented Aug 7, 2019

an option to recreate ASG when LT or LC change => PR #465

Looks OK to me. Even outside of k8s node updates this would be useful. Still keen to see what others think, though.

an option to create initial life cycle hook

As I understand, this is pointless without the CFN part?

I don't like it either, but it gives us more flexibility on EC2 upgrades

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

@barryib
Member Author

barryib commented Aug 7, 2019

As I understand, this is pointless without the CFN part?

This is not related to CloudFormation at all. A lifecycle hook is an Auto Scaling group feature. It can be added during ASG creation: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#initial_lifecycle_hook
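
For example, something like this (a sketch; the hook settings and variables are made up, and the notification target is owned by the user):

```hcl
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = "$Latest"
  }

  # Attached at ASG creation time, so no instance can terminate without
  # first passing through Terminating:Wait.
  initial_lifecycle_hook {
    name                    = "drain-on-scale-in"
    default_result          = "CONTINUE"
    heartbeat_timeout       = 300
    lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
    notification_target_arn = var.notification_target_arn
    role_arn                = var.lifecycle_hook_role_arn
  }
}
```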

I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda etc 😅

I totally disagree with the term "overkill" here. We are just offering users the ability to use AWS functionality (like Lambda, which is quite standard today) to achieve something on EKS worker nodes.

Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.

As you noticed, I have added #465 and #466 to give users the ability to handle this themselves.

@mlafeldt

FWIW, I actually tested the K8s node drainer with Terraform/CloudFormation: #333 (comment)

@max-rocket-internet
Contributor

Don't get me wrong, the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts.

OK cool!

As you noticed, I have added #465 and #466 to give users the ability to handle this themselves.

Great. Thanks for the effort, let's merge these 😃

@stale

stale bot commented Jan 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 3, 2020
@barryib
Member Author

barryib commented Jan 4, 2020

In the next couple of days, I'll add a small doc with links to some projects and issues to track in order to achieve this.

@stale stale bot removed the stale label Jan 4, 2020
@manfredlift

@barryib, I'm interested in learning about your findings and the best and cleanest way to achieve worker upgrades.

@stale

stale bot commented Apr 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 22, 2020
@barryib barryib removed the stale label Apr 22, 2020
@barryib
Member Author

barryib commented Apr 23, 2020

/remove stale

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 22, 2020
@stale

stale bot commented Aug 21, 2020

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Aug 21, 2020
@barryib
Member Author

barryib commented May 28, 2021

Closing, since instance_refresh is the recommended way to do this for self-managed worker groups.
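
For anyone landing here, enabling it on a self-managed group looks roughly like this (a sketch; names, sizes, and the threshold are just examples):

```hcl
resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-workers-"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.subnets

  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }

  # Roll instances automatically whenever the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90 # arbitrary example threshold
    }
  }
}
```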

@barryib barryib closed this as completed May 28, 2021
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022