How to handle worker upgrades automatically? #462
Thanks for the detailed issue @barryib!
I'm skeptical about including a Lambda function and all the other bits like SNS/SQS in this repo. It won't be simple and I'm not sure it belongs in the scope of this TF module. But we could just include an optional lifecycle hook.

Questions:
1. At what point is the ASG ever choosing to terminate instances? In my experience the cluster-autoscaler tells the ASG which instance to remove after it drains the node.
2. Where and why is CFN involved here?
But as soon as the ASG is deleted, all instances are terminated?
I would say hard no, as I hate CFN 😆 but I'm open to hearing other people's opinions.
Rock hard
OK, to answer my own questions:
1. This can only be achieved with RollingUpdate from CFN. As I understand it, this is not achievable in TF.
2. See above.

Overall I like the idea. I love AWS Lambda, and I would like to have automation around this process. But IIRC, this is what you are proposing:
I'm always open to other opinions, so add yours if it's missing, but I think most people won't be happy with this direction.
Also, the node update process really isn't that difficult. I mean, you could script what I wrote here in about 10 lines of shell, right?
Unlike Terraform, CloudFormation allows you to replace nodes in batches of N instances (plus you have resource signaling to indicate that an instance is actually ready). When N is 1 and you have some mechanism like the mentioned node drainer, you can safely update all worker nodes with minimal disruption. I recommend reading https://medium.com/@endofcake/using-terraform-for-zero-downtime-updates-of-an-auto-scaling-group-in-aws-60faca582664 on the subject.
It sounds interesting, but yeah, that's a no from me dawg (RandyJacksonMemeHere)
Sorry for the typo. My point is not to handle all of this with this module. In fact, this module should only provide something to trigger the change, and why not an option to create the initial lifecycle hook.
Yes, this is what I want. When coupled with a lifecycle hook, the ASG doesn't terminate the instance directly; it only puts the EC2 instance into Pending:Wait or Terminating:Wait. From there, you can run a custom action, with a Lambda for example.

@RothAndrew For the CloudFormation discussion, this is quite a long debate. I don't like it either, but it gives us more flexibility on EC2 upgrades. That is a fact! In addition to @mlafeldt's #462 (comment), I'll say that upgrading node by node can also prevent you from hitting EC2 resource limits.

@max-rocket-internet So my proposal here so far is to only add:
I was opposed to adding CloudFormation to this Terraform module. This sounds more reasonable. I think it still needs some more discussion, but I'm less disgusted by the idea now. I share Max's hatred for all things CloudFormation.
Looks OK to me. Even outside of k8s node updates this would be useful. Still keen to see what others think, though.
As I understand it, this is pointless without the CFN part?
I'm still thinking this is overkill for something that can be done in a short script. I'd rather have TF run a script to cycle through node draining than move to CFN, Lambda, etc. 😅
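For what it's worth, that direction could stay pretty small. A minimal sketch, assuming a hypothetical cycle-nodes.sh script that drains and replaces nodes one at a time (the resource and trigger are illustrative, not module code):

```hcl
resource "null_resource" "cycle_worker_nodes" {
  # Re-run the script whenever the launch template changes.
  triggers = {
    lt_version = aws_launch_template.workers.latest_version
  }

  provisioner "local-exec" {
    # Hypothetical script: drain each node with kubectl, then let the
    # ASG replace it with one built from the new launch template.
    command = "./cycle-nodes.sh ${aws_autoscaling_group.workers.name}"
  }
}
```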
This is not related to CloudFormation at all. Lifecycle hooks are an Auto Scaling group feature, and they can be added during ASG creation: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#initial_lifecycle_hook
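For reference, a minimal sketch of what such a hook could look like on the worker ASG (names and timeout values are illustrative):

```hcl
resource "aws_autoscaling_group" "workers" {
  # ... existing worker group arguments ...

  # Hold terminating instances in Terminating:Wait so a drainer
  # (e.g. a user-managed Lambda) can evict pods first.
  initial_lifecycle_hook {
    name                 = "node-drainer"
    lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    heartbeat_timeout    = 300
    default_result       = "CONTINUE"
  }
}
```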
I totally disagree with the term "overkill" here. We are just offering users the ability to use AWS functionality (like Lambda, which is quite standard today) to achieve something on EKS worker nodes. Don't get me wrong: the purpose of this module is not to create Lambda functions or any notifications with CloudWatch Events or SNS. Those will be created and maintained by users with their own TF scripts. As you noticed, I have added #465 and #466 to give users the ability to handle this by themselves.
FWIW, I actually tested the K8s node drainer with Terraform/CloudFormation: #333 (comment)
OK cool!
Great. Thanks for the effort, let's merge these 😃
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the next couple of days, I'll add a small doc with links to some projects and issues to track to achieve this.
@barryib, interested in learning about your findings and what's the best and cleanest way to achieve worker upgrades.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity since being marked as stale.
Closing, since instance_refresh is the recommended way to do this for self-managed worker groups.
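For anyone landing here later, a minimal sketch of instance_refresh on a self-managed worker ASG (the percentage is illustrative):

```hcl
resource "aws_autoscaling_group" "workers" {
  # ... existing worker group arguments ...

  # Roll worker instances when the launch template changes, keeping
  # at least two-thirds of the group in service during the refresh.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66
    }
  }
}
```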
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
I have issues
This module is great for deploying EKS clusters, but it has made the decision to leave worker upgrades out of its scope. This is OK for certain users, but for us, dealing with worker upgrades manually is a painful and repetitive task, especially when you have a lot of workers.
This issue is more of a discussion to decide whether we want to implement this and how we should handle it.
I'm submitting a...
Are you able to fix this problem and submit a PR? Link here if you have already.
Yes, I'll be happy to submit PRs for this. But before that, I want to know what direction I (we) should take for this.
To handle this, I would like to use Auto Scaling group lifecycle hooks to drain nodes during scale-in. I want to use a Lambda function which subscribes to autoscaling:EC2_INSTANCE_TERMINATING events and drains nodes before the ASG terminates EC2 instances. There is already a good proof of concept in aws-samples, called amazon-k8s-node-drainer.
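For illustration, the user-side wiring could look roughly like this (a sketch, not module code; the Lambda function and all resource names are assumed):

```hcl
# Hold terminating instances so the drainer has time to run.
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "drain-node"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300
  default_result         = "CONTINUE"
}

# Forward termination lifecycle events to the drainer Lambda.
resource "aws_cloudwatch_event_rule" "terminating" {
  event_pattern = jsonencode({
    source        = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance-terminate Lifecycle Action"]
  })
}

resource "aws_cloudwatch_event_target" "drainer" {
  rule = aws_cloudwatch_event_rule.terminating.name
  arn  = aws_lambda_function.node_drainer.arn
}

# The Lambda (not shown) drains the node via the Kubernetes API and then
# calls autoscaling:CompleteLifecycleAction to release the instance.
```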
By using ASG lifecycle hooks, we can achieve what @max-rocket-internet proposed in #412 (comment).
And by using both hooks and CloudFormation, we can tackle #333 (comment).
So, my point is NOT to handle all of this with this module, but I think it should allow users to decide whether or not to scale in nodes after an LT change, and let them handle node draining themselves.
My questions here are:

Should we add an option like worker_recreate_asg_when_lt_changes to let Terraform recreate the ASG when the aws_launch_template changes? This would only prefix the ASG name with the LT name.

Any other relevant info
Here are additional links for ASG lifecycle hooks:
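As additional context, a rough sketch of what the worker_recreate_asg_when_lt_changes idea above amounts to (the naming scheme is illustrative of the proposal, not existing module behavior):

```hcl
resource "aws_autoscaling_group" "workers" {
  # Embedding the launch template version in the name forces Terraform
  # to replace the whole ASG whenever the launch template changes.
  name_prefix = "${aws_launch_template.workers.name}-${aws_launch_template.workers.latest_version}-"

  # ... remaining worker group arguments ...

  lifecycle {
    # Bring the new ASG up before destroying the old one,
    # so capacity never drops to zero during the swap.
    create_before_destroy = true
  }
}
```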