Interactively Upgrade EKS Worker Node #57
Comments
Draining nodes before removing them is important for us as well. We are working on improving our automation to do this, but it would be nice if it were part of the worker upgrade process.
Ditto. This is a real pain point: it's just so ad-hoc/hacky. It really shouldn't involve interacting with CloudFormation directly (that's AWS's problem, not mine, IMO).
@mrichman @jaredeis @nukepuppy thanks for submitting this. Would you want this to update the …
I don't have enough K8s experience to know whether just updating kubelet in place is best practice, or whether draining the node first would be better. I envision a workflow where, for every node in the ASG, it replaces them one by one while draining the obsolete nodes. That's how we are trying to do it right now, but our current solution using SSM and Lambda doesn't seem to work at scale (it times out and the process fails).
I second @jaredeis's comments. Whatever the best practice is. My assumption is a rolling update, to preserve capacity in the ASG.
Would be great if the nodes are drained with respect to PodDisruptionBudgets: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#how-disruption-budgets-work I agree the current process is way too complex and would involve (a lot of) scripting to make this happen, in a sub-optimal way.
Thanks for that, @mrichman and @jaredeis; really helpful. If you wouldn't mind, as you go through your updates: this was something that was discussed yesterday at the SIG-AWS meeting at KubeCon and brought up by @spiffxp. Overall, sig-cluster-lifecycle needs help in discussing the ways people are upgrading their clusters, in order to get to a prescriptive approach for doing so. That means your experiences are incredibly useful. If you could document, take notes, and share what you are doing and what is working well vs. not, you will help not only this proposed feature but also the Kubernetes community as a whole.
Is this not what you are seeing when you …
As I understand it, drain does respect PDBs, so that shouldn't be an issue as long as you have proper PDBs in place. @christopherhein, where would you like me to send what we are trying now (that's not really working)? I would be happy to do so.
@jaredeis Would love to review your strategy as well. Could be too noisy to post here, but perhaps a link to a Google Doc or similar?
If you and your org are okay with it, you could post that here, or maybe a … Whatever the medium, it would be nice to make sure we can also get this into the hands of sig-cluster-lifecycle.
@christopherhein @jaredeis yes, you are correct. The drain command respects PodDisruptionBudgets, according to the docs. It would be great if that were supported as part of an interactive upgrade. To clarify, cordon by itself doesn't replace nodes; it only marks them as unschedulable for "new" pods (see the sketch below).
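To make the cordon/drain distinction concrete, here is a minimal sketch; the PDB name, label selector, and node name are hypothetical, not from this thread:

```bash
# Keep at least 2 replicas of the app available during voluntary evictions
# ("my-app-pdb" and app=my-app are hypothetical names)
kubectl create poddisruptionbudget my-app-pdb \
  --selector=app=my-app --min-available=2

# cordon only marks the node unschedulable; existing pods keep running
kubectl cordon ip-10-0-1-23.ec2.internal

# drain cordons the node AND evicts pods through the eviction API,
# which refuses evictions that would violate a PodDisruptionBudget
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets
```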
I will work up something tomorrow and post the link here. Thanks for listening to customers!
Of course, thank you for being a part of this!
With GKE you can just set the node pool to auto-upgrade to the master version, which would be nice.
Do you know what strategy GKE uses to perform the upgrade? |
I'm not sure if it has a specific name, but it's a rolling update of the worker nodes, one at a time, within the pool |
We do this at the moment with a combination of: …
and then: …
…and then waiting for all nodes to be in the `Ready` state (a rough sketch follows below).
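A rough sketch of that sequence, assuming unmanaged workers behind a CloudFormation-managed ASG; the stack name, parameter name, AMI ID, and node name are assumptions, not the commenter's actual commands:

```bash
# 1. Push the new worker AMI through the CloudFormation stack
#    ("eks-workers" and NodeImageId are assumed names/parameters)
aws cloudformation update-stack \
  --stack-name eks-workers \
  --use-previous-template \
  --parameters ParameterKey=NodeImageId,ParameterValue=ami-0123456789abcdef0 \
  --capabilities CAPABILITY_IAM

# 2. Drain each old node so workloads shift to the replacements
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets

# 3. Wait for every node to report Ready
kubectl get nodes --watch
```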
Maybe this will be enough info, but still short enough to fit here. Our requirement from the architect was that any change to the ASG (userdata, AMI, etc.) would trigger a node drain before the existing node is removed. So one of our engineers came up with this solution (sketched below): …
This worked fine in testing, but now that we have devs using it and there are more active pods on the nodes, the drain times out, since a Lambda can only run for 15 minutes. We have a 15-minute pause time in our ASG between each node, but the drain function the Lambda runs has an 8-minute timeout so the whole Lambda has time to complete. So we can only do 4 nodes per hour, which isn't good. Obviously this is complex, and it's not working at scale anyway, as some nodes take more than 8 minutes to drain. I know there are probably improvements we could make to this design, but there has to be a better way than this.
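For context, a minimal sketch of the lifecycle-hook pattern described above, shown with the AWS CLI instead of the team's actual Lambda; the hook, group, node, and instance names are hypothetical:

```bash
# Pause terminating instances in Terminating:Wait until the drain finishes
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name eks-workers \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 900

# The hook would trigger automation (a Lambda in the setup above) that
# drains the node...
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets

# ...and then releases the instance for termination
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name eks-workers \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0
```

For drains that outlast a single invocation, `aws autoscaling record-lifecycle-action-heartbeat` can extend the wait, rather than squeezing the whole drain into one Lambda timeout.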
I think reviewing just what a kops rolling update does is worthwhile: https://github.com/kubernetes/kops/blob/master/docs/cli/kops_rolling-update.md (a minimal example is below). In theory, a button that performs a kops-style rolling update would mostly just work (even though there be dragons sometimes, ya know). What the "button" would do behind the scenes in the context of EKS sounds like it's already understood, just not wired together, so end users need to wire it up themselves, which is probably not the best experience UX-wise. What seems to be desired is some command or process that, once issued, leaves the cluster in the desired state at the end. However, it's a bit fuzzy how much we want end users to control the nodes: if a lot (user_data scripts for monitoring/asset tracking, system users, or other processes), it becomes a double-edged sword in a way.
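For reference, the kops flow being pointed at is a two-step CLI interaction (the cluster name is taken from kops' own configuration):

```bash
# Dry run by default: shows which instance groups/nodes need replacing
kops rolling-update cluster

# Apply: cordon, drain, terminate, and validate each node in turn
kops rolling-update cluster --yes
```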
@jaredeis Could you use a Step Function to watch the result of the drain, instead of relying strictly on the Lambda? That way you won't have the issue with the Lambda timeout.
@jlongtine If we decide to improve the functionality as is, then yes, I was going to work on some way to make this more event-based. However, I think we are going to have to rethink the whole thing and stand up another ASG and migrate to it. We just have to figure out how to automate that and keep our existing pipeline for deploying EKS (and a few other components) working idempotently.
This is partially fulfilled by #139
We have implemented kops-like rolling update functionality here, which works fine for us.
Support for worker node upgrades is now available through the EKS API and Management Console.
@mikestef9 Hey, why is this on "coming soon" status? :) |
@inductor our automation missed it! Moving now. |
Upgrading EKS worker nodes to the latest version (or to a specific version) should be as easy as clicking a button in the management console, or a single AWS CLI command.
For comparison, ECS offers the ability to upgrade the agent using both the management console and CLI.
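As of the managed node groups launch mentioned earlier in the thread, the upgrade this issue asked for reduces to one call; the cluster and node group names here are placeholders:

```bash
# Roll the node group to the cluster's Kubernetes version; EKS replaces
# nodes incrementally and respects PodDisruptionBudgets while draining
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup
```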