Support Upgrade Existing EKS Kubernetes #348
Comments
This would be awesome because this sucks.
Let's write down semi-manual instructions first (see #357 (comment)); it should become clear what needs automating from there. cc @tiffanyfay
What @tiffanyfay and I have:
We might also have to upgrade kube-proxy from 1.10 to 1.11. Need more info.
If going from 1.10 to 1.11 then also swap kube-dns for CoreDNS.
Good point @mrichman.
What @tiffanyfay and I have:

> 1. Create new node group: `eksctl create nodegroup`
> 2. Add new SG as ingress for old SG, and old SG as ingress for new SG

What is this supposed to accomplish?

> 3. Check if cluster-autoscaler is installed; if so, scale it down to 0
> 4. Scale kube-dns up by 1
> 5. Taint all old nodes: `kubectl taint nodes node_name key=value:NoSchedule`

Not sure this is really needed; drain accomplishes this as far as I know.

> 6. Drain all nodes: `kubectl drain node_name --ignore-daemonsets --delete-local-data`

Why is `--ignore-daemonsets` needed here?

> 7. Once all nodes are drained, remove the added SG ingress
> 8. Delete old node group and remove its IAM role from the aws-auth configmap
> 9. If cluster-autoscaler is installed, scale it back to the original replica count

By the way, does it work with multiple ASGs?

> 10. Scale kube-dns down by 1
>
> We might also have to upgrade kube-proxy from 1.10 to 1.11. Need more info.

As Mark mentioned, there is going to be a flip to CoreDNS. Is there some kind of official EKS method for this?
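The semi-manual steps above might be sketched as a shell script. This is a hypothetical sketch, not eksctl's implementation: the cluster/nodegroup names and the taint key are placeholders, the `alpha.eksctl.io/nodegroup-name` node label is assumed, and the SG ingress wiring (steps 2 and 7) is omitted since it is still under discussion.

```shell
#!/bin/sh
# Hypothetical sketch of the migration steps; names are placeholders.
CLUSTER=my-cluster
OLD_NG=ng-1-10
NEW_NG=ng-1-11

# 1. Create the new node group
eksctl create nodegroup --cluster "$CLUSTER" --name "$NEW_NG"

# 3. If cluster-autoscaler is installed, scale it down to 0
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0 || true

# 4. Scale kube-dns up by 1 so DNS stays available during the drain
REPLICAS=$(kubectl -n kube-system get deployment kube-dns -o jsonpath='{.spec.replicas}')
kubectl -n kube-system scale deployment kube-dns --replicas=$((REPLICAS + 1))

# 5-6. Taint, then drain, every node in the old group
for NODE in $(kubectl get nodes -l "alpha.eksctl.io/nodegroup-name=$OLD_NG" -o name); do
  NAME=${NODE#node/}   # "-o name" yields "node/<name>"
  kubectl taint nodes "$NAME" upgrade=true:NoSchedule
  kubectl drain "$NAME" --ignore-daemonsets --delete-local-data
done

# 8. Delete the old node group (then remove its IAM role from aws-auth)
eksctl delete nodegroup --cluster "$CLUSTER" --name "$OLD_NG"

# 9-10. Restore cluster-autoscaler and kube-dns to their original replica counts
kubectl -n kube-system scale deployment kube-dns --replicas="$REPLICAS"
```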
@errordeveloper for the 1.11 upgrade, I don't believe so. I'll talk with the team. And if/when we are good with the steps, I'll work on an update API/command when I'm back to work next week.
We also need to update kube-proxy in the list above: https://docs.aws.amazon.com/eks/latest/userguide/coredns.html
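For the kube-proxy piece, the linked AWS doc boils down to bumping the image tag on the kube-proxy DaemonSet. The registry account, region, and version tag below are illustrative, so check the doc for the values matching your cluster's region and Kubernetes version:

```shell
# Illustrative values -- verify against the AWS doc for your region/version.
REGION=us-west-2
KUBE_PROXY_IMAGE="602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/kube-proxy:v1.11.5"

# Roll the kube-proxy DaemonSet onto the new image
kubectl -n kube-system set image daemonset/kube-proxy \
  "kube-proxy=${KUBE_PROXY_IMAGE}"
```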
Answering my own questions.
So one cannot normally delete daemonset-owned pods. I still don't get why, but anyway...
Yes, cluster-autoscaler is capable of discovering nodegroups.
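For reference, a single cluster-autoscaler can watch multiple ASGs via tag-based auto-discovery rather than a fixed list (flag from the cluster-autoscaler AWS docs; the cluster name in the second tag is illustrative):

```shell
# cluster-autoscaler args (illustrative): discover every ASG carrying
# these tags instead of listing each nodegroup's ASG explicitly.
cluster-autoscaler \
  --cloud-provider=aws \
  --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```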
I am still not clear on why we need to wire up a temporary SG? And what does
Yeah, the temp SG connection between the two ASGs allows cross-service traffic while you drain nodes. So if you have pods running on both sets of ASGs and a service on the new ASG tries to route to a pod running on the old ASG, it can still make the connection during the switch. The cordon/drain vs NoSchedule distinction is very nuanced: if you cordon, it will start to remove the pods from Services, so doing this takes down your environment if you haven't already moved the workloads manually somehow. So instead we just taint with NoSchedule. Make sense?
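The taint-vs-cordon distinction described above can be sketched with kubectl; the node name is a placeholder:

```shell
# Tainting with NoSchedule only blocks *new* pods from landing on the node,
# so existing pods keep serving traffic until they are explicitly drained.
kubectl taint nodes ip-10-0-1-1.ec2.internal upgrade=true:NoSchedule

# By contrast, cordon marks the node unschedulable immediately (and, per the
# comment above, starts pulling it out of Services), and drain then evicts
# the remaining pods onto the new node group.
kubectl cordon ip-10-0-1-1.ec2.internal
kubectl drain ip-10-0-1-1.ec2.internal --ignore-daemonsets --delete-local-data
```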
Thanks, Chris! Do we strictly need the temporary SG? At the moment we are still debating what level of isolation nodegroups should have (see #419), but I think if there is no isolation (for ordinary ports), we don't need the temporary SG, unless I am missing something?
A short summary on #419: I'm going to work on adding a shared SG for all nodes, so that all node groups are actually equal; there will be options to enable isolation for those who need it. Adding this SG also means that we will have to add plumbing/mechanics for making changes to the cluster stack, which will help future work on upgrades in general.
We should turn #348 (comment) into an actual proposal and write down a basic CLI design. I think we are pretty close to having this implemented.
@errordeveloper would you call this done? I think we should close.
Yes, I think it is!
Why do you want this feature?
EKS currently has clusters running version 1.10; this would add a mechanism to upgrade existing clusters to 1.11.
What feature/behavior/change do you want?
I'd like to have a conversation about best practices for how we should support this.
This is an extension of #344.