Support Upgrade Existing EKS Kubernetes #348

christopherhein · 2018-12-17T10:17:16Z

Why do you want this feature?
EKS currently has clusters running with 1.10 versions, this would add a new mechanism to upgrade existing clusters to 1.11.

What feature/behavior/change do you want?
I'd like to have the conversation about best practices for how we should allow for this?

Should we upgrade the nodes automatically?
How should we do this to reduce human error? Interactively Upgrade EKS Worker Node aws/containers-roadmap#57

This is the extension from #344

mrichman · 2018-12-17T20:50:34Z

This would be awesome because this sucks.

errordeveloper · 2018-12-29T09:04:58Z

Let's write down semi-manual instructions first (see #357 (comment)), it should become clear what needs automating from there.

cc @tiffanyfay

christopherhein · 2019-01-11T01:37:00Z

mrichman · 2019-01-11T02:35:02Z

If going from 1.10 to 1.11 then also swap kube-dns for CoreDNS.

christopherhein · 2019-01-11T03:46:19Z

Good point @mrichman.

errordeveloper · 2019-01-11T11:12:05Z

What @tiffanyfay <https://github.com/tiffanyfay> and I have: 1. Create new node group eksctl create nodegroup 2. Add new sg as ingress for old sg and old sg to new sg as ingress

What is this supposed to accomplish?

3. check if cluster-autoscaler is installed if so scale down to 0 4. scale kube-dns by 1 5. taint all old nodes kubectl taint nodes node_name key=value:NoSchedule

Not sure this is really needed, drain accomplishes this as far as I know.

6. drain all nodes kubectl drain node_name --ignore-daemonsets --delete-local-data

Why is '--ignore-daemonsets' needed here?

7. once all nodes are drained remove added sg ingress 8. delete old node group and remove IAM role from the aws-auth configmap 9. if cluster-autoscaler is installed if so scale back to original

By the way, does it work with multiple ASGs?

10. scale kube-dns down by 1 We might also have to upgrade kube-proxy from 1.10 to 1.11. Need more info.

As Mark mentiined, there is going to be a flip to coredns, is there some kind of official EKS method for this?

tiffanyfay · 2019-01-11T20:20:03Z

@errordeveloper for the 1.11 upgrade, I don't believe so. I'll talk with the team.

And if/when we are good with the steps, I'll work on an update API/command when I'm back to work next week.

tiffanyfay · 2019-01-11T20:26:31Z

We also need to update kube-proxy in the list above.

https://docs.aws.amazon.com/eks/latest/userguide/coredns.html

errordeveloper · 2019-01-14T23:29:29Z

Answering my own questions.

Why is '--ignore-daemonsets' needed here?

So one cannot normally delete deamonset-owned pods. I still don't get why, but anyway...

By the way, does it work with multiple ASGs?

Yes, cluster autoscaler is capable of discovering nodegroups.

errordeveloper · 2019-01-14T23:29:33Z

I am still not clear on why we need to wire up a temporary SG? And what does key=value:NoSchedule that cordon/drain doesn't accomplish already?

christopherhein · 2019-01-14T23:36:18Z

Answering my own questions.

Why is '--ignore-daemonsets' needed here?

So one cannot normally delete deamonset-owned pods. I still don't get why, but anyway...

By the way, does it work with multiple ASGs?

Yes, cluster autoscaler is capable of discovering nodegroups.

Yeah, the --ignore-daemonsets is necessary or kubectl won't work to drain, didn't look into the full background for why. The reality is it doesn't matter for DS' cause as your new ASG came up and was available the DS' would have been schedule automatically.

I am still not clear on why we need to wire up a temporary SG? And what does key=value:NoSchedule that cordon/drain doesn't accomplish already?

The temp SG connection between the two ASGs allows cross service traffic while you drain nodes. So if you have pods running on both sets of ASGs and a service on the new ASG tries to route to a pod running on the old ASG it can still make the connection during the switch.

The Cordon/Drain vs NoSchedule is very nuanced. If you Cordon it will start to remove the pods from Services so doing this takes down your environment if you haven't already moved the workloads manually somehow. So instead we just NoSchedule to allow the new nodes to be the only nodes Schedule-able, then drain which will move them to the new instances.

Make sense?

errordeveloper · 2019-01-15T00:37:14Z

Thanks, Chris! Do we strictly need the temporary SG? At the moment we are still debating what level of isolation nodegroup should have (see #419), but I think it there is no isolation (for ordinary ports), we don't need the temporary SG, unless I am missing something?

errordeveloper · 2019-01-15T10:09:51Z

A short summary on #419 - I'm going to work on adding shared SG for all nodes, so that all node groups are actually equal, there will be options to enable isolation for those who need it. Adding this SG also means that we will have to add plumbing/mechanics for making changes to cluster stack, which will help for future work on upgrades in general.

errordeveloper · 2019-02-28T15:31:42Z

We should turn #348 (comment) into an actual proposal and write down basic CLI design. I think we are pretty close to having this implemented.

christopherhein · 2019-03-28T18:45:23Z

@errordeveloper would you call this done? I think we should close.

errordeveloper · 2019-03-28T19:35:21Z

Yes, I think it is!

Update readme

This was referenced Dec 22, 2018

drain nodegroup on deletion #370

Closed

How to upgrade existing worker node group? #357

Closed

christopherhein mentioned this issue Jan 22, 2019

[EKS] Cloudformation support for cluster upgrades aws/containers-roadmap#115

Closed

tiffanyfay mentioned this issue Feb 9, 2019

Cfgmap ng delete #528

Merged

6 tasks

errordeveloper added the area/upgrades label Feb 13, 2019

christopherhein changed the title ~~Support Upgrade Existing EKS Kubernetes 1.10 -> 1.11~~ Support Upgrade Existing EKS Kubernetes Feb 21, 2019

kingdonb mentioned this issue Feb 27, 2019

Update cluster using CLI #323

Closed

errordeveloper closed this as completed Mar 28, 2019

tahirali-endurance mentioned this issue Apr 2, 2020

Applying security patches on unmanaged nodes #2001

Closed

torredil pushed a commit to torredil/eksctl that referenced this issue May 20, 2022

Merge pull request eksctl-io#348 from leakingtapan/readme

ae028c8

Update readme

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Upgrade Existing EKS Kubernetes #348

Support Upgrade Existing EKS Kubernetes #348

christopherhein commented Dec 17, 2018

mrichman commented Dec 17, 2018

errordeveloper commented Dec 29, 2018 •

edited

Loading

christopherhein commented Jan 11, 2019 •

edited by errordeveloper

Loading

mrichman commented Jan 11, 2019

christopherhein commented Jan 11, 2019

errordeveloper commented Jan 11, 2019 via email •

edited

Loading

tiffanyfay commented Jan 11, 2019

tiffanyfay commented Jan 11, 2019

errordeveloper commented Jan 14, 2019

errordeveloper commented Jan 14, 2019 •

edited

Loading

christopherhein commented Jan 14, 2019 •

edited

Loading

errordeveloper commented Jan 15, 2019

errordeveloper commented Jan 15, 2019 •

edited

Loading

errordeveloper commented Feb 28, 2019

christopherhein commented Mar 28, 2019

errordeveloper commented Mar 28, 2019

Support Upgrade Existing EKS Kubernetes #348

Support Upgrade Existing EKS Kubernetes #348

Comments

christopherhein commented Dec 17, 2018

mrichman commented Dec 17, 2018

errordeveloper commented Dec 29, 2018 • edited Loading

christopherhein commented Jan 11, 2019 • edited by errordeveloper Loading

mrichman commented Jan 11, 2019

christopherhein commented Jan 11, 2019

errordeveloper commented Jan 11, 2019 via email • edited Loading

tiffanyfay commented Jan 11, 2019

tiffanyfay commented Jan 11, 2019

errordeveloper commented Jan 14, 2019

errordeveloper commented Jan 14, 2019 • edited Loading

christopherhein commented Jan 14, 2019 • edited Loading

errordeveloper commented Jan 15, 2019

errordeveloper commented Jan 15, 2019 • edited Loading

errordeveloper commented Feb 28, 2019

christopherhein commented Mar 28, 2019

errordeveloper commented Mar 28, 2019

errordeveloper commented Dec 29, 2018 •

edited

Loading

christopherhein commented Jan 11, 2019 •

edited by errordeveloper

Loading

errordeveloper commented Jan 11, 2019 via email •

edited

Loading

errordeveloper commented Jan 14, 2019 •

edited

Loading

christopherhein commented Jan 14, 2019 •

edited

Loading

errordeveloper commented Jan 15, 2019 •

edited

Loading