The max pods bootstrap logic is incorrect especially when prefixes are used #782

Open · stevehipwell opened this issue Oct 12, 2021 · 8 comments

@stevehipwell (Contributor)

What happened:
When bootstrapping a new node, the memory reservation calculation uses the values in /etc/eks/eni-max-pods.txt without any option to customize this for the new prefix mode.

This behaviour has always been wrong: the USE_MAX_PODS variable can stop the values in /etc/eks/eni-max-pods.txt being used as the node's max pod limit, but it has no effect on the memory reservation. With the prefix logic the current behaviour isn't even close enough to be safe. For example, an m5.large instance couldn't have more than 29 pods without prefixes, so a custom max limit would be lower; with prefixes the recommended max pods would be 110, so the required reserves would need to be significantly higher and the current logic would be well off.

What you expected to happen:
The memory reservation should respect the USE_MAX_PODS variable. A new variable, backed by a new lookup file, should be added to support prefix mode; I'd suggest MAX_PODS_MODE, defaulting to eni with prefix as an option. It would be even better if a MAX_PODS variable could be added to support custom max pods with the correct resource reservations.
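
Purely to illustrate the suggestion, here's a rough sketch of how this could hang together in bootstrap.sh. MAX_PODS_MODE, MAX_PODS, INSTANCE_TYPE and the eni-max-pods-prefix.txt lookup file are hypothetical names for this sketch, not existing options:

```bash
# Hypothetical sketch only: MAX_PODS_MODE, MAX_PODS and eni-max-pods-prefix.txt
# do not exist in bootstrap.sh today; INSTANCE_TYPE is assumed to hold the
# instance type (e.g. looked up from IMDS).
MAX_PODS_MODE="${MAX_PODS_MODE:-eni}"   # "eni" keeps today's behaviour, "prefix" uses a new lookup file

if [[ -n "${MAX_PODS:-}" ]]; then
  # Explicit user override, used for both the kubelet limit and the reservations.
  max_pods="${MAX_PODS}"
elif [[ "${MAX_PODS_MODE}" == "prefix" ]]; then
  max_pods=$(awk -v t="${INSTANCE_TYPE}" '$1 == t {print $2}' /etc/eks/eni-max-pods-prefix.txt)
else
  max_pods=$(awk -v t="${INSTANCE_TYPE}" '$1 == t {print $2}' /etc/eks/eni-max-pods.txt)
fi

# Derive the memory reservation from the same max_pods value so the kubelet
# limit and kubeReserved can't drift apart (current formula: 11 MiB/pod + 255 MiB).
memory_to_reserve_mib=$((11 * max_pods + 255))
```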

How to reproduce it (as minimally and precisely as possible):
n/a

Anything else we need to know?:
I'd be happy to open a PR to fix this if it would actually be reviewed.

Environment:
n/a

@stevehipwell (Contributor, Author)

I've gone over the GKE node reservation guide (which EKS uses for CPU but not for memory) and the K8s large clusters guide, which leaves me with the following question/statement.

If EKS followed the K8s guidance of a maximum of 110 pods per node, then the same default calculation as used by GKE could easily be implemented in bootstrap.sh (even with 250 max pods this algorithm is better than the current EKS 11 * max_pods + 255 one, see below).

Here are the potential memory reservations with the existing ENI mode, ENI mode modified for prefixes, and the GCP algorithm.

| Instance | Memory (GiB) | ENI Reserved (GiB) | Prefix Reserved (GiB) | GCP Algorithm Reserved (GiB) |
| --- | --- | --- | --- | --- |
| t3.medium | 4.00 | 0.43 | 1.43 | 1.00 |
| m5.large | 8.00 | 0.56 | 1.43 | 1.80 |
| m5.xlarge | 16.00 | 0.87 | 1.43 | 2.60 |
| m5.2xlarge | 32.00 | 0.87 | 2.98 | 3.56 |
| m5.4xlarge | 64.00 | 2.76 | 2.98 | 5.48 |
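
For reference, here's a small script that reproduces the two formulas behind the table above. The GKE tiers used are the publicly documented 25% / 20% / 10% / 6% / 2% memory brackets, so treat this as a sketch of the calculation rather than either provider's exact implementation:

```bash
#!/usr/bin/env bash
# Compare the current EKS kubeReserved memory formula with a GKE-style tiered
# calculation. All results are in MiB.

# Current EKS formula: 11 MiB per pod plus a 255 MiB base.
eks_reserved_mib() {
  local max_pods=$1
  echo $((11 * max_pods + 255))
}

# GKE-style tiers: 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the
# next 8 GiB, 6% of the next 112 GiB, 2% of anything above 128 GiB.
gke_reserved_mib() {
  local mem_gib=$1
  awk -v m="$mem_gib" 'BEGIN {
    r = 0
    t = (m < 4)   ? m : 4;   r += t * 0.25; m -= t
    t = (m < 4)   ? m : 4;   r += t * 0.20; m -= t
    t = (m < 8)   ? m : 8;   r += t * 0.10; m -= t
    t = (m < 112) ? m : 112; r += t * 0.06; m -= t
    if (m > 0) r += m * 0.02
    printf "%.0f", r * 1024
  }'
}

# m5.large: 8 GiB, 29 pods with ENIs, 110 pods recommended with prefixes.
echo "ENI (29 pods):     $(eks_reserved_mib 29) MiB"    # 574 MiB  (~0.56 GiB)
echo "Prefix (110 pods): $(eks_reserved_mib 110) MiB"   # 1465 MiB (~1.43 GiB)
echo "GKE-style (8 GiB): $(gke_reserved_mib 8) MiB"     # 1843 MiB (~1.80 GiB)
```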

@stevehipwell (Contributor, Author)

For anyone wanting to solve this, the --kube-reserved kubelet flag will override the default values that bootstrap.sh sets in kubelet-config.json.
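
For example, in node user data (the cluster name and reservation values are placeholders, not recommendations):

```bash
# Kubelet command-line flags take precedence over kubelet-config.json, so the
# value passed here wins over the bootstrap default.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=memory=1843Mi,cpu=80m'
```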

@stevehipwell (Contributor, Author)

@abeer91 could you or someone else working on this project please respond?

@suket22 (Member) commented Feb 2, 2022

Hey Steve, apologies for not responding earlier.

> This behaviour has always been wrong: the USE_MAX_PODS variable can stop the values in /etc/eks/eni-max-pods.txt being used as the node's max pod limit, but it has no effect on the memory reservation.

We chose the GKE approach earlier but got a lot of feedback that it increased the number of instances needed to run the same workloads. This prompted us to come up with the existing formula for calculating kubeReserved (11 * NumPods + 255), on the basis of a bunch of tests here. I think the crux of the issue is that we previously thought of kubeReserved as a function of the number of pods running on the instance rather than as a function of the CPU or available memory on the instance. With PrefixDelegation and IPv6, the number of pods running on a node is no longer a function of the number of ENIs / secondary IPv4 addresses we can assign to the node, so we should have modeled kubeReserved as a function of memory / CPU as well.

Redoing our kubeReserved calculation is something we're looking to do. We've not gotten around to it yet, and it needs a lot of testing since the blast radius of getting it wrong is quite high.

You're right that this bug has always existed: we always take maxPods from the file and never use a maxPods passed into the bootstrap script via USE_MAX_PODS plus the kubeletExtraArgs flag, etc. While this logic doesn't really make sense and IMO needs to be rewritten entirely, the reason we've left it alone is that it gives us some form of kubeReserved laddering: smaller instance types had fewer possible assigned IPs (based on the values in the file) and therefore a smaller kubeReserved.

I've not heard of many issues with our current kubeReserved numbers being bad. If there are specific workloads you're running where the kubeReserved values (with 110 pods) aren't good, that would be helpful data.

> with prefixes the recommended max pods would be 110, so the required reserves would need to be significantly higher and the current logic would be well off.

I don't have results handy, but we did try simulating some pod churn on an m5.xlarge and I didn't see the kubelet struggle to keep up. This makes me think the numbers we have in place with the 11 * NumPods + 255 formula might actually err on the side of reserving more memory than necessary, rather than too little.

The upstream number of 110 was chosen a long time ago. I believe the tests were performed on an instance similar to a t3.large, and essentially static stability (based on the number of pods) and a constant rate of pod churn were tested. There have probably been lots of performance improvements to K8s and the container runtime since then, so I wouldn't rely on 110 as a magic number beyond which things will start to fail. @shyamjvs probably knows more about the genesis of 110. That's also the reason why we chose 250 when vCPU > 30. Both 30 and 250 were somewhat arbitrary, but in some initial testing we did find that large instances do well with 250 pods. Keep in mind that when vCPU is > 30, the number of pods in that file is likely to be 737, so we've got a lot of kubeReserved headroom.
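
As a rough sketch of the laddering being described (using the 30 vCPU / 250 pod cut-off and the 737-pod file value mentioned above; the threshold itself is taken from this thread, not from any published spec):

```bash
# Recommended prefix-mode pod cap versus the headroom implied by the file value.
vcpus=$(nproc)
if [ "$vcpus" -gt 30 ]; then
  recommended_max_pods=250
else
  recommended_max_pods=110
fi

echo "recommended: $((11 * recommended_max_pods + 255)) MiB reserved"   # 3005 MiB at 250 pods
echo "file value:  $((11 * 737 + 255)) MiB reserved"                    # 8362 MiB at 737 pods
```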

@stevehipwell (Contributor, Author)

@suket22 I've been working around this for a while now by setting max pods to 110 and overriding kubeReserved based on the GKE algorithm, and this has made a noticeable improvement in node stability. This is relevant to IPv4 prefix mode as well as IPv6.
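
For anyone wanting to replicate the workaround, the user data looks roughly like this (the memory figure is a placeholder for whatever the GKE-style calculation gives for the instance type):

```bash
# Workaround sketch: disable the file-based max pods, cap pods at 110 and
# override kubeReserved with a GKE-style value (e.g. ~1843Mi for an 8 GiB m5.large).
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110 --kube-reserved=memory=1843Mi,cpu=80m'
```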

Are there any plans to add some better defaults? Would you like a PR?

@suket22 (Member) commented Feb 2, 2022

Sorry, I hit enter too early on my previous post. We do intend to have better defaults, but there's no concrete timeline I can give. It's unlikely that our defaults will satisfy all use cases, so I'm not confident you'll be able to get rid of the kubeReserved override.

@stevehipwell (Contributor, Author)

Adding better inputs to bootstrap.sh to override the defaults would be a good start.

@tooptoop4

:shipit:
