The max pods bootstrap logic is incorrect especially when prefixes are used #782

Open · stevehipwell opened this issue Oct 12, 2021 · 8 comments

@stevehipwell (Contributor)

What happened:
When bootstrapping a new node, the memory reservation calculation uses the values in /etc/eks/eni-max-pods.txt without any option to customize this for the new prefix mode.

This behaviour has always been wrong: the USE_MAX_PODS variable can stop the values in /etc/eks/eni-max-pods.txt being used as the node's max pod limit, but it has no effect on the memory reservation. With the prefix logic the current behaviour isn't even close enough to be safe. For example, an m5.large instance couldn't have more than 29 pods without prefixes, so a custom max limit would be lower; with prefixes the recommended max pods would be 110, so the required reserves would need to be significantly higher and the current logic would be well off.

What you expected to happen:
The memory reservation should respect the USE_MAX_PODS variable. A new variable, backed by a new lookup file, should be added to support prefix mode; I'd suggest MAX_PODS_MODE, defaulting to eni with prefix as an option. It would be even better if a MAX_PODS variable could be added to support custom max pods with the correct resource reservations.
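
Purely to illustrate the suggestion, here's a rough sketch of how this could hang together in bootstrap.sh. MAX_PODS_MODE, MAX_PODS, INSTANCE_TYPE and the eni-max-pods-prefix.txt lookup file are hypothetical names for this sketch, not existing options:

```bash
# Hypothetical sketch only: MAX_PODS_MODE, MAX_PODS and eni-max-pods-prefix.txt
# do not exist in bootstrap.sh today; INSTANCE_TYPE is assumed to hold the
# instance type (e.g. looked up from IMDS).
MAX_PODS_MODE="${MAX_PODS_MODE:-eni}"   # "eni" keeps today's behaviour, "prefix" uses a new lookup file

if [[ -n "${MAX_PODS:-}" ]]; then
  # Explicit user override, used for both the kubelet limit and the reservations.
  max_pods="${MAX_PODS}"
elif [[ "${MAX_PODS_MODE}" == "prefix" ]]; then
  max_pods=$(awk -v t="${INSTANCE_TYPE}" '$1 == t {print $2}' /etc/eks/eni-max-pods-prefix.txt)
else
  max_pods=$(awk -v t="${INSTANCE_TYPE}" '$1 == t {print $2}' /etc/eks/eni-max-pods.txt)
fi

# Derive the memory reservation from the same max_pods value so the kubelet
# limit and kubeReserved can't drift apart (current formula: 11 MiB/pod + 255 MiB).
memory_to_reserve_mib=$((11 * max_pods + 255))
```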

How to reproduce it (as minimally and precisely as possible):
n/a

Anything else we need to know?:
I'd be happy to open a PR to fix this if it would actually be reviewed.

Environment:
n/a

@stevehipwell (Contributor, Author)

I've gone over the GKE node reservation guide (which EKS uses for CPU but not for memory) and the K8s large clusters guide, which leaves me with the following question/statement.

If EKS followed the K8s guidance of a maximum of 110 pods per node, then the same default calculation as used by GKE could easily be implemented in bootstrap.sh (even with 250 max pods this algorithm is better than the current EKS 11 * max_pods + 255 one, see below).

Here are the potential memory reservations with the existing ENI mode, ENI mode modified for prefixes, and the GCP algorithm.

| Instance | Memory (GiB) | ENI Reserved (GiB) | Prefix Reserved (GiB) | GCP Algorithm Reserved (GiB) |
| --- | --- | --- | --- | --- |
| t3.medium | 4.00 | 0.43 | 1.43 | 1.00 |
| m5.large | 8.00 | 0.56 | 1.43 | 1.80 |
| m5.xlarge | 16.00 | 0.87 | 1.43 | 2.60 |
| m5.2xlarge | 32.00 | 0.87 | 2.98 | 3.56 |
| m5.4xlarge | 64.00 | 2.76 | 2.98 | 5.48 |
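
For reference, here's a small script that reproduces the two formulas behind the table above. The GKE tiers used are the publicly documented 25% / 20% / 10% / 6% / 2% memory brackets, so treat this as a sketch of the calculation rather than either provider's exact implementation:

```bash
#!/usr/bin/env bash
# Compare the current EKS kubeReserved memory formula with a GKE-style tiered
# calculation. All results are in MiB.

# Current EKS formula: 11 MiB per pod plus a 255 MiB base.
eks_reserved_mib() {
  local max_pods=$1
  echo $((11 * max_pods + 255))
}

# GKE-style tiers: 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the
# next 8 GiB, 6% of the next 112 GiB, 2% of anything above 128 GiB.
gke_reserved_mib() {
  local mem_gib=$1
  awk -v m="$mem_gib" 'BEGIN {
    r = 0
    t = (m < 4)   ? m : 4;   r += t * 0.25; m -= t
    t = (m < 4)   ? m : 4;   r += t * 0.20; m -= t
    t = (m < 8)   ? m : 8;   r += t * 0.10; m -= t
    t = (m < 112) ? m : 112; r += t * 0.06; m -= t
    if (m > 0) r += m * 0.02
    printf "%.0f", r * 1024
  }'
}

# m5.large: 8 GiB, 29 pods with ENIs, 110 pods recommended with prefixes.
echo "ENI (29 pods):     $(eks_reserved_mib 29) MiB"    # 574 MiB  (~0.56 GiB)
echo "Prefix (110 pods): $(eks_reserved_mib 110) MiB"   # 1465 MiB (~1.43 GiB)
echo "GKE-style (8 GiB): $(gke_reserved_mib 8) MiB"     # 1843 MiB (~1.80 GiB)
```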

@stevehipwell (Contributor, Author)

For anyone wanting to solve this, the --kube-reserved kubelet flag will override the default values that bootstrap.sh sets in kubelet-config.json.
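
For example, in node user data (the cluster name and reservation values are placeholders, not recommendations):

```bash
# Kubelet command-line flags take precedence over kubelet-config.json, so the
# value passed here wins over the bootstrap default.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=memory=1843Mi,cpu=80m'
```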

@stevehipwell (Contributor, Author)

@abeer91 could you or someone else working on this project please respond?

@suket22 (Member) commented Feb 2, 2022

Hey Steve, apologies for not responding earlier.

> This behaviour has always been wrong: the USE_MAX_PODS variable can stop the values in /etc/eks/eni-max-pods.txt being used as the node's max pod limit, but it has no effect on the memory reservation.

We chose the GKE approach earlier but got a lot of feedback that it increased the number of instances needed to run the same workloads. This prompted us to come up with the existing formula for calculating kubeReserved (11 * NumPods + 255), on the basis of a bunch of tests here. I think the crux of the issue is that we previously thought of kubeReserved as a function of the number of pods running on the instance rather than as a function of the CPU or available memory on the instance. With PrefixDelegation and IPv6, the number of pods running on a node is no longer a function of the number of ENIs / secondary IPv4 addresses we can assign to the node, so we should have modeled kubeReserved as a function of memory / CPU as well.

Redoing our kubeReserved calculation is something we're looking to do. We've not gotten around to it yet, and it needs a lot of testing since the blast radius of getting it wrong is quite high.

You're right that this bug has always existed: we always take maxPods from the file and never use a maxPods passed into the bootstrap script via USE_MAX_PODS plus the kubeletExtraArgs flag, etc. While this logic doesn't really make sense and IMO needs to be rewritten entirely, the reason we've left it alone is that it gives us some form of kubeReserved laddering: smaller instance types had fewer possible assigned IPs (based on the values in the file) and therefore a smaller kubeReserved.

I've not heard of many issues with our current kubeReserved numbers being bad. If there are specific workloads you're running where the kubeReserved values (with 110 pods) aren't good, that would be helpful data.

> with prefixes the recommended max pods would be 110, so the required reserves would need to be significantly higher and the current logic would be well off.

I don't have results handy, but we did try simulating some pod churn on an m5.xlarge and I didn't see the kubelet struggle to keep up. This makes me think the numbers we have in place with the 11 * NumPods + 255 formula might actually err on the side of reserving more memory than necessary, rather than too little.

The upstream number of 110 was chosen a long time ago. I believe the tests were performed on an instance similar to a t3.large, and essentially static stability (based on the number of pods) and a constant rate of pod churn were tested. There have probably been lots of performance improvements to K8s and the container runtime since then, so I wouldn't rely on 110 as a magic number beyond which things will start to fail. @shyamjvs probably knows more about the genesis of 110. That's also the reason why we chose 250 when vCPU > 30. Both 30 and 250 were somewhat arbitrary, but in some initial testing we did find that large instances do well with 250 pods. Keep in mind that when vCPU is > 30, the number of pods in that file is likely to be 737, so we've got a lot of kubeReserved headroom.
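
As a rough sketch of the laddering being described (using the 30 vCPU / 250 pod cut-off and the 737-pod file value mentioned above; the threshold itself is taken from this thread, not from any published spec):

```bash
# Recommended prefix-mode pod cap versus the headroom implied by the file value.
vcpus=$(nproc)
if [ "$vcpus" -gt 30 ]; then
  recommended_max_pods=250
else
  recommended_max_pods=110
fi

echo "recommended: $((11 * recommended_max_pods + 255)) MiB reserved"   # 3005 MiB at 250 pods
echo "file value:  $((11 * 737 + 255)) MiB reserved"                    # 8362 MiB at 737 pods
```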

@stevehipwell (Contributor, Author)

@suket22 I've been working around this for a while now by setting max pods to 110 and overriding kubeReserved based on the GKE algorithm, and this has made a noticeable improvement in node stability. This is relevant to IPv4 prefix mode as well as IPv6.
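
For anyone wanting to replicate the workaround, the user data looks roughly like this (the memory figure is a placeholder for whatever the GKE-style calculation gives for the instance type):

```bash
# Workaround sketch: disable the file-based max pods, cap pods at 110 and
# override kubeReserved with a GKE-style value (e.g. ~1843Mi for an 8 GiB m5.large).
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110 --kube-reserved=memory=1843Mi,cpu=80m'
```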

Are there any plans to add some better defaults? Would you like a PR?

@suket22 (Member) commented Feb 2, 2022

Sorry, I hit enter too early on my previous post. We do intend to have better defaults, but there's no concrete timeline I can give. It's unlikely that our defaults will satisfy all use cases, so I'm not confident you'll be able to get rid of the kubeReserved override.

@stevehipwell (Contributor, Author)

Adding better inputs to bootstrap.sh to override the defaults would be a good start.

@tooptoop4

:shipit:
