The max pods bootstrap logic is incorrect especially when prefixes are used #782
I've gone over the GKE node reservation guide (which EKS uses for CPU but not for memory) and the K8s large clusters guide, which leaves me with the following question/statement. If EKS followed the K8s guidance of using a maximum of 110 pods per node, then the same default calculation as used by GKE could easily be implemented in bootstrap.sh (even with 250 max pods this algorithm is better than the current EKS one). Here are the potential memory reservations with the existing ENI mode, ENI mode modified for prefixes, and the GCP algorithm:
[table: memory reservations under existing ENI mode, ENI mode with prefixes, and the GCP algorithm; values not preserved]
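To make the comparison concrete, here is a rough sketch of the two formulas. Treat the specifics as my reading of the guides rather than anything authoritative: I'm assuming the current EKS reservation is 255 MiB plus 11 MiB per max pod, and that GKE's memory tiers are 25%/20%/10%/6%/2%.

```bash
#!/usr/bin/env bash
# Sketch comparing the two reservation formulas discussed above.
# Assumptions (not from this thread's table): EKS reserves 255 MiB + 11 MiB per
# max pod; GKE reserves a tiered percentage of total machine memory.

machine_mib=$1   # total node memory in MiB, e.g. 8192 for an m5.large
max_pods=$2      # e.g. the value from /etc/eks/eni-max-pods.txt, or 110 with prefixes

# Current EKS-style reservation: linear in the max pod count.
eks_reserved_mib=$((255 + 11 * max_pods))

# GKE-style reservation: independent of the pod count, tiered on machine memory.
gke_reserved_mib=$(awk -v m="$machine_mib" 'BEGIN {
  if (m < 1024) { print 255; exit }
  r = 0.25 * (m < 4096 ? m : 4096)
  if (m > 4096)   r += 0.20 * ((m < 8192   ? m : 8192)   - 4096)
  if (m > 8192)   r += 0.10 * ((m < 16384  ? m : 16384)  - 8192)
  if (m > 16384)  r += 0.06 * ((m < 131072 ? m : 131072) - 16384)
  if (m > 131072) r += 0.02 * (m - 131072)
  printf "%d", r
}')

echo "EKS formula: ${eks_reserved_mib}Mi"
echo "GKE formula: ${gke_reserved_mib}Mi"
```

Under those assumptions, an m5.large (8 GiB) comes out at roughly 574 MiB with 29 max pods or 1465 MiB with 110 max pods using the EKS formula, versus about 1843 MiB using the GKE formula.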
For anyone wanting to solve this, the --kube-reserved kubelet flag will override the default values that bootstrap.sh sets in kubelet-config.json.
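For example, via bootstrap.sh's --kubelet-extra-args (cluster name and values here are purely illustrative, not recommendations):

```bash
# Illustrative only: an explicit --kube-reserved passed to the kubelet takes
# precedence over the kubeReserved block that bootstrap.sh writes to
# kubelet-config.json.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=80m,memory=1843Mi,ephemeral-storage=1Gi'
```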
@abeer91 could you or someone else working on this project please respond?
Hey Steve, apologies for not responding earlier.
We chose the GKE approach earlier but got a lot of feedback that this increased the number of instances needed to run the same workloads. This prompted us to come up with the existing formula to calculate kubeReserved.

Redoing our kubeReserved calculation is something we're looking to do. We've not gotten around to it yet and it does need a lot of testing, since the blast radius of messing that up is quite high. You're right in that there's always been this bug, where we always choose the max pods value from /etc/eks/eni-max-pods.txt for the kubeReserved calculation regardless of USE_MAX_PODS.

I've not heard of a lot of issues with our current kubeReserved numbers being bad - if there are specific workloads you're running where the kubeReserved values (with 110 pods) aren't good, that would be helpful data.
I don't have results handy, but we did try simulating some pod churn on an m5.xlarge and I didn't see the kubelet struggle to keep up. This makes me think the numbers we have in place with the current formula are reasonable.

The upstream number of 110 was chosen a long time ago. I believe the tests were performed on an instance similar to a t3.large, and essentially static stability (based on the number of pods) and a constant rate of pod churn were tested. There have probably been lots of performance improvements made to K8s and the container runtime since then, so I wouldn't rely on 110 as being this magic number beyond which things will start to fail. @shyamjvs probably knows more about the genesis of 110. That's also the reason why we chose 250 when vCPU > 30. Both numbers are somewhat arbitrary.
@suket22 I've been working around this for a while now by setting max pods to 110 and overriding kubeReserved based on the GKE algorithm, and this has made a noticeable improvement in node stability. This is relevant to IPv4 prefix mode as well as IPv6. Are there any plans to add some better defaults? Would you like a PR?
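For reference, the workaround looks roughly like this (flag names as I understand the current bootstrap.sh; the reserved values are placeholders for the GKE-derived numbers, not recommendations):

```bash
# Sketch of the workaround: cap pods at 110 and supply a GKE-style kubeReserved,
# rather than letting bootstrap.sh derive both from eni-max-pods.txt.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110 --kube-reserved=cpu=80m,memory=1843Mi'
```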
Sorry, I hit enter too early on my previous post. We do intend to have better defaults but there's no concrete timeline I can give. It's unlikely that our defaults are going to satisfy all use cases, so I'm not confident you'll be able to get rid of the kubeReserved override.
Adding better inputs to bootstrap.sh to override the defaults would be a good start. |
What happened:
When bootstrapping a new node, the memory reservation calculation uses the values in /etc/eks/eni-max-pods.txt without any option to customize this for the new prefix mode.
This behaviour has always been wrong: there is a USE_MAX_PODS variable to disable using the values in /etc/eks/eni-max-pods.txt as the node max pod limit, but it has no effect on the memory reservation. With the prefix logic the current behaviour isn't even close enough to be safe; e.g. an m5.large instance couldn't have more than 29 pods without prefixes, so a custom max limit would be lower, while with prefixes the recommended max pods would be 110, so the required reserves would need to be significantly higher and this logic would be well off.
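A simplified sketch of the pattern as I understand it (not the actual bootstrap.sh code, and the 11 MiB per pod formula is assumed):

```bash
# Simplified sketch, not the real script: the ENI lookup feeds the memory
# reservation unconditionally, while USE_MAX_PODS only gates the pod limit.
MAX_PODS=$(awk -v t="$INSTANCE_TYPE" '$1 == t {print $2}' /etc/eks/eni-max-pods.txt)
MEMORY_TO_RESERVE_MIB=$((11 * MAX_PODS + 255))   # always written to kubelet-config.json

if [[ "$USE_MAX_PODS" == "true" ]]; then
  KUBELET_EXTRA_ARGS="$KUBELET_EXTRA_ARGS --max-pods=$MAX_PODS"   # only this part is optional
fi
```

What you expected to happen: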
The memory reservation should respect the USE_MAX_PODS variable. A new variable should be added with a new lookup file to support prefix mode. I'd suggest MAX_PODS_MODE, defaulting to eni, with prefix being an option. It would be even better if MAX_PODS could be added to support custom max pods with the correct resource requests.
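Purely as a sketch of the interface I have in mind (none of these variables exist today; they are the proposed additions):

```bash
# Hypothetical only: proposed knobs, not existing bootstrap.sh options.
MAX_PODS_MODE=prefix /etc/eks/bootstrap.sh my-cluster   # pick the prefix-mode lookup file
MAX_PODS=110 /etc/eks/bootstrap.sh my-cluster           # explicit limit, reserves derived from it
```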
How to reproduce it (as minimally and precisely as possible):
n/a
Anything else we need to know?:
I'd be happy to open a PR to fix this if it would actually be reviewed.
Environment:
n/a