Reducing memory allocated in kubeReserved #419
Conversation
/lgtm
I have run into this before when trying to correctly set the memory for kubeReserved. My initial method was much more manual: I looked at a running cluster (all our node sizes were the same) and measured the memory usage of the runtime components (docker + kubelet + containerd, etc.) over a certain period. I then set the memory to that value and enforced it via the enforce-node-allocatable flag (so the cgroup settings were set accordingly). But what I found, and still find now when using your method, is that the podruntime.slice (which is the cgroup I set for kubeReserved) gets OOM killed shortly after start-up the first time. It almost looks like there is a burst of memory usage initially before settling down around the numbers you have above. But every now and then it does spike and cause some OOMs.
Running this in preproduction and nodes are failing due to OOM. Your calculation appears to be off.
@cyrus-mc Thanks for commenting, could you provide us with more information? What memory usages are you observing for the kubelet and Docker when the OOM kill occurs? Also, how many pods are you running on that instance? Note that by specifying the --kube-reserved-cgroup kubelet flag, you're choosing to enforce kube-reserved on system daemons, meaning that when these daemons exceed their resource reservation, they will be OOM killed, causing the node and all pods running on the node to become unavailable. Currently, we choose not to enforce kube-reserved on system daemons. This allows the kubelet and Docker to exceed their resource reservation without being OOM killed. If the memory available on the node drops below the eviction threshold, pods will start to be evicted from the node while the kubelet, container runtime, and other pods running on the node stay alive.
In my case this was a t3.small running (only daemonsets) 7 pods, but this also happened on pretty much every instance type I spun up. For the t3.small case I think the memory is set to 301. I did not capture the output of dmesg to see the breakdown of the memory usage. And yes, it is actually --enforce-node-allocatable that enforces kube-reserved (setting the cgroup option doesn't actually do anything without kube-reserved set in enforce). I stopped enforcing kube-reserved.
The trade-off of enforcing kube-reserved under a cgroup is a higher resource reservation, to keep the system daemons from ever exceeding their reservation and being OOM killed. The approach we have been following is not enforcing kube-reserved as part of --enforce-node-allocatable. This allows the system daemons to temporarily exceed their reservation while also allowing us to set values for kube-reserved closer to the average resource usage rather than the maximum usage. We've followed this approach because it prevents the kubelet and container runtime from being killed when the node has more available resources. I believe most customers prefer a lower, unenforced kube-reserved reservation so they can run more pods on a given worker node, reducing the number of nodes needed to support their workload and lowering costs. I'm waiting for more feedback from the team to determine whether we want to keep following this approach, or whether we want to set a higher reservation that allows kube-reserved to be enforced.
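For readers following along, here is a minimal sketch of the kubelet flag difference being discussed (the reservation values and cgroup name are illustrative, not the AMI's actual settings):

# Reserve without enforcing: kube-reserved is subtracted from node allocatable,
# but the daemons themselves are not confined by a cgroup limit.
kubelet --kube-reserved=memory=603Mi,cpu=80m \
        --enforce-node-allocatable=pods \
        --eviction-hard="memory.available<100Mi"

# Reserve and enforce: daemons in the kube-reserved cgroup are OOM killed
# if they exceed the reservation.
kubelet --kube-reserved=memory=603Mi,cpu=80m \
        --kube-reserved-cgroup=/podruntime.slice \
        --enforce-node-allocatable=pods,kube-reserved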
Hmm, so looking at the log line:
It seems like Go likes to over-allocate a lot when it starts up, a total-vm of 856,872 kB with only 49,600 kB anon-rss. That much allocated on a t3.small with 7 pods sounds like a lot. I wonder if there is some way to limit using
@natherz97 Thanks for running those detailed tests - really nice work! As we were discussing offline earlier, there are a couple of things besides #pods that can affect mem usage:
I won't block this PR on the above, but I would suggest understanding the behavior in the above scenarios so we can be more confident with this change. I understand this is only setting kube-reserved and not enforcing it, but there is still some risk associated with lowering it, because more/bigger pods could be scheduled on the node due to the lower kube-reserved, and that can starve these daemons (ending up in a situation similar to not having kube-reserved at all).
@shyamjvs Thanks for the suggestions. I modified the upstream kubelet_perf test to run with chaoskube killing a random pod in the test namespace every one second instead of two, and to run test pods with two containers instead of one. I ran this test against a cluster with three worker nodes. Results for Kubelet and container runtime memory_rss usage in MiB:
Pod churn: kill a random pod every 2s
Pod churn: kill a random pod every 1s
It appears doubling the pod churn as well as the number of containers running in the e2e test doesn't significantly increase the memory_rss required by Kubernetes system daemons.
Interesting and surprising that doubling pod churn and #containers has almost no effect on the memory usage. Are you reporting average mem usage (across the test run) or peak mem usage? Averaging might hide the effect of this increase. If so, looking at P99 or P90 mem usage might help.
Nathan - An experiment with a single node (for deterministic results) where you change pod churn and #containers (one at a time, not both at once) can help us understand how each of those affects memory usage. Can you post the results of that experiment below, so we can make a more informed choice?
@shyamjvs Here's the requested data. The results suggest we should be accounting for pods with more than one container. I'll collect more data on the resource usage of worker nodes of different instance types that are running pods with three containers.

Results for Kubelet and container runtime memory_rss usage in MiB:

Control
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 223.82 + 89.53 = 313.35 MiB
Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons), maximum rss_memory was 338 MiB.

High Pod Churn
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 215.18 + 106.36 = 321.54 MiB
Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons), maximum rss_memory was 355 MiB.

Increased Containers per Pod
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 493.85 + 143.54 = 637.39 MiB
Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons), maximum rss_memory was 660 MiB.
Thanks for carrying out those experiments @natherz97! Those are some interesting results. So it seems like #containers can significantly affect mem usage, while pod churn doesn't seem to affect it too much. From my experience, almost all pods I've seen running in production have <= 2 containers per pod. Rarely 3 containers and very rarely 4 containers. Not sure if there are any studies giving this data (maybe we can obtain that?), but for a start I think we can assume 2 containers for now, to not eat up too much for kube-reserved. Can you check what the RSS mem usage is with 2 containers?
Tests against container type:
Note that the kubelet and Docker, among other system processes, run under the system.slice cgroup:
$ systemctl status 3407 | grep CGroup
1 container per pod:
2 containers per pod:
3 containers per pod:
Conclusion: The container type used in pods does not have a significant effect on Kubernetes system daemon memory usage. As a result, system daemon memory usage on a worker node is a function of the number of pods and the number of containers per pod.
Modifying the previous experiment for a more realistic workload: note that anyone could arbitrarily increase Kubernetes system daemon memory usage to exceed the memory available on any node by running the maximum number of pods per node with a large number of containers per pod (>= 4). We’re assuming most pod specs will not exceed three containers.
Data:
t3.medium (2 vCPU, 4 GiB, 17 pods): 138.52 + 69.21 = 207.73 MiB
t3.large (2 vCPU, 8 GiB, 35 pods): 240.78 + 51.77 = 292.55 MiB
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 369.96 + 86.16 = 456.12 MiB
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 112.23 + 372.78 = 485.01 MiB
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 1453.93 + 185.07 = 1639.00 MiB
m5.24xlarge (96 vCPU, 384 GiB, Supports 737 pods but ran with 500 pods): 4907.71 + 617.04 = 5524.75 MiB
Formula:
Example KubeReserved Memory Reservations:
I have updated the PR to use this formula.
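As a side note for anyone reproducing these numbers, one way to spot-check system.slice memory usage on a node (assuming cgroup v1, as on Amazon Linux 2 at the time) is something like:

# Confirm which cgroup the container runtime is running under
systemctl status "$(pgrep -o dockerd)" | grep -A3 CGroup
# RSS currently charged to system.slice via the cgroup v1 memory controller
grep -w rss /sys/fs/cgroup/memory/system.slice/memory.stat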
Awesome testing @natherz97! :)
Indeed, very exhaustive testing, Nathan. Good work! /lgtm
@natherz97 Huge thanks for that work! Can you please say when the new AMI will be published? We're spending a lot more with the current setup due to the giant unused headroom per node.
@ksemaev There is a release going out today which doesn't have this commit, but the subsequent release, which should be within a couple of weeks, will have it.
@natherz97 Should the script account for max-pods passed to the kubelet, e.g.
@natherz97 When is it going to be released? I am referring to https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html; the most recent eu-west-1 image for EKS 1.14 does not have that change.
@lkoniecz New AMIs with this change were released yesterday!
@asheldon Good point, I think that's an optimization we could make for workloads with a smaller number of pods but a higher resource usage. That change could especially help on worker nodes which support 234-737 pods but typically run at a lower pod density. The purpose of this PR was to come up with a suitable reservation for most workloads based on the maximum number of pods per node (and the number of containers per pod) to get an upper bound on daemon resource usage. In the future, I agree we should look for optimizations like the one you suggested.
@natherz97 Thanks. Now our instance memory has increased by a significant amount.
@natherz97 Are these values good for Docker version 19.03? I am seeing a lot of "PLEG not healthy" issues with Docker daemon 19.03. This issue was nonexistent in our environment.
Overview:
Related issue: #387
This PR reduces the memory dynamically allocated in kubeReserved. The purpose of kubeReserved is to reserve resources for Kubernetes system daemons that aren’t run as pods, including the kubelet and container runtime. If resources are not reserved, the allocatable resources on worker nodes will include the resources required by these processes, resulting in pods and system daemons competing for the same resources. Note that the resources required by Kubernetes system daemons are a function of pod density on worker nodes. We will continue using our current formula for reserving CPU resources.
For more information about kube-reserved see: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
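For context, the node-allocatable relationship those docs describe is:
Allocatable = NodeCapacity - KubeReserved - SystemReserved - EvictionThreshold
Leaving kubeReserved unset means that memory is counted as allocatable and offered to pods instead.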
Testing:
Setup:
These tests were run against a 1.14 EKS cluster with three worker nodes of the given instance type. We used ami-05d586e6f773f6abf for the worker node AMI which doesn’t set kubeReserved. Since kubeReserved is a function of pod density, we ran the kubelet_perf e2e test (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/kubelet_perf.go) with the maximum number of pods we support per worker node. By default, this upstream test is run with 100 pods per worker node. Our worker nodes support more or less than 100 pods depending on the maximum number of network interfaces and the number of IP addresses per network interface supported by that instance type.
To calculate the number of pods that can run on a given worker node instance type:
MaximumNetworkInterfaces * (IPPerNetworkInterface - 1) + 2
Note that we add two for the aws-node and kube-proxy pods running on all worker nodes.
For more information about IP addresses per network interface per instance type see: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html and https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt
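As a rough illustration of this calculation (the ENI and IP-per-ENI limits below are example values for t3.xlarge; the authoritative numbers come from the EC2 docs and eni-max-pods.txt linked above):

# Example: t3.xlarge supports 4 network interfaces with 15 IPv4 addresses each
max_enis=4
ips_per_eni=15
max_pods=$(( max_enis * (ips_per_eni - 1) + 2 ))
echo "max pods: ${max_pods}"   # 4 * 14 + 2 = 58, matching the t3.xlarge rows below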
While running the kubelet_perf e2e test, we also ran chaoskube to ensure a constant pod churn while the test was executing (https://github.com/linki/chaoskube). We configured chaoskube to kill a random pod every two seconds in the namespace where the kubelet_perf replica set was running.
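A minimal sketch of what that chaoskube setup might look like (the namespace name is illustrative; the flags are chaoskube's documented options):

# Kill a random pod in the test namespace every 2 seconds
chaoskube --interval=2s --namespaces=kubelet-perf-test --no-dry-run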
Data:
To calculate the Kubernetes system daemon memory usage, we added together the RSS memory usage by the kubelet and container runtime.
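The e2e test collects these numbers itself; as a rough way to spot-check similar figures on a running node (process names assume the Docker-based Amazon Linux 2 AMI), something like the following could be used:

# Sum RSS (in KiB) of the kubelet and the container runtime processes
kubelet_kib=$(ps -o rss= -C kubelet | awk '{s+=$1} END {print s}')
runtime_kib=$(ps -o rss= -C dockerd,containerd | awk '{s+=$1} END {print s}')
echo "system daemons: $(( (kubelet_kib + runtime_kib) / 1024 )) MiB"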
Kubernetes system daemon example memory usages on different instance types:
t3.small (2 vCPU, 2 GiB, 11 pods): 150.27 Mi, 176.62 Mi, 156.34 Mi
t3.medium (2 vCPU, 4 GiB, 17 pods): 203.09 Mi, 218.98 Mi, 214.50 Mi
t3.large (2 vCPU, 8 GiB, 35 pods): 247.61 Mi, 226.05 Mi, 221.93 Mi
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 297.76 Mi, 332.58 Mi, 305.94 Mi
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 299.23 Mi, 298.33 Mi, 318.36 Mi
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 931.70 Mi, 934.83 Mi, 949.34 Mi
m4.10xlarge (40 vCPU, 160 GiB, 234 pods): 1057.43 Mi, 1057.81 Mi , 1058.28 Mi
m5.12xlarge (48 vCPU, 192 GiB, 234 pods): 1103.84 Mi, 1098.63 Mi, 1100.50 Mi
r4.8xlarge (32 vCPU, 244 GiB, 234 pods): 1004.25 Mi, 1007.63 Mi, 1028.63 Mi
m5.24xlarge (96 vCPU, 384 GiB, 737 pods): 3837.95 Mi, 3821.56 Mi, 3833.79 Mi
Note that memory usage by Kubernetes system daemons is a function of running pods, not the available memory. For example, all instances which support 234 pods used approximately 1000 Mi for Kubernetes system daemons each, even though the available memory greatly varies between these instance types.
If we plot this data using the number of pods as the x-axis and the Kubernetes system daemon memory usage in MiB as the y-axis:
We can see that pod count and memory usage have a linear relationship. Slightly over-estimating the memory usage for safety, this relationship can be modeled by:
MemoryToReserve = 6 * NumPods + 255
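In bootstrap.sh-style shell, that reservation could be computed roughly as follows (variable names are illustrative, not necessarily those used in the actual change):

# max_num_pods comes from eni-max-pods.txt for the instance type
max_num_pods=58                                     # e.g. t3.xlarge
memory_to_reserve=$(( 6 * max_num_pods + 255 ))     # in MiB
echo "kubeReserved memory: ${memory_to_reserve}Mi"  # 603Mi, matching the examples below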
Example KubeReserved Memory Reservations:
t3.small (2 vCPU, 2 GiB, 11 pods): 321 Mi
t3.medium (2 vCPU, 4 GiB, 17 pods): 357 Mi
t3.large (2 vCPU, 8 GiB, 35 pods): 465 Mi
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 603 Mi
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 603 Mi
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 1659 Mi
m4.10xlarge (40 vCPU, 160 GiB, 234 pods): 1659 Mi
m5.12xlarge (48 vCPU, 192 GiB, 234 pods): 1659 Mi
r4.8xlarge (32 vCPU, 244 GiB, 234 pods): 1659 Mi
m5.24xlarge (96 vCPU, 384 GiB, 737 pods): 4677 Mi