Reducing memory allocated in kubeReserved #419

Merged (1 commit) on Apr 8, 2020

Conversation

natherz97
Contributor

@natherz97 natherz97 commented Feb 18, 2020

Overview:

Related issue: #387

This PR reduces the memory dynamically allocated in kubeReserved. The purpose of kubeReserved is to reserve resources for Kubernetes system daemons that aren’t run as pods, including the kubelet and container runtime. If resources are not reserved, the allocatable resources on worker nodes will include the resources required by these processes, so pods and system daemons end up competing for the same resources. Note that the resources required by Kubernetes system daemons are a function of pod density on worker nodes. We will continue using our current formula for reserving CPU resources.

For more information about kube-reserved see: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

Testing:

Setup:
These tests were run against a 1.14 EKS cluster with three worker nodes of the given instance type. We used ami-05d586e6f773f6abf for the worker node AMI, which doesn’t set kubeReserved. Since kubeReserved is a function of pod density, we ran the kubelet_perf e2e test (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/kubelet_perf.go) with the maximum number of pods we support per worker node. By default, this upstream test runs with 100 pods per worker node. Our worker nodes support more or fewer than 100 pods depending on the maximum number of network interfaces and the number of IP addresses per network interface supported by that instance type.

To calculate the number of pods that can run on a given worker node instance type:
MaximumNetworkInterfaces * (IPPerNetworkInterface - 1) + 2

Note that we add two for the aws-node and kube-proxy pods running on all worker nodes.

For more information about IP addresses per network interface per instance type, see: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html and https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt
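
As a concrete illustration, the formula can be evaluated like this (a minimal sketch; the ENI and IP counts come from the EC2 documentation linked above, and the helper name is hypothetical):

```bash
# Hypothetical helper illustrating the max-pods formula above.
max_pods() {
  local max_enis=$1 ips_per_eni=$2
  # Each ENI keeps one primary IP that is not handed out to pods,
  # and we add 2 for the aws-node and kube-proxy pods.
  echo $(( max_enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 4     # t3.small:  3 ENIs x 4 IPs  -> 11 pods
max_pods 4 15    # t3.xlarge: 4 ENIs x 15 IPs -> 58 pods
```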

While running the kubelet_perf e2e test, we also ran chaoskube (https://github.com/linki/chaoskube) to ensure constant pod churn while the test executed. We configured chaoskube to kill a random pod every two seconds in the namespace where the kubelet_perf replica set was running.
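
For reference, a chaoskube invocation along these lines would look roughly like the following (the namespace name is hypothetical and the exact flags used during the test are not shown in this PR):

```bash
# Illustrative only; see the chaoskube README for the full flag list.
chaoskube --interval=2s --namespaces=kubelet-perf-test --no-dry-run
```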

Data:
To calculate the Kubernetes system daemon memory usage, we added together the RSS memory usage by the kubelet and container runtime.
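
One way to approximate this sum directly on a node (an assumption about tooling, not necessarily how the test collected its numbers):

```bash
# Sum the resident set size (RSS) of the kubelet and container runtime processes, in MiB.
ps -C kubelet,dockerd,containerd -o rss= | awk '{sum += $1} END {printf "%.2f Mi\n", sum / 1024}'
```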

Kubernetes system daemon example memory usages on different instance types:

t3.small (2 vCPU, 2 GiB, 11 pods): 150.27 Mi, 176.62 Mi, 156.34 Mi
t3.medium (2 vCPU, 4 GiB, 17 pods): 203.09 Mi, 218.98 Mi, 214.50 Mi
t3.large (2 vCPU, 8 GiB, 35 pods): 247.61 Mi, 226.05 Mi, 221.93 Mi
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 297.76 Mi, 332.58 Mi, 305.94 Mi
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 299.23 Mi, 298.33 Mi, 318.36 Mi
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 931.70 Mi, 934.83 Mi, 949.34 Mi
m4.10xlarge (40 vCPU, 160 GiB, 234 pods): 1057.43 Mi, 1057.81 Mi , 1058.28 Mi
m5.12xlarge (48 vCPU, 192 GiB, 234 pods): 1103.84 Mi, 1098.63 Mi, 1100.50 Mi
r4.8xlarge (32 vCPU, 244 GiB, 234 pods): 1004.25 Mi, 1007.63 Mi, 1028.63 Mi
m5.24xlarge (96 vCPU, 384 GiB, 737 pods): 3837.95 Mi, 3821.56 Mi, 3833.79 Mi

Note that memory usage by Kubernetes system daemons is a function of the number of running pods, not the available memory. For example, all instances which support 234 pods each used approximately 1000 Mi for Kubernetes system daemons, even though the available memory varies greatly between these instance types.

If we plot this data using the number of pods as the x-axis and the Kubernetes system daemon memory usage in MiB as the y-axis:

[Screenshot: Kubernetes system daemon memory usage (MiB) vs. number of pods]

We can see that pod count and memory usage have a linear relationship. Slightly over-estimating the memory usage for safety, we can model this relationship as:
MemoryToReserve = 6 * NumPods + 255

Example KubeReserved Memory Reservations:

t3.small (2 vCPU, 2 GiB, 11 pods): 321 Mi
t3.medium (2 vCPU, 4 GiB, 17 pods): 357 Mi
t3.large (2 vCPU, 8 GiB, 35 pods): 465 Mi
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 603 Mi
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 603 Mi
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 1659 Mi
m4.10xlarge (40 vCPU, 160 GiB, 234 pods): 1659 Mi
m5.12xlarge (48 vCPU, 192 GiB, 234 pods): 1659 Mi
r4.8xlarge (32 vCPU, 244 GiB, 234 pods): 1659 Mi
m5.24xlarge (96 vCPU, 384 GiB, 737 pods): 4677 Mi
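
These reservations follow directly from the formula above; a quick sketch for checking them:

```bash
# MemoryToReserve = 6 * NumPods + 255 (in Mi)
kube_reserved_mi() { echo $(( 6 * $1 + 255 )); }
kube_reserved_mi 11     # t3.small    -> 321
kube_reserved_mi 234    # m5.4xlarge  -> 1659
kube_reserved_mi 737    # m5.24xlarge -> 4677
```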

@wongma7
Contributor

wongma7 commented Feb 18, 2020

/lgtm

@cyrus-mc

cyrus-mc commented Mar 5, 2020

@natherz97

I have run into this before when trying to correctly set the memory for kubeReserved. My initial method was much more manual: I looked at a running cluster (all our node sizes were the same) and measured the memory usage of the whole runtime (docker + kubelet + containerd, etc.) over a certain period.

I then set the memory to that and enforced that via the enforce-node-allocatable flag (so the cgroup settings were set accordingly).

But what I found, and still find now when using your method, is that the podruntime.slice (which is the cgroup I set for kubeReserved) OOMs shortly after start-up the first time. It almost looks like there is an initial burst of memory usage before it settles down around the numbers you have above. But every now and then it does spike and cause some OOMs.

@cyrus-mc

cyrus-mc commented Mar 5, 2020

@natherz97

[ 146.527166] CPU: 0 PID: 1084 Comm: kubelet Not tainted 5.4.17-200.fc31.x86_64 #1
[ 146.538479] Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
[ 146.545335] Call Trace:
[ 146.550387] dump_stack+0x66/0x90
[ 146.555825] dump_header+0x4a/0x1e2
[ 146.561318] oom_kill_process.cold+0xb/0x10
[ 146.567168] out_of_memory+0x24d/0x4a0
[ 146.572870] mem_cgroup_out_of_memory+0xba/0xd0
[ 146.578821] try_charge+0x76c/0x7f0
[ 146.584357] mem_cgroup_try_charge+0x99/0x1d0
[ 146.590302] __add_to_page_cache_locked+0x25e/0x3f0
[ 146.596489] ? memcg_drain_all_list_lrus+0x1d0/0x1d0
[ 146.602742] add_to_page_cache_lru+0x48/0xc0
[ 146.608593] iomap_readpages_actor+0x117/0x270
[ 146.614607] iomap_apply+0xc3/0x140
[ 146.620240] ? iomap_page_mkwrite_actor+0x70/0x70
[ 146.626388] iomap_readpages+0xa4/0x1a0
[ 146.632221] ? iomap_page_mkwrite_actor+0x70/0x70
[ 146.638916] read_pages+0x6b/0x1b0
[ 146.644390] __do_page_cache_readahead+0x1ba/0x1d0
[ 146.650511] filemap_fault+0x6ce/0xae0
[ 146.656287] ? __mod_lruvec_state+0x3f/0xe0
[ 146.662222] ? page_add_file_rmap+0x86/0x220
[ 146.668167] ? alloc_set_pte+0x123/0x680
[ 146.674106] ? _cond_resched+0x15/0x30
[ 146.679829] __xfs_filemap_fault+0x6d/0x200 [xfs]
[ 146.685969] __do_fault+0x36/0x100
[ 146.691468] __handle_mm_fault+0x101c/0x1590
[ 146.697963] ? __switch_to_asm+0x34/0x70
[ 146.703899] handle_mm_fault+0xc4/0x1f0
[ 146.709754] do_user_addr_fault+0x1f9/0x450
[ 146.715623] do_page_fault+0x31/0x110
[ 146.721436] async_page_fault+0x3e/0x50
[ 146.727500] RIP: 0033:0x3185c3e
[ 146.732889] Code: Bad RIP value.
[ 146.738378] RSP: 002b:000000c000c50f88 EFLAGS: 00010202
[ 146.855318] RAX: 0000000000000000 RBX: 00000000004319bb RCX: 000000c000c50000
[ 146.862466] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c00155e801
[ 146.869544] RBP: 000000c000c50fc8 R08: 0000000000000000 R09: 0000000000000000
[ 146.877085] R10: 0000000000000000 R11: 0000000000000286 R12: 0000002556940046
[ 146.884213] R13: 0000000000000023 R14: 0000000000000000 R15: 000000c000a22b58
[ 146.891738] memory: usage 310272kB, limit 310272kB, failcnt 232004
[ 146.898487] memory+swap: usage 310272kB, limit 9007199254740988kB, failcnt 0
[ 146.905585] kmem: usage 17520kB, limit 9007199254740988kB, failcnt 0
[ 146.912529] Memory cgroup stats for /podruntime.slice:
[ 146.920090] anon 298164224
file 1400832
kernel_stack 3428352
slab 9547776
sock 0
shmem 270336
file_mapped 135168
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 298606592
inactive_file 446464
active_file 122880
unevictable 0
slab_reclaimable 3977216
slab_unreclaimable 5570560
pgfault 461802
pgmajfault 5511
workingset_refault 199089
workingset_activate 67419
workingset_nodereclaim 0
pgrefill 257261
pgscan 645616
pgsteal 369076
pgactivate 183183
pgdeactivate 209445
pglazyfree 15609
pglazyfreed 13332
thp_fault_alloc 0
thp_collapse_alloc 0
[ 147.084739] Tasks state (memory values in pages):
[ 147.090759] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 147.102756] [ 739] 0 739 211710 11601 417792 0 -999 dockerd
[ 147.114110] [ 794] 0 794 208506 4697 290816 0 -500 containerd
[ 147.125688] [ 1108] 0 1108 179851 909 167936 0 -999 containerd-shim
[ 147.137450] [ 1110] 0 1110 180203 970 172032 0 -999 containerd-shim
[ 147.149254] [ 1114] 0 1114 178154 989 172032 0 -999 containerd-shim
[ 147.161084] [ 1116] 0 1116 180203 965 176128 0 -999 containerd-shim
[ 147.172776] [ 1399] 0 1399 179835 1058 167936 0 -999 containerd-shim
[ 147.184515] [ 1518] 0 1518 178138 1080 167936 0 -999 containerd-shim
[ 147.196673] [ 1601] 0 1601 177786 937 159744 0 -999 containerd-shim
[ 147.208462] [ 1675] 0 1675 177786 1146 172032 0 -999 containerd-shim
[ 147.220363] [ 1943] 0 1943 175721 1054 155648 0 -999 containerd-shim
[ 147.232344] [ 1945] 0 1945 175657 1033 155648 0 -999 containerd-shim
[ 147.244214] [ 1956] 0 1956 157550 1933 163840 0 -999 runc
[ 147.256007] [ 1963] 0 1963 180433 1939 184320 0 -999 runc
[ 147.267842] [ 1997] 0 1997 176345 8294 315392 0 -999 exe
[ 147.279089] [ 1998] 0 1998 195130 8036 323584 0 -999 exe
[ 147.290410] [ 2000] 0 2000 175657 1082 159744 0 -999 containerd-shim
[ 147.302153] [ 2008] 0 2008 157902 1882 155648 0 -999 runc
[ 147.313439] [ 2035] 0 2035 139815 7895 282624 0 -999 exe
[ 147.324715] [ 2046] 0 2046 21926 2417 118784 0 -999 exe
[ 147.335970] [ 2052] 0 2052 21926 2301 110592 0 -999 exe
[ 147.347265] [ 989] 0 989 214218 12400 475136 0 -999 kubelet
[ 147.358790] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/podruntime.slice,task_memcg=/podruntime.slice/kubelet.service,task=kubelet,pid=989,uid=0
[ 147.377885] Memory cgroup out of memory: Killed process 989 (kubelet) total-vm:856872kB, anon-rss:49600kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:464kB oom_score_adj:-999
[ 147.401644] oom_reaper: reaped process 989 (kubelet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 147.436439] audit: type=1325 audit(1583448728.949:176): table=filter family=2 entries=31
[ 147.458054] audit: type=1325 audit(1583448728.964:177): table=nat family=2 entries=435
[ 147.487555] audit: type=1325 audit(1583448729.000:178): table=filter family=2 entries=31
[ 147.503396] audit: type=1325 audit(1583448729.015:179): table=nat family=2 entries=430
[ 147.754026] exe invoked oom-killer: gfp_mask=0xc40(GFP_NOFS), order=0, oom_score_adj=-999
[ 147.765432] CPU: 0 PID: 2052 Comm: exe Not tainted 5.4.17-200.fc31.x86_64 #1
[ 147.772575] Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
[ 147.779307] Call Trace:
[ 147.784632] dump_stack+0x66/0x90
[ 147.790075] dump_header+0x4a/0x1e2
[ 147.795669] oom_kill_process.cold+0xb/0x10
[ 147.801523] out_of_memory+0x24d/0x4a0
[ 147.807332] mem_cgroup_out_of_memory+0xba/0xd0
[ 147.813637] try_charge+0x76c/0x7f0
[ 147.819505] mem_cgroup_try_charge+0x99/0x1d0
[ 147.825793] __add_to_page_cache_locked+0x25e/0x3f0
[ 147.832629] ? memcg_drain_all_list_lrus+0x1d0/0x1d0
[ 147.839322] add_to_page_cache_lru+0x48/0xc0
[ 147.845439] iomap_readpages_actor+0x117/0x270
[ 147.851891] iomap_apply+0xc3/0x140
[ 147.858326] ? iomap_page_mkwrite_actor+0x70/0x70
[ 147.865307] iomap_readpages+0xa4/0x1a0
[ 147.871279] ? iomap_page_mkwrite_actor+0x70/0x70
[ 147.878356] read_pages+0x6b/0x1b0
[ 147.884526] __do_page_cache_readahead+0x1ba/0x1d0
[ 147.891188] filemap_fault+0x6ce/0xae0
[ 147.897345] ? try_to_wake_up+0x218/0x670
[ 147.903571] ? __mod_lruvec_state+0x3f/0xe0
[ 147.909887] ? page_add_file_rmap+0x86/0x220
[ 147.916572] ? alloc_set_pte+0x37a/0x680
[ 147.922744] ? _cond_resched+0x15/0x30
[ 147.928930] __xfs_filemap_fault+0x6d/0x200 [xfs]
[ 147.935458] __do_fault+0x36/0x100
[ 147.941785] __handle_mm_fault+0x101c/0x1590
[ 147.947637] ? __switch_to_asm+0x40/0x70
[ 147.953344] handle_mm_fault+0xc4/0x1f0
[ 147.959043] do_user_addr_fault+0x1f9/0x450
[ 147.965058] ? schedule+0x39/0xa0
[ 147.970558] do_page_fault+0x31/0x110
[ 147.976312] async_page_fault+0x3e/0x50
[ 147.982026] RIP: 0033:0x55d4f3688024
[ 147.987559] Code: Bad RIP value.
[ 147.992995] RSP: 002b:00007ffc83a56ab8 EFLAGS: 00010212
[ 147.999524] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 000055d4f36b45b3
[ 148.006972] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 000000c00005e848
[ 148.014376] RBP: 00007ffc83a56ac8 R08: 0000000000000000 R09: 0000000000000000
[ 148.021638] R10: 0000000000000000 R11: 0000000000000202 R12: 000000c0006f7770
[ 148.029305] R13: 0000000000000001 R14: 000055d4f5052ed0 R15: 0000000000000000
[ 148.036595] memory: usage 310272kB, limit 310272kB, failcnt 242307
[ 148.043403] memory+swap: usage 310272kB, limit 9007199254740988kB, failcnt 0
[ 148.050599] kmem: usage 16208kB, limit 9007199254740988kB, failcnt 0
[ 148.057337] Memory cgroup stats for /podruntime.slice:
[ 148.058954] anon 300027904
file 1146880
kernel_stack 3354624
slab 8724480
sock 0
shmem 270336
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 300363776
inactive_file 569344
active_file 0
unevictable 0
slab_reclaimable 3567616
slab_unreclaimable 5156864
pgfault 478335
pgmajfault 6072
workingset_refault 222783
workingset_activate 68772
workingset_nodereclaim 264
pgrefill 269363
pgscan 727624
pgsteal 392893
pgactivate 192093
pgdeactivate 219744
pglazyfree 15609
pglazyfreed 13332
thp_fault_alloc 0
thp_collapse_alloc 0
[ 148.337938] Tasks state (memory values in pages):
[ 148.344041] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 148.355613] [ 739] 0 739 213759 11615 421888 0 -999 dockerd
[ 148.367200] [ 794] 0 794 208506 4698 290816 0 -500 containerd
[ 148.378778] [ 1108] 0 1108 179851 909 167936 0 -999 containerd-shim
[ 148.390527] [ 1110] 0 1110 180203 970 172032 0 -999 containerd-shim
[ 148.402966] [ 1114] 0 1114 178154 989 172032 0 -999 containerd-shim
[ 148.415051] [ 1116] 0 1116 180203 965 176128 0 -999 containerd-shim
[ 148.426867] [ 1399] 0 1399 179835 1058 167936 0 -999 containerd-shim
[ 148.438541] [ 1518] 0 1518 178138 1080 167936 0 -999 containerd-shim
[ 148.450405] [ 1601] 0 1601 177786 937 159744 0 -999 containerd-shim
[ 148.462223] [ 1675] 0 1675 177786 1146 172032 0 -999 containerd-shim
[ 148.474129] [ 1943] 0 1943 175721 1054 155648 0 -999 containerd-shim
[ 148.485753] [ 1945] 0 1945 175657 1033 155648 0 -999 containerd-shim
[ 148.497860] [ 1956] 0 1956 157550 1994 163840 0 -999 runc
[ 148.509901] [ 1963] 0 1963 180433 1827 184320 0 -999 runc
[ 148.521920] [ 1997] 0 1997 176345 8294 315392 0 -999 exe
[ 148.533567] [ 1998] 0 1998 195130 8036 323584 0 -999 exe
[ 148.544973] [ 2000] 0 2000 175657 1082 159744 0 -999 containerd-shim
[ 148.557118] [ 2008] 0 2008 157902 1779 155648 0 -999 runc
[ 148.568521] [ 2035] 0 2035 158264 8258 311296 0 -999 exe
[ 148.579998] [ 2046] 0 2046 102533 7626 241664 0 -999 exe
[ 148.591302] [ 2052] 0 2052 102533 7807 253952 0 -999 exe
[ 148.602657] [ 2131] 0 2131 21926 2589 110592 0 -999 exe
[ 148.614065] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/podruntime.slice,task_memcg=/podruntime.slice/docker.service,task=exe,pid=2131,uid=0
[ 148.633089] Memory cgroup out of memory: Killed process 2131 (exe) total-vm:87704kB, anon-rss:8988kB, file-rss:1368kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:-999
[ 148.654143] oom_reaper: reaped process 2131 (exe), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@cyrus-mc

cyrus-mc commented Mar 7, 2020

Running this in preproduction and nodes are failing due to OOM. Your calculation appears to be off.

@natherz97
Contributor Author

@cyrus-mc Thanks for commenting, could you provide us with more information? What memory usages are you observing for the kubelet and Docker when the OOM kill occurs? Also how many pods are you running on that instance?

Note that by specifying the --kube-reserved-cgroup kubelet flag, you're choosing to enforce kube-reserved on system daemons, meaning that when these daemons exceed their resource reservation they will be OOM killed, causing the node and all pods running on the node to become unavailable.

Currently, we choose not to enforce kube-reserved on system daemons. This allows the kubelet and Docker to exceed their resource reservation without being OOM killed. If the memory available on the node drops below the eviction threshold, pods will start to be evicted from the node while the kubelet, container runtime, and the remaining pods on the node stay alive.
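
To make the distinction concrete, here is an illustrative sketch of the two setups as kubelet flags (the values and the cgroup path are examples only, not the AMI's actual bootstrap output):

```bash
# Reservation without enforcement: "kube-reserved" is absent from --enforce-node-allocatable,
# so daemons can temporarily exceed the reservation without being OOM killed.
kubelet --kube-reserved=memory=603Mi \
        --eviction-hard=memory.available<100Mi \
        --enforce-node-allocatable=pods ...

# Enforced reservation: daemons are confined to a cgroup and OOM killed if they exceed it.
kubelet --kube-reserved=memory=603Mi \
        --kube-reserved-cgroup=/podruntime.slice \
        --enforce-node-allocatable=pods,kube-reserved ...
```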

@cyrus-mc

@natherz97

In my case this was a t3.small running 7 pods (only DaemonSets), but this also happened on pretty much every instance type I spun up.

For the t3.small case I think the memory is set to 301. I did not capture the output of dmesg to see the breakdown of the memory usage.

And yes, it is actually --enforce-node-allocatable that enforces kube-reserved (setting the cgroup option doesn't actually do anything without kube-reserved listed in --enforce-node-allocatable). I have stopped enforcing kube-reserved.

@natherz97
Contributor Author

The trade-off of enforcing kube-reserved under a cgroup is a higher resource reservation, to prevent the system daemons from ever exceeding their reservation and being OOM killed. The approach we have been following is not enforcing kube-reserved as part of --enforce-node-allocatable. This allows the system daemons to temporarily exceed their reservation, while also allowing us to set values for kube-reserved closer to their average resource usage rather than their maximum usage.

We've followed this approach because it prevents the kubelet and container runtime from being killed while the node still has available resources. I believe most customers prefer a lower, unenforced kube-reserved reservation so they can run more pods on a given worker node, reducing the number of nodes needed to support their workload and lowering costs. I'm waiting for more feedback from the team to determine whether we want to keep following this approach, or whether we want to set a higher reservation that allows kube-reserved to be enforced.

@mogren

mogren commented Mar 27, 2020

Hmm, so looking at the log line:

[ 147.377885] Memory cgroup out of memory: Killed process 989 (kubelet) total-vm:856872kB, anon-rss:49600kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:464kB oom_score_adj:-999

It seems like Go likes to over-allocate a lot when it starts up: a total-vm of 856,872 kB with only 49,600 kB anon-rss. That much allocated on a t3.small with 7 pods sounds like a lot.

I wonder if there is some way to limit using memory.high instead of memory.max for the kubelet...
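
(For illustration of that idea, a soft limit could in principle be applied to the kubelet's unit with something like the following; the value is arbitrary, memory.high requires cgroup v2, and this is not something this PR does.)

```bash
systemctl set-property kubelet.service MemoryHigh=400M
```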

@shyamjvs
Contributor

@natherz97 Thanks for running those detailed tests - really nice work! As we were discussing offline earlier, there are a couple of things besides #pods that can affect mem usage:

  • Total #containers - each pod can have multiple containers, and kubelet/docker have some resource overhead per container. In light of this, can you clarify how many containers the pods you were testing with had? If it was just one, can you run a quick test with two instead and verify that mem usage doesn't vary greatly?
  • Pod churn - this can significantly alter resource usage too, and a standard that sig-scalability works with is a per-cluster pod churn of 20 pods/s. Assuming 10 nodes in the cluster, that's about 2 pods/s of churn per node. Such churn is not uncommon when creating/deleting a deployment with a large number of replicas. Again, the ask here is: can you try doubling the churn and verify how mem usage varies?

I won't block this PR on the above, but would suggest understanding the behavior in the above scenarios so we can be more confident in this change. I understand this is only setting kube-reserved and not enforcing it, but there is still some risk associated with lowering it, because more/bigger pods could be scheduled on the node as a result, and that can starve these daemons (ending up in a situation similar to not having kube-reserved at all).

@natherz97
Contributor Author

@shyamjvs Thanks for the suggestions. I modified the upstream kubelet_perf test to run with chaoskube killing a random pod in the test namespace every one second instead of two and to run test pods with two containers instead of one. I ran this test against a cluster with three worker nodes.

Results for Kubelet and container runtime memory_rss usage in MiB:

Pod churn: kill a random pod every 2s
Total number of containers: running test pods with 1 container
m5.large(2 vCPU, 8 GiB, 29 pods): 246.88 MiB, 262.56 MiB, 227.92 MiB

Pod churn: kill a random pod every 1s
Total number of containers: running test pods with 2 containers
m5.large(2 vCPU, 8 GiB, 29 pods): 251.27 MiB, 246.83 MiB, 265.27 MiB

It appears that doubling the pod churn as well as the number of containers running in the e2e test doesn't significantly increase the memory_rss required by Kubernetes system daemons.

@shyamjvs
Contributor

Interesting and surprising that doubling pod churn and #containers has almost no effect on the memory usage. Are you reporting average mem usage (across the test run) or peak mem usage? Averaging might hide the effect of this increase; if so, looking at P99 or P90 mem usage might help.

@shyamjvs
Contributor

Nathan - an experiment with a single node (for deterministic results) where you change pod churn and #containers (one at a time, not both at once) would help us understand how each affects memory usage. Can you post the results of that experiment below, so we can make a more informed choice?

@natherz97
Contributor Author

@shyamjvs Here's the requested data. The results suggest we should be accounting for pods with more than one container. I'll collect more data on the resource usage of worker nodes of different instance types that are running pods with three containers.

Results for Kubelet and container runtime memory_rss usage in MiB:
Running tests on a cluster with one t3.xlarge worker node.

Control
Pod churn: no churn
Total number of containers: running test pods with 1 container

Resource usage on node "ip-192-168-234-95.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       0.125      884.07                 666.20
"runtime" 0.039      405.00                 223.82
"kubelet" 0.068      88.65                  89.53

t3.xlarge (4 vCPU, 16 GiB, 58 pods): 223.82 + 89.53 = 313.35 MiB

Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons). Maximum rss_memory was 338 MiB.
[Screenshot: system.slice rss_memory over time, control run]

High Pod Churn
Pod churn: kill a random pod every 3s
Total number of containers: running test pods with 1 container

Resource usage on node "ip-192-168-234-95.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       1.137      1165.28                921.56
"runtime" 0.624      416.74                 215.18
"kubelet" 0.316      109.04                 106.36

t3.xlarge (4 vCPU, 16 GiB, 58 pods): 215.18 + 106.36 = 321.54 MiB

Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons). Maximum rss_memory was 355 MiB.
[Screenshot: system.slice rss_memory over time, high pod churn run]

Increased Containers per Pod
Pod churn: no churn
Total number of containers: running test pods with 5 containers

Resource usage on node "ip-192-168-234-95.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       0.219      1565.58                1325.04
"runtime" 0.061      805.74                 493.85
"kubelet" 0.132      161.08                 143.54

t3.xlarge (4 vCPU, 16 GiB, 58 pods): 493.85 + 143.54 = 637.39 MiB

Viewing memory usage under system.slice (includes KubeReserved and SystemReserved daemons). Maximum rss_memory was 660 MiB.

[Screenshot: system.slice rss_memory over time, 5 containers per pod run]

@shyamjvs
Contributor

shyamjvs commented Apr 6, 2020

Thanks for carrying out those experiments @natherz97! Those are some interesting results. So it seems like #containers can significantly affect mem usage, while pod churn doesn't seem to matter as much.

From my experience, almost all pods I've seen running in production have <= 2 containers per pod. Rarely 3 containers, and very rarely 4 containers. Not sure if there are any studies giving this data (maybe we can obtain that?), but for a start I think we can assume 2 containers for now, so as not to eat up too much for kube-reserved. Can you check what the RSS mem usage is with 2 containers?

@natherz97
Contributor Author

Tests against container type:
Using one t3.xlarge worker node:

Note that the kubelet and Docker, among other system processes, run under the system.slice cgroup:
$ systemctl status 4065 | grep CGroup
CGroup: /system.slice/kubelet.service

$ systemctl status 3407 | grep CGroup
CGroup: /system.slice/docker.service
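
One way to read the slice's resident memory on a cgroup v1 node (an assumption about how these numbers could be collected, not necessarily the method used):

```bash
# total_rss for everything under system.slice, reported in MiB.
awk '/^total_rss /{printf "%.0f MiB\n", $2 / 1024 / 1024}' /sys/fs/cgroup/memory/system.slice/memory.stat
```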

1 container per pod:
Running 50 pods with couchbase container: (system.slice: 425 MiB, kubepods: 8413 MiB)
Running 50 pods with nginx container: (system.slice: 412 MiB, kubepods: 540 MiB)
Running 50 pods with traefik container: (system.slice: 413 MiB, kubepods: 871 MiB)
Running 50 pods with zookeeper container: (system.slice: 429 MiB, kubepods: 3498 MiB)

2 containers per pod:
Running 50 pods with nginx and zookeeper containers: (system.slice: 478 MiB, kubepods: 3614 MiB)
Running 50 pods with nginx and couchbase containers: (system.slice: 486 MiB, kubepods: 8753 MiB)

3 containers per pod:
Running 50 pods with nginx, couchbase, and zookeeper containers: (system.slice: 540 MiB, kubepods: 11444 MiB)

Conclusion: The container type used in pods does not have a significant effect on Kubernetes system daemon memory usage. As a result, system daemon memory usage on a worker node is a function of the number of pods and the number of containers per pod.

Modifying the previous experiment for a more realistic workload:
Running the kubelet_perf e2e test on a cluster with one worker node, with the test modified to run the maximum number of pods that the instance type supports. We’ll also modify the pod spec to use three containers per pod instead of one.

Note that anyone could arbitrarily increase Kubernetes system daemon memory usage beyond the memory available on any node by running the maximum number of pods per node with a large number of containers per pod (>= 4). We’re assuming most pod specs will not exceed three containers.

Data:

t3.medium (2 vCPU, 4 GiB, 17 pods): 138.52 + 69.21 = 207.73 MiB

Resource usage on node "ip-192-168-81-39.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"kubelet" 0.036      98.22                  69.21
"/"       0.078      743.07                 511.57
"runtime" 0.022      282.64                 138.52

t3.large (2 vCPU, 8 GiB, 35 pods): 240.78 + 51.77 = 292.55 MiB

Resource usage on node "ip-192-168-134-34.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       0.102      583.54                 340.38
"runtime" 0.037      415.26                 240.78
"kubelet" 0.063      88.51                  51.77

t3.xlarge (4 vCPU, 16 GiB, 58 pods): 369.96 + 86.16 = 456.12 MiB

Resource usage on node "ip-192-168-164-214.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       0.157      742.51                 511.59
"runtime" 0.060      599.84                 369.96
"kubelet" 0.104      120.38                 86.16

t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 112.23 + 372.78 = 485.01 MiB

Resource usage on node "ip-192-168-154-29.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"runtime" 0.042      641.54                 372.78
"kubelet" 0.094      123.40                 112.23
"/"       0.140      784.74                 549.65

m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 1453.93 + 185.07 = 1639.00 MiB

Resource usage on node "ip-192-168-206-227.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       0.558      1943.33                1744.25
"runtime" 0.162      2096.04                1453.93
"kubelet" 0.343      273.49                 185.07

m5.24xlarge (96 vCPU, 384 GiB, Supports 737 pods but ran with 500 pods): 4907.71 + 617.04 = 5524.75 MiB

Resource usage on node "ip-192-168-255-127.us-west-2.compute.internal":
container cpu(cores) memory_working_set(MB) memory_rss(MB)
"/"       1.569      5641.97                5758.40
"runtime" 0.323      6361.22                4907.71
"kubelet" 1.153      495.03                 617.04

Graph:
[Screenshot: system daemon memory usage (MiB) vs. number of pods, 3 containers per pod]

Formula:
MemoryToReserve = 11 * NumPods + 255

Example KubeReservedMemory Reservations:
t3.medium (2 vCPU, 4 GiB, 17 pods): 442 MiB
t3.large (2 vCPU, 8 GiB, 35 pods): 640 MiB
t3.xlarge (4 vCPU, 16 GiB, 58 pods): 893 MiB
t3.2xlarge (8 vCPU, 32 GiB, 58 pods): 893 MiB
m5.4xlarge (16 vCPU, 64 GiB, 234 pods): 2829 MiB
m5.24xlarge (96 vCPU, 384 GiB, 737 pods): 8362 MiB

I have updated the PR to use this formula.
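
For reference, a minimal sketch of how this could be wired into a bootstrap script (the eni-max-pods.txt path and lookup are assumptions; the merged bootstrap.sh may differ in its details):

```bash
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
# eni-max-pods.txt has two columns: instance type, max pods (path assumed here).
MAX_PODS=$(awk -v t="$INSTANCE_TYPE" '$1 == t {print $2}' /etc/eks/eni-max-pods.txt)
KUBE_RESERVED_MEMORY_MI=$(( 11 * MAX_PODS + 255 ))
# e.g. t3.xlarge: 11 * 58 + 255 = 893, passed to the kubelet via
#   --kube-reserved=memory=${KUBE_RESERVED_MEMORY_MI}Mi,...
```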


@mogren mogren left a comment


Awesome testing @natherz97! :)

@shyamjvs
Contributor

shyamjvs commented Apr 7, 2020

Indeed, very exhaustive testing Nathan. Good work!

/lgtm

@natherz97 natherz97 merged commit b56a25d into awslabs:master Apr 8, 2020
@ksemaev

ksemaev commented Apr 8, 2020

@natherz97 huge thanks for that work! Can you please tell us when the new AMI will be published? We're spending a lot more with the current setup due to the giant unused headroom per node.

@natherz97
Contributor Author

@ksemaev There is a release going out today which doesn't have this commit, but the subsequent release, which should be within a couple of weeks, will have it.

shabir61 added a commit to OpenGov/amazon-eks-ami that referenced this pull request Apr 20, 2020
@asheldon

asheldon commented Apr 23, 2020

@natherz97 Should the script account for max-pods passed to the kubelet, e.g. --kubelet-extra-args '--max-pods=20'? Pod density is lower with custom networking, so the kubelet reserved resources could be lower as well.

@lkoniecz

@natherz97 When is this going to be released? I am referring to https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html; the most recent eu-west-1 image for EKS 1.14 does not have this change.

@natherz97
Contributor Author

@lkoniecz New AMIs with this change were released yesterday!

@natherz97
Contributor Author

@asheldon Good point, I think that's an optimization we could make for workloads with a smaller number of pods but a higher resource usage. That change could especially help on worker nodes which support 234 - 737 pods, but typically run at a lower pod density. The purpose of this PR was to come up with a suitable reservation for most workloads based on the maximum number of pods per node (and the number of containers per pod) to get an upper bound on daemon resource usage. In the future, I agree we should look for optimizations like the one you suggested.

@manvinderr21

manvinderr21 commented May 1, 2020

@natherz97 Thanks, the allocatable memory on our instances has now increased by a significant amount.

sakomws pushed a commit to OpenGov/amazon-eks-ami that referenced this pull request May 13, 2020
@cshivashankar

@natherz97 Are these values adequate for Docker version 19.03? I am seeing a lot of "PLEG is not healthy" issues with Docker daemon 19.03; this issue was previously nonexistent in our environment.
Nodes had more system resources reserved before this PR. I am just wondering whether the Docker daemon is getting overwhelmed and whether allocating more resources would help.
