Node becomes NotReady #79

Closed
agcooke opened this issue Oct 31, 2018 · 79 comments

@agcooke
Contributor

agcooke commented Oct 31, 2018

We are running EKS in Ireland and our nodes are going unhealthy regularly.

It is not possible to SSH to the host, and pods are not reachable. We have experienced this with t2.xlarge, t2.small, and t3.medium instances.

We could SSH to another node in the cluster and ping the NotReady node, but we were not able to SSH to it either.

Graphs show memory usage going high at about the same time as the journalctl logs below. The EBS I/O also goes high. The exact time is hard to pinpoint. I added logs with interesting 'failures' from around the time we think the node disappeared.

We are using the cluster for running tests, so pods are getting created and destroyed often.

We have not done anything described in #51 for log rotation.
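
For reference, a minimal sketch of the kind of Docker log rotation #51 talks about, assuming the default json-file log driver (the size and file-count values here are only examples, not the settings from that issue):

# write a daemon.json that caps json-file container logs
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "10" }
}
EOF
# existing containers keep their old logging config until they are recreated
sudo systemctl restart docker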

Cluster Information:
CNI: Latest daemonset with image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1

LOGS

**Node AMI**

AMI ID amazon-eks-node-v24 (ami-0c7a4976cb6fafd3a)

**File system**

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1,9G     0  1,9G   0% /dev
tmpfs           1,9G     0  1,9G   0% /dev/shm
tmpfs           1,9G  2,3M  1,9G   1% /run
tmpfs           1,9G     0  1,9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   64G   40G   25G  62% /
tmpfs           389M     0  389M   0% /run/user/1000

**kubectl describe node**

Name: ip-<secret>.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1b
kubernetes.io/hostname=ip-<secret>.eu-west-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 30 Oct 2018 11:25:48 +0100
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Wed, 31 Oct 2018 10:56:53 +0100 Tue, 30 Oct 2018 11:25:46 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: <secret>
Hostname: ip-<secret>..eu-west-1.compute.internal
Capacity:
cpu: 2
ephemeral-storage: 67096556Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3980344Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 61836185908
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3877944Ki
pods: 17
System Info:
Machine ID: asdf
System UUID: asdf
Boot ID: asdf
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ProviderID: aws:///eu-west-1b/i-<secret>
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system aws-node-hshhg 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-fkrb8 100m (5%) 0 (0%) 0 (0%) 0 (0%)
monitoring datadog-datadog-bk5bd 200m (10%) 200m (10%) 256Mi (6%) 256Mi (6%)
monitoring prometheus-node-exporter-4z2dg 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t1 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t3 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t5 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t6 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t7 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t8 100m (5%) 0 (0%) 256Mi (6%) 0 (0%)
t9 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t10 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1160m (57%) 950m (47%)
memory 2012Mi (53%) 1756Mi (46%)
Events: <none>

journalctl logs around the time

okt 31 10:01:29 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent: page allocation stalls for 10404ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:01:30 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent cpuset=1ef2c300b3981b045f3f2fcab050f674afead7e7c828362ec2d40ef82bf02441 mems_allowed=0
okt 31 10:01:31 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 6267 Comm: aws-k8s-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:01:34 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:01:36 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:01:38 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:01:39 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:01:41 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:01:42 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:01:45 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:01:46 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:01:47 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:01:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:01:52 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:01:55 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:01:57 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:00 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:02 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:07 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:09 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 0001:0x1f
okt 31 10:02:12 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0000:000000c420170f58 EFLAGS: 4d32dce245d7
okt 31 10:02:15 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:16 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895836 inactive_anon:8314 isolated_anon:0
active_file:413 inactive_file:596 isolated_file:0
unevictable:0 dirty:1 writeback:0 unstable:0
slab_reclaimable:17241 slab_unreclaimable:26888
mapped:22510 shmem:28069 pagetables:7173 bounce:0
free:21650 free_pcp:12 free_cma:0
okt 31 10:02:17 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 active_anon:3583344kB inactive_anon:33256kB active_file:1652kB inactive_file:2384kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:90040kB dirty:4kB writeback:0kB shmem:112276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 16384kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
okt 31 10:02:19 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA free:15620kB min:268kB low:332kB high:396kB active_anon:288kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
okt 31 10:02:22 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 2951 3849 3849
okt 31 10:02:24 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32 free:54972kB min:51600kB low:64500kB high:77400kB active_anon:2799324kB inactive_anon:31696kB active_file:556kB inactive_file:816kB unevictable:0kB writepending:0kB present:3129320kB managed:3044324kB mlocked:0kB kernel_stack:7968kB pagetables:19844kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
okt 31 10:02:27 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 898 898
okt 31 10:02:30 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal free:15408kB min:15708kB low:19632kB high:23556kB active_anon:783732kB inactive_anon:1560kB active_file:884kB inactive_file:1392kB unevictable:0kB writepending:4kB present:987136kB managed:920112kB mlocked:0kB kernel_stack:4304kB pagetables:8848kB bounce:0kB free_pcp:420kB local_pcp:0kB free_cma:0kB
okt 31 10:02:32 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 0 0
okt 31 10:02:34 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 2*32kB (UM) 3*64kB (UM) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (ME) = 15620kB
okt 31 10:02:36 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32: 1659*4kB (UME) 1496*8kB (UME) 1181*16kB (UME) 446*32kB (UME) 54*64kB (UME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55356kB
okt 31 10:02:37 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal: 334*4kB (UMEH) 351*8kB (UMEH) 431*16kB (UMEH) 93*32kB (UMEH) 4*64kB (H) 2*128kB (H) 4*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15552kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 28769 total pagecache pages
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages in swap cache
okt 31 10:02:42 ip-<secret>.eu-west-1.compute.internal kernel: Swap cache stats: add 0, delete 0, find 0/0
okt 31 10:02:43 ip-<secret>.eu-west-1.compute.internal kernel: Free swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal systemd-journal[26209]: Permanent journal is using 392.0M (max allowed 4.0G, trying to leave 4.0G free of 38.6G available → current limit 4.0G).
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: Total swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: 1033112 pages RAM
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages HighMem/MovableOnly
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 38026 pages reserved
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent: page allocation stalls for 10580ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent cpuset=67b33ad9edc4663ce3e97ac968df4726a9beeff073706349383b1e9eabd93125 mems_allowed=0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 7452 Comm: process-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x34/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x40/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 00f1:0x11
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0234:00007f768c4e5d38 EFLAGS: 00000000
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895956 inactive_anon:8314 isolated_anon:0
active_file:15 inactive_file:24 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:17237 slab_unreclaimable:26796
mapped:21743 shmem:28069 pagetables:7196 bounce:0
free:21560 free_pcp:682 free_cma:0

plugin logs

2018-10-31T10:50:15Z [INFO] Starting CNI Plugin v1.2.1  ...
2018-10-31T10:50:15Z [INFO] Received CNI del request: ContainerID(56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=monitoring;K8S_POD_NAME=datadog-datadog-bk5bd;K8S_POD_INFRA_CONTAINER_ID=56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Path(/opt/aws-cni/bin:/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2018-10-31T10:50:15Z [ERROR] Error received from DelNetwork grpc call for pod datadog-datadog-bk5bd namespace monitoring container 56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

ipamd.log

2018-10-31T10:05:43Z [DEBUG] Found ip addresses [10.0.1.72 10.0.1.208 10.0.1.36 10.0.1.86 10.0.1.219 10.0.1.63] on eni 02:af:21:3c:f9:4e
2018-10-31T10:05:44Z [DEBUG] Found eni mac address : 02:b3:1a:eb:c3:5e
2018-10-31T10:05:52Z [DEBUG] Using device number 0 for primary eni: eni-0f37efb5e4ebecf09
2018-10-31T10:05:52Z [DEBUG] Found eni: eni-0f37efb5e4ebecf09, mac 02:b3:1a:eb:c3:5e, device 0
2018-10-31T10:05:55Z [DEBUG] Found cidr 10.0.1.0/24 for eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Found ip addresses [10.0.1.143 10.0.1.96 10.0.1.65 10.0.1.209 10.0.1.134 10.0.1.8] on eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Reconcile existing ENI eni-0ce38d7ac411b07ab IP pool
2018-10-31T10:05:59Z [DEBUG] Reconcile and skip primary IP 10.0.1.117 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.53 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.53 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.102 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.102 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.120 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.120 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.42 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.42 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.59 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.59 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f1db76fd54b2e3f5 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.72 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.208 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.208 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.36 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.36 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.86 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.86 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.219 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.219 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.63 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.63 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f37efb5e4ebecf09 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.143 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.96 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.96 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.65 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.65 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.209 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.209 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.134 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.134 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.8 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.8 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Successfully Reconciled ENI/IP pool
2018-10-31T10:06:06Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:06Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:06Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:11Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:12Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:13Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:17Z [INFO]  Pods deleted on my node: t1
2018-10-31T10:06:18Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:19Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:19Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:25Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:25Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:26Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:33Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:34Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:35Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:35Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:07:19Z [INFO]  Pods deleted on my node: t2
2018-10-31T10:09:29Z [INFO]  Pods deleted on my node: t3
2018-10-31T10:14:30Z [ERROR] Failed to retrieve interfaces data from instance metadata RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:37Z [ERROR] ip pool reconcile: Failed to get attached eni infoget attached enis: failed to retrieve interfaces data: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:47Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:14:49Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:14:51Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:14:52Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:14:59Z [INFO]  Pods deleted on my node: t4
2018-10-31T10:15:43Z [DEBUG] Total number of interfaces found: 3 
2018-10-31T10:15:44Z [DEBUG] Found eni mac address : 02:63:a9:60:fc:42
@agcooke
Contributor Author

agcooke commented Nov 6, 2018

Any ideas on this?

@max-rocket-internet
Contributor

All those weird log messages about memory in your journalctl look a lot like OOM messages (but they don't specifically mention the OOM killer being invoked). And the Call Trace lines look like a kernel bug.

What does free -m say on these nodes?

@radityasurya

radityasurya commented Nov 9, 2018

The issue is happening again; here are the logs from when the node status became NotReady.

kubectl describe node

kubectl describe node ip-<secret>.eu-west-1.compute.internal
Name:               ip-<secret>.eu-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.medium
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-1
                    failure-domain.beta.kubernetes.io/zone=eu-west-1b
                    kubernetes.io/hostname=ip-<secret>.eu-west-1.compute.internal
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Tue, 30 Oct 2018 11:09:42 +0100
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Fri, 09 Nov 2018 13:15:13 +0100   Fri, 09 Nov 2018 13:15:56 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Fri, 09 Nov 2018 13:15:13 +0100   Fri, 09 Nov 2018 13:15:56 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 09 Nov 2018 13:15:13 +0100   Fri, 09 Nov 2018 13:15:56 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Fri, 09 Nov 2018 13:15:13 +0100   Tue, 30 Oct 2018 11:09:41 +0100   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Fri, 09 Nov 2018 13:15:13 +0100   Fri, 09 Nov 2018 13:15:56 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
Addresses:
  InternalIP:  <secret>
  Hostname:    ip-<secret>.eu-west-1.compute.internal
Capacity:
 cpu:                2
 ephemeral-storage:  67096556Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3980344Ki
 pods:               17
Allocatable:
 cpu:                2
 ephemeral-storage:  61836185908
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3877944Ki
 pods:               17
System Info:
 Machine ID:                       ec2d77ff3bd31c1a42446de3e9ddd79e
 System UUID:                      EC2D77FF-3BD3-1C1A-4244-6DE3E9DDD79E
 Boot ID:                          da26a6a5-3a46-4405-8511-7bad8ecbbf3f
 Kernel Version:                   4.14.62-70.117.amzn2.x86_64
 OS Image:                         Amazon Linux 2
 Operating System:                 linux
 Architecture:                     amd64
 Container Runtime Version:        docker://17.6.2
 Kubelet Version:                  v1.10.3
 Kube-Proxy Version:               v1.10.3
ExternalID:                        i-<secret>
ProviderID:                        aws:///eu-west-1b/i-<secret>
Non-terminated Pods:               (11 in total)
  Namespace                        Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                        ----                                                               ------------  ----------  ---------------  -------------
  kube-system                      aws-node-2bp7x                                                     10m (0%)      0 (0%)      0 (0%)           0 (0%)
  kube-system                      kube-proxy-k4c68                                                   100m (5%)     0 (0%)      0 (0%)           0 (0%)
  monitoring                       datadog-datadog-474bp                                              200m (10%)    200m (10%)  256Mi (6%)       256Mi (6%)
  monitoring                       prometheus-node-exporter-8n285                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  monitoring                       prometheus-server-5997957687-wlc4w                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  t-1                              t-1-5559cc964d-w9hlt                                               250m (12%)    250m (12%)  700Mi (18%)      700Mi (18%)
  t-2                              t-2-697fdf56d8-6xrzx                                               250m (12%)    250m (12%)  700Mi (18%)      700Mi (18%)
  t-3                              t-3-697896bf47-7fwmh                                               250m (12%)    250m (12%)  700Mi (18%)      700Mi (18%)
  t-4                              t-4-65c7d884b-66b46                                                250m (12%)    250m (12%)  700Mi (18%)      700Mi (18%)
  t-5                              t-5-66f49688cd-nktgl                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  tr-6                             tr-6-7bf488684b-7ct4r                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits   Memory Requests  Memory Limits
  ------------  ----------   ---------------  -------------
  1310m (65%)   1200m (60%)  3056Mi (80%)     3056Mi (80%)
Events:         <none>

Journalctl

Oct 30 13:19:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:19:10 ip-10-0-1-121.eu-west-1.compute.internal dhclient[2676]: XMT: Solicit on eth0, interval 108540ms.
Oct 30 13:19:16 ip-10-0-1-121.eu-west-1.compute.internal kubelet[3629]: E1030 13:19:16.255590    3629 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-10-0-1-121.eu-west-1.compute.internal": Unauthorized
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 25993 (agent)
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:19:17 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 25993 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (agent)
Oct 30 13:20:47 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:20:59 ip-10-0-1-121.eu-west-1.compute.internal dhclient[2676]: XMT: Solicit on eth0, interval 130410ms.
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 29006 (agent)
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: proc-sys-fs-binfmt_misc.automount's automount point already active?
Oct 30 13:21:02 ip-10-0-1-121.eu-west-1.compute.internal systemd[1]: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 29006 (agent)

ipamd.log

2018-11-09T12:35:08Z [DEBUG] IP Address Pool stats: total: 15, assigned: 1
2018-11-09T12:35:08Z [DEBUG] Reconciled IP 10.0.1.94 on eni eni-0c6011f1b1a05f13b
2018-11-09T12:35:08Z [DEBUG] Successfully Reconciled ENI/IP pool
2018-11-09T12:35:11Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-cypress-2115-xmmkr
2018-11-09T12:35:12Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-mobile-2115-75jr5
2018-11-09T12:35:12Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-functional-fast-2115-wv5gt
2018-11-09T12:35:13Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-functional-slow-2115-rztn4
2018-11-09T12:35:13Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:13Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:13Z [DEBUG] No ENI can be deleted at this time
2018-11-09T12:35:13Z [ERROR] Failed to decrease pool free ENI: no ENI can be deleted at this time
2018-11-09T12:35:13Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-nose-unit-2115-npbdg
2018-11-09T12:35:14Z [INFO]  Pods deleted on my node: t-feature-upate-local-tags/t-feature-upate-local-tags-pytest-2115-2wjj9
2018-11-09T12:35:15Z [INFO]  Add/Update for Pod prometheus-server-5997957687-wlc4w on my node, namespace = monitoring, IP = 10.0.1.136
2018-11-09T12:35:15Z [INFO]  Pods deleted on my node: monitoring/prometheus-server-5997957687-wlc4w
2018-11-09T12:35:18Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:18Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:18Z [DEBUG] No ENI can be deleted at this time
2018-11-09T12:35:18Z [ERROR] Failed to decrease pool free ENI: no ENI can be deleted at this time
2018-11-09T12:35:23Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:23Z [DEBUG] IP pool stats: total = 15, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:23Z [DEBUG] FreeENI: found a deletable ENI eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [INFO] FreeENI eni-0abc9a4f1747dbc0e: IP address pool stats: free 5 addresses, total: 10, assigned: 1
2018-11-09T12:35:23Z [DEBUG] Start freeing eni eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [INFO] Trying to free eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:23Z [DEBUG] Found eni eni-0abc9a4f1747dbc0e attachment id: eni-attach-0e57c7548fe9c7da3 
2018-11-09T12:35:23Z [DEBUG] Trying to delete eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:24Z [DEBUG] Not able to delete eni yet (attempt 1/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
	status code: 400, request id: 843583a7-3416-4d7b-b5fd-1393421a5f04 
2018-11-09T12:35:29Z [DEBUG] Not able to delete eni yet (attempt 2/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
	status code: 400, request id: 1501e10c-2026-4377-8d66-14630c3da533 
2018-11-09T12:35:34Z [DEBUG] Not able to delete eni yet (attempt 3/20): InvalidParameterValue: Network interface 'eni-0abc9a4f1747dbc0e' is currently in use.
	status code: 400, request id: 009a38d0-2824-4bb2-b707-6a31cece470f 
2018-11-09T12:35:39Z [INFO] Successfully deleted eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:39Z [INFO] Successfully freed eni: eni-0abc9a4f1747dbc0e
2018-11-09T12:35:39Z [DEBUG] Successfully decreased IP Pool
2018-11-09T12:35:39Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:44Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:44Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:49Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-11-09T12:35:49Z [DEBUG] IP pool stats: total = 10, used = 1, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5

@agcooke
Contributor Author

agcooke commented Nov 14, 2018

So far we are getting no help from AWS, either via support or here.

Issues:

  • My guess is that we have an old version of Docker and an older kernel; we are seeing a kernel bug above.
  • The image does not have log rotation configured for Docker by default, causing node disks to fill up with log data.
  • The image has a default ulimit, which means that when we use larger instances we run out of file descriptors.

Thus the AMI is not ready for production use! We will have to fork it like the 90 other people who are doing this.

We will probably use the Lumoslabs or AdvMicrogrid AMI as a base.

Very poor support from AWS :( They told us that the image is set up for default workloads. We are not doing anything special... and we lose two nodes a day.

@micahhausler
Member

@agcooke thanks for this report, and sorry for the poor support :( We'll take a look into this

@micahhausler
Member

micahhausler commented Nov 14, 2018

@agcooke Can you email me your Support case ID and Account ID? I'll see what I can track down on my end. [email protected]

@agcooke
Contributor Author

agcooke commented Nov 15, 2018

Thanks @micahhausler, I will reach out to you by email, and when we figure out the solution we will let people know.

One thing I did not mention above is that we are using 64GB GP2 EBS volumes on the node.

It's very hand-wavy, but it seems that the node that runs Prometheus (which uses a gp2 PVC) flaps to NotReady and back every few days. We will investigate using provisioned-IOPS volumes for the instance root disk and the PVCs to see if that helps resolve the issue. Lumoslabs also use an ext4 filesystem and, I think, a separate volume for Docker; that could help with the I/O issues.

But I do not believe that IO is the root cause... We should not be seeing kernel level stack traces.

@joberdick

I have seen this on 2 clusters in the last few weeks

@gflarity

We were seeing this frequently. Our current theory is that it's related to running out of space on the root filesystem, which caused the kubelet to die silently.

@agcooke
Contributor Author

agcooke commented Nov 16, 2018

@gflarity Great, we will see if that is the cause. We are now running the v25 AMI, which should rotate logs better, so the disks should not fill up as much.

We were having these issues on t2.large and t3.medium nodes.

We had a talk with SAs from AWS and they suggested that we try c5.large or m4.large nodes.

  • We tried c5.large nodes yesterday; nodes went NotReady almost immediately.
  • We are busy testing m4.large nodes; they boot OK, and I will let you know how it progresses over the next few days.

Summary of the issues we are seeing:

  • Running out of file descriptors on nodes (a quick way to check and raise the limits is sketched after this list).
    • When we get more than ~20 pods per node we run into problems quickly.
  • I/O goes high on the node.
    • Could be connected to the network and/or the type of EBS volumes / EC2 instance types.
    • A solution could be to use a separate volume for Docker images.
  • Running out of IP addresses:
    • We create and delete pods constantly (mostly jobs, ~50 pods per hour, not a lot) and then the CNI plugin runs out of IPs for the node. We are running the latest 1.2.1 Amazon VPC CNI DaemonSet.

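A quick way to check and raise the file-descriptor limits mentioned in the first bullet above; this is a rough sketch only, and the limit value and drop-in path are assumptions rather than something validated on the EKS AMI:

# what dockerd is actually allowed to open right now
cat /proc/$(pidof dockerd)/limits | grep 'open files'

# raise the limit for the docker service with a systemd drop-in
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/limits.conf
[Service]
LimitNOFILE=1048576
EOF
sudo systemctl daemon-reload && sudo systemctl restart docker
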
Will keep you up to date with the progress.

@agcooke
Contributor Author

agcooke commented Nov 16, 2018

@joberdick If you can add some logs to help triage the root cause it would be great!

@gflarity

gflarity commented Nov 16, 2018

@agcooke I don't think the rotation fix is in v25; it was only merged two days ago. I might be wrong, but when I looked at the dates yesterday, v25 was put out about 8 days ago.

When you log into the nodes, is the kubelet silently crashing (i.e. disappearing)? We had this happen a couple of days ago and the root fs was full, so we think the kubelet tried to log before it crashed and couldn't. It was pretty clear from the syslogs that the FS was full.

@max-rocket-internet
Contributor

What does kubectl get events say? I think you should see a DiskPressure event before the node dies if disk space is the problem?
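
For reference, a couple of read-only commands that should surface those signals (the node name is a placeholder):

kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
kubectl describe node <node-name> | grep -A 10 'Conditions:'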

@gflarity

gflarity commented Nov 16, 2018

@max-rocket-internet
(We believe lack of free space is the root cause of our previous NotReady issues; for this particular one we're fairly certain of it.)

2018-11-16 08:38:04,403245 INFO [1] [region_selectable.go:16] targeting region us-west-2
LAST SEEN   FIRST SEEN   COUNT   NAME                                                            KIND   SUBOBJECT   TYPE      REASON                SOURCE                                                  MESSAGE
15m         1h           14     <redacted>  Node               Warning   FreeDiskSpaceFailed   kubelet, ip-10-<redacted> (combined from similar events): failed to garbage collect required amount of images. Wanted to free 3592427929 bytes, but freed 0 bytes
17s         1h           16      <redacted>   Node               Warning   ImageGCFailed         kubelet, <redacted>   (combined from similar events): failed to garbage collect required amount of images. Wanted to free 3865643417 bytes, but freed 0 bytes

@gflarity

gflarity commented Nov 16, 2018

The pattern we see is that there is no logging indicating why the kubelet dies... not having space to log is a good explanation for that, and probably the root cause. Unfortunately our disk-usage tracking wasn't set up properly, so we can't go back and confirm this.

@joberdick

@gflarity

I don't get any events when doing kubectl get events

kubectl describe node

ip-192-168-145-119.ec2.internal
Name: ip-192-168-145-119.ec2.internal
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t2.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
kubernetes.io/hostname=ip-192-168-145-119.ec2.internal
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Fri, 02 Nov 2018 22:39:57 -0400
Taints:
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message


OutOfDisk Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Tue, 13 Nov 2018 17:26:36 -0500 Fri, 02 Nov 2018 22:39:57 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Tue, 13 Nov 2018 17:26:36 -0500 Tue, 13 Nov 2018 17:27:20 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.145.119
Hostname: ip-192-168-145-119.ec2.internal
Capacity:
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-2Mi: 0
memory: 4040024Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 19316009748
hugepages-2Mi: 0
memory: 3937624Ki
pods: 17
System Info:
Machine ID: c1c0616b96eb44c3b7b0d094dcd9536e
System UUID: EC2CEF1E-2CFC-16F6-F994-9C24FC7754E2
Boot ID: 38401493-f278-4779-8186-f09cf3b39f5d
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ExternalID: i-0857bdc11bf82764b
ProviderID: aws:///us-east-1b/i-0857bdc11bf82764b
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits


default sleep-8677f78569-6gvt5 0 (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-citadel-557f97f46d-dqbbd 10m (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-sidecar-injector-8d65456b6-6bqn7 10m (0%) 0 (0%) 0 (0%) 0 (0%)
istio-system istio-telemetry-58cf9fbbfb-prwgd 20m (1%) 0 (0%) 0 (0%) 0 (0%)
istio-system prometheus-7fcd5846db-89lph 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system aws-node-rkm2l 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-dns-64b69465b4-9ldd7 260m (13%) 0 (0%) 110Mi (2%) 170Mi (4%)
kube-system kube-proxy-jtlqm 100m (5%) 0 (0%) 0 (0%) 0 (0%)
kube-system tiller-deploy-778f674bf5-r7c4b 0 (0%) 0 (0%) 0 (0%) 0 (0%)
teama leads-api-75cdc7b454-cjwmt 110m (5%) 1 (50%) 10Mi (0%) 10Mi (0%)
teama loans-ui-847df86c67-m27s6 110m (5%) 1 (50%) 10Mi (0%) 10Mi (0%)
teamb sleep-8677f78569-j982c 10m (0%) 10m (0%) 10Mi (0%) 10Mi (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits


650m (32%) 2010m (100%) 140Mi (3%) 200Mi (5%)
Events:
PS C:\Temp\repo\EKSPOCv3> kubectl get node
NAME STATUS ROLES AGE VERSION
ip-192-168-145-119.ec2.internal NotReady 12d v1.10.3
ip-192-168-195-239.ec2.internal NotReady 12d v1.10.3
ip-192-168-87-144.ec2.internal NotReady 6d v1.10.3

Unfortunately, this environment does not have a key pair set up for EC2, so the info I will be able to gather will be limited.

Any suggestions on what else to check without being on the host? I am going to try autoscaling after adding a key pair, but I was avoiding touching anything until I hear from AWS support.

@gflarity

@joberdick

/shrug there might be multiple causes of this...

Can you ssh into the node? That's how I first diagnosed our issue.

@agcooke
Contributor Author

agcooke commented Nov 22, 2018

We have been running m4.large and are not seeing the issues anymore.

I think the symptoms are hidden by faster network throughput.

@gacopl

gacopl commented Nov 29, 2018

Running m4.xlarge and now nodes are dying like crazy... before it was one a week, now it's already 3 today. Of course there is no possibility to SSH, and the Amazon status checks say the node is OK (sic!).

@nonsense

@gacopl We are seeing similar behaviour on m4.xlarge as well. Our guess is that https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/ might help, but this is not available on Kubernetes 1.10.3 on AWS. Waiting patiently for Kubernetes 1.11 and 1.12 on AWS EKS :/

@gacopl

gacopl commented Nov 29, 2018

Keep waiting. We started around September and we are still on EKS platform version 1; they did not even upgrade us to version 2. Choosing EKS was the worst decision. The worst part is that pods do not get rescheduled when a node is NotReady; we will have to set up special health checks for the ASG...

@radityasurya

@gacopl is it helpful to add additional healthchecks even if the node is still unstable?

@agcooke
Contributor Author

agcooke commented Nov 29, 2018

For the last two days we have also been seeing the m4.large nodes dying all the time...

We are resorting to using another AMI, but do not have results for that yet.

[graph: nodes]

We had relative stability and now things are bad again.

@gacopl

gacopl commented Nov 29, 2018

@radityasurya What else do you advise, since only terminating the node gets pods rescheduled, and basic ASG checks do not see that the node is unreachable?

@agcooke which AMI specifically ?

@agcooke
Contributor Author

agcooke commented Nov 29, 2018

We are using amazon-eks-node-v25 (ami-00c3b2d35bddd4f5c), and I have started working on using the https://github.com/lumoslabs/amazon-eks-ami AMI.

If anyone knows how to use a CoreOS AMI or an Ubuntu (conjure-up) AMI, that would be great...

@gacopl

gacopl commented Nov 29, 2018

@agcooke So v25 does not solve it, right? I was about to upgrade from v23.

@agcooke
Contributor Author

agcooke commented Nov 29, 2018

No, just found https://cloud-images.ubuntu.com/docs/aws/eks/

@eswarbala

We are working on an AMI release for the first week of December with the log-rotation patch included. That should fix the issues related to the disk filling up with logs.

@gacopl

gacopl commented Nov 29, 2018

@eswarbala It's not only logs; the fd limits are also too low on bigger instances.

@kinghajj

@agcooke

We have an Ubuntu 18.04 AMI forked from this one. Been using it for a prototype cluster, and haven't noticed this issue.

@rakeshpatri

rakeshpatri commented Mar 18, 2019

I have been seeing the same issue. We are on t2.medium and one node goes "NotReady" sporadically. The only solution I had was to detach it and spawn another node; a restart does not work, sigh. Any interim solution for this? @agcooke / others, is your fork available somewhere? Clearly AWS has no time for this one! 😠

We have set up Cluster Autoscaler and have set resource limits on all the deployments in Kubernetes. Also use an HPA for any deployment that is consuming more resources. Since applying these changes we have not faced this issue anymore.
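
For anyone else trying this, a minimal sketch with plain kubectl; the deployment name and all the numbers are made up for illustration:

# an HPA scales on CPU relative to the request, so requests/limits should be set first
kubectl set resources deployment my-app --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi

# scale between 2 and 5 replicas, targeting roughly 70% of the CPU request
kubectl autoscale deployment my-app --min=2 --max=5 --cpu-percent=70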

@tckb

tckb commented Mar 19, 2019

@agcooke Were the changes released? We are currently running amazon-eks-node-1.11-v20190220 and unfortunately the issue remains.

@whereisaaron

@tckb I've seen it suggested that some of these changes will be in the 1.11.8 AMIs, which haven't been released yet but should be quite soon.

@benjamin658

Our EKS cluster (1.11) with AMI ami-0f54a2f7d2e9c88b3 is facing the same issue randomly, and it kills my production services many times per day.

I was wondering if upgrading the EKS cluster to 1.12 and using the latest AMI ami-0923e4b35a30a5f53 could solve this problem (following these steps: https://docs.aws.amazon.com/eks/latest/userguide/update-stack.html).

@pod2metra

Same issue on Server Version v1.12.6-eks-d69f1b and AMI ami-0abcb9f9190e867ab.

@avielb

avielb commented Apr 26, 2019

same issue here.
running with eks 1.12 and latest AMIs on us-east-1

@benjamin658

It seems to be caused by an "Out of Memory" condition on the kubelet host.
After adding the BootstrapArguments to the CloudFormation template, the NotReady state is no longer happening.


Here is my BootstrapArguments:

--kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
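
For anyone wiring this up outside the CloudFormation parameter, the same string can be passed to the AMI's bootstrap script from the instance user data; the cluster name below is a placeholder:

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%'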

@tckb

tckb commented Apr 29, 2019

@benjamin658 / others, can you confirm this? I did not see any such errors in the logs.

@benjamin658

> @benjamin658 / others, can you confirm this? I did not see any such errors in the logs.

I'm not 100 percent sure, but after I added the BootstrapArguments, our cluster has been working well.

@dijeesh

dijeesh commented May 1, 2019

Having the same issue.

EKS v1.12.6-eks-d69f1b
AMI ami-0abcb9f9190e867ab

@tckb

tckb commented May 2, 2019

@dijeesh Did you try the suggestions from @benjamin658?

@AmazingTurtle

AmazingTurtle commented May 15, 2019

I'm experiencing the same issues. The problems started when I installed gitlab-runner using Helm and spawned ~20 jobs in an hour or so.

Nodes running v1.12.7 on AMI ami-0d741ed58ca5b342e

I have Weave Scope installed in my cluster, and when looking at Hosts/Resources I see many containers named svc-0 (which are coming from GitLab). They are Docker containers that should have been deleted (and somehow they are, because when I search for them on the nodes using the Docker CLI they are gone; Kubernetes provides no further information either). That might be a Weave Scope bug, but if not, this might be a hint about the node NotReady issues.

Edit: I ran into a CNI issue as well (network addresses per host exhausted, hitting the pod limit, aka "insufficient pods");
see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI
In my particular case I was using t3.small instances (3 interfaces x 4 IPs each = 12 addresses, of which 11 are assignable). This might also be a cause of a node status changing to NotReady.
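
For reference, the pod limit EKS derives from those ENI numbers is normally ENIs * (IPs per ENI - 1) + 2 (each ENI's primary IP is not handed out to pods, and the +2 covers the host-network aws-node and kube-proxy pods), so for a t3.small:

echo $(( 3 * (4 - 1) + 2 ))   # -> 11, the default max-pods for a t3.small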

@whereisaaron

I thought reserving resources for kubelet was default/built-in behavior of current k8s, but sounds like it is optional and EKS doesn’t do it 😢

Reserved resources for the kubelet are extremely important when you run overcommitted workloads (collections of spiky workloads), i.e. any time resource Limits >= Requests or you don't specify resource limits. Under node resource exhaustion you want some workloads to be rescheduled, not entire nodes to go down.

If you are using small nodes, failures like this will be more common. Plus you have the low EKS pod limit caused by ENI limitations. I’d suggest reserving some system resources on each node, and use fewer, larger nodes.

eksctl-io/eksctl#795

@bobbui

bobbui commented Jul 5, 2019

This still happens on EKS 1.13. It started to happen when the cluster was running under really high load.

@montanaflynn

Happening to me as well, looking at kubectl get node <name> -o=yaml I see taints:

spec:
  providerID: aws:///us-east-1a/i-07b8613b0ed988d73
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: 2019-07-30T07:11:15Z

@montanaflynn

I think this might be related? eksctl-io/eksctl#795

@JamesDowning

We are seeing similar behaviour; it appears to be almost random, or possibly coincides with a deployment. A node or two will suddenly appear to be NotReady, and resource graphs indicate utilisation is hardly over 50%, so OOM shouldn't be an issue.

As mentioned by @AmazingTurtle, we are also on 4-5 t3.small nodes with around 50 pods, so we may be seeing the effects of exhausted network addresses despite not seeing those logs.

In line with @montanaflynn the node has the following taints suddenly applied:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  MemoryPressure   Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  Ready            Unknown   Fri, 29 Nov 2019 14:18:03 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusUnknown        Kubelet stopped posting node status.
  OutOfDisk        Unknown   Fri, 29 Nov 2019 14:03:02 +0000   Fri, 29 Nov 2019 14:18:48 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.

Kubernetes version: 1.13
Platform version: eks.6

I'm going to try increasing node size and adding some resource limits to deployments that may not have them correctly configured.

@AmazingTurtle

@JamesDowning
Unless you're considering running production deployments on Azure, you should go with any other service. Even self-hosted would be better than AWS EKS IMHO. It's just expensive as hell.

Have you had a look at Fargate? Using the CDK you can bootstrap apps easily without worrying about infrastructure. It just runs. I gave it a shot a couple of days ago and it works the way I want it to, unlike EKS.

@jontro

jontro commented May 20, 2020

I'm getting this on a t3.small node.

spec:
  providerID: aws:///eu-west-1b/i-<redacted>
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:36:58Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2020-05-20T11:37:03Z"

What is adding these taints and will they ever get removed?

@brsolomon-deloitte

Seeing this on Amazon EKS 1.26.

@seyal84

seyal84 commented Jun 23, 2023

What was the resolution to this issue, as it still persists on EKS v1.23?

@dimittal

We are facing this issue also on a daily basis. Any resolution for this?

@cartermckinnon
Member

@dimittal this can happen for many reasons, please open a new issue with details of your environment and symptoms.

@awslabs awslabs locked as off-topic and limited conversation to collaborators Jul 20, 2023