Unstable Kubernetes v1.6.6 cluster created with Kops 1.6.2 #2928

Closed · itskingori opened this issue Jul 13, 2017 · 15 comments

@itskingori (Member)

Versions of kops:

$ kops version
Version 1.6.2

Version of kubernetes:

$ kubectl version | grep "Server"
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Problem

I have used kops + kubernetes since 1.4.7 without issues. Now I'm struggling with an unstable cluster, and the behaviours it exhibits simply do not make sense.

The first issue

I'm experiencing cases where a node is terminated and replaced. Sometimes a master, sometimes a minion (but mostly a minion). We don't have a fix for this, nor do we know why the nodes are being rotated.

If a master is rotated we get a lot of pods in Unknown status and everything just goes berserk across the cluster.

These issues may be related:

The second issue

I'm experiencing cases where pods stay locked in a state of transition. By state of transition I mean that if a pod was Terminating, it stays stuck at that ... and if a pod was ContainerCreating, it stays stuck at that:

$ kubectl get pods -o wide --namespace=$ENVIRONMENT --no-headers
grafana-3323403255-kzg2w                      1/2       CrashLoopBackOff    8         1h
grafana-mysql-0                               0/1       ContainerCreating   0         3m

I noticed that pods that are stuck in this way have dead containers. When we have dead containers, /var/log/syslog is full of these:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}
Jul  8 09:22:04 ip-10-83-59-150 dockerd[1358]: time="2017-07-08T09:22:04.992595490Z" level=error msg="Handler for GET /v1.24/containers/331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f/json returned error: open /var/lib/docker/overlay/69353e3da8b9d11a16ee77343d8aa0208ca33bb5d66b300bfb9ea9e997e6d1ea/lower-id: no such file or directory"
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992798   13520 remote_runtime.go:273] ContainerStatus "331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992827   13520 kuberuntime_container.go:385] ContainerStatus for 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f error: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992840   13520 kuberuntime_manager.go:858] getPodContainerStatuses for pod "grafana-2597417898-prpwf_sandbox(fe4927e6-6345-11e7-8ca3-0e5eca502a9e)" failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992858   13520 generic.go:239] PLEG: Ignoring events for pod grafana-2597417898-prpwf/sandbox: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f

If I clean them (the dead containers) out with the command below ... the stuck pods are able to proceed (and /var/log/syslog is then free of the aforementioned errors).

$ docker rm $(docker ps -a | grep "Dead" | awk '{print $1}')

To get by I've created a cronjob to do this for me every minute ...

root@ip-xx-xx-xxx-xxx:/# cat ./root/scripts/docker-cleanup.sh
#!/bin/bash
docker rm $(docker ps -a | grep "Dead" | awk '{print $1}') &>/dev/null
true

$ crontab -l
*/1 * * * * /root/scripts/docker-cleanup.sh
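
As an aside, the same cleanup could lean on Docker's built-in status filter instead of grepping the table output; a minimal sketch, assuming a Docker version that understands --filter status=dead:

#!/bin/bash
# Collect the IDs of containers Docker reports as dead, and remove them if there are any.
dead=$(docker ps -aq --filter status=dead)
[ -n "$dead" ] && docker rm $dead &>/dev/null
true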

These issue may be related:

Way Forward

This is happening very often ... and I'm willing to help get to the bottom of it. I'm unsure what I need to provide to give more context, so just let me know what to check and which logs to supply.

@itskingori (Member Author)

@bboreham I'm wondering if you could weigh in on the second issue. I'm inclined to think that weave is somehow involved here because of the errors we're getting in syslog from kubelet i.e. CNI errors.

And before you ask, we're using weave 1.9.4.
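
(In case it's useful, a quick way to confirm the weave version is to read the image tags off the daemonset; this assumes it's deployed as the weave-net daemonset in kube-system, which is what kops set up for us:)

$ kubectl --namespace=kube-system get daemonset weave-net \
    -o jsonpath='{.spec.template.spec.containers[*].image}'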

@kaazoo commented Jul 14, 2017

I'm using kops 1.6.2 and upgraded a cluster from k8s 1.6.2 to 1.6.7. The cluster is using Calico.
No issues so far.

@bboreham (Contributor)

I'm inclined to think that weave is somehow involved here because of the errors we're getting in syslog from kubelet i.e. CNI errors.

I see exactly one error mentioning CNI:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}

This is a message from kubelet basically saying the container process was dead when it tried to check on it. And all the other messages are along the same lines. Looks to me like your Docker is very unhappy, but I have no idea why.

If there are other messages you wanted me to look at please clarify.

@itskingori (Member Author)

This is no longer an issue for me ... I believe the steps I've taken in #2982 (comment) have alleviated the problem.

@itskingori (Member Author)

This, unfortunately, is still an issue. 😰

Re-opening.

@itskingori reopened this Jul 25, 2017
@chrislovecnm (Contributor)

Did the changes with kubelet help at all?

@itskingori (Member Author)

@chrislovecnm to clarify:

  1. Giving resources headroom via requests and limits made a very big difference in cluster stability (see the sketch after this list).
  2. I have not yet applied the flags ... I created headroom by over-provisioning requests so that the cluster always has excess CPU and memory. I still intend to set them ... I've just been distracted by the current instability issues.
  3. Every now and then I lose a node because it fails instance checks (on the AWS side) and the ASG replaces it.
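
As a rough sketch of what I mean by giving headroom (the deployment name and the numbers are illustrative, not what we actually run), requests and limits can be set straight from kubectl:

$ kubectl set resources deployment grafana --namespace=$ENVIRONMENT \
    --requests=cpu=200m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi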

To summarise ... these are the issues I'm trying to solve (I don't know if they are related to my cluster's instability):

  1. kubernetes#45626 - k8s reports pod as "Terminated: Error" with "Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container"
  2. moby/moby#5618 (comment) - kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3"
  3. Random node termination (as described above).

At the moment ... I'm investigating a kernel panic and trying to set up kernel dumps using kdump.
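
(Roughly what that setup looks like, assuming a Debian-based node image where the kdump-tools package is available; exact package names and defaults may differ on other images:)

$ sudo apt-get update && sudo apt-get install -y kdump-tools crash
$ sudo sed -i 's/USE_KDUMP=0/USE_KDUMP=1/' /etc/default/kdump-tools
# add crashkernel=256M to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
$ sudo update-grub && sudo reboot
# after the reboot, confirm the crash kernel is loaded and kdump is armed
$ sudo kdump-config show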

@itskingori (Member Author)

@chrislovecnm another reason I haven't used the flags is that I'm still on kops 1.6.2, and those features will probably land in kops 1.7.x.

@itskingori (Member Author)

@chrislovecnm reporting back ... using an AMI with kernel 4.4.78 has solved no. 2 and no. 3 listed in #2928 (comment).

Changed AMI on sandbox cluster yesterday early morning. You can see I used to get 1 termination per day on average. No more node terminations👇

[screenshot: 2017-07-28 at 16:03:49]

And the same with the "unregister_netdevice: waiting for lo to become free" issue ... 👇

[screenshot: 2017-07-28 at 16:07:44]
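
For anyone wanting to try the same thing: switching the node image boils down to editing the instance group spec and rolling the nodes. A minimal sketch, assuming KOPS_STATE_STORE is exported (the image value is a placeholder, not the exact AMI):

$ kops edit ig nodes --name=$CLUSTER_NAME
# in the editor, point spec.image at the new AMI, e.g. image: ami-xxxxxxxx
$ kops update cluster $CLUSTER_NAME --yes
$ kops rolling-update cluster $CLUSTER_NAME --yes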

@chrislovecnm (Contributor)

Headroom is still not fixed?

@itskingori (Member Author)

@chrislovecnm I'll be doing that next week. Have been busy investigating/addressing cluster instability issues. Will keep you posted.

@3h4x commented Oct 13, 2017

I have experienced a similar issue:
kops 1.6.2
k8s 1.6.6

We don't yet know why the containers were dead. What should we look at? Our cluster has plenty of unused CPU and memory.

@chrislovecnm (Contributor)

You need to be using kops 1.7.1 with k8s 1.6.6 - please report back on how it works!
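
(Roughly, after installing the newer kops binary, that's just a matter of letting kops reconcile and roll the cluster; the cluster name and state store below are placeholders:)

$ kops version    # confirm you're on 1.7.1
$ export KOPS_STATE_STORE=s3://your-state-store
$ kops update cluster $CLUSTER_NAME --yes
$ kops rolling-update cluster $CLUSTER_NAME --yes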

@3h4x commented Oct 14, 2017

@chrislovecnm Thanks for the tip. Is it somewhere in the kops docs?

@itskingori (Member Author)

Closing this because a solution was found, even though it's not quite explainable.
