Kubernetes workers fail after a few hours #306
Comments
@alexkappa Are you still able to SSH into the disappearing nodes? If so, do you see docker hang when you run …?
I was monitoring it today until I observed the problem; you can find the journal log in this gist. I can't remember if I tried … Towards the end of the log, I notice an …
Yeah. If it is a … I am not sure the problem has gone away though (we have not stress-tested it yet). The …
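In case it helps anyone debugging the same symptom, here is a rough sketch of checking for a hung Docker daemon from a node that is still reachable over SSH (nothing here is specific to this setup; it is just the standard tooling):

```sh
# Is the daemon still responding? A hung daemon typically never returns from
# `docker ps`; the fallback message also fires on any other docker error.
timeout 10 docker ps || echo "docker did not respond within 10s"

# Daemon state and recent errors:
systemctl status docker --no-pager
journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50
```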
I think I figured out what was causing the nodes to fail, and everything points to a container using up all the resources, causing the kernel's OOM killer to kick in and kill processes randomly. What I recommend is setting default limits per namespace, to prevent containers from drowning your nodes. Ever since I set limits, I've noticed two Elasticsearch pods in status OOMKilled and restarting. The issue hasn't occurred since. Hope this helps!
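For anyone who wants to do the same, a minimal sketch of checking for OOM kills and adding default per-container limits to a namespace (the namespace and the numbers are placeholders, not a recommendation):

```sh
# Check whether the kernel OOM killer has been firing on a node:
dmesg | grep -i "killed process"

# Give every container in the namespace a default request/limit so a single
# pod cannot drown the node (values are illustrative):
kubectl create -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF
```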
@alexkappa If your observation pans out, it is probably a vital breakthrough in understanding this problem (the …)
This (and …). We have an open issue to bump the default memory limits on the Vagrant examples (#311), but in this case it would be up to the operator to ensure that the pods being scheduled do not exhaust the available resources on the nodes. See http://kubernetes.io/docs/user-guide/compute-resources/ for more info. I'm going to close this for now, but if we can provide additional assistance, please let us know.
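For completeness, a minimal sketch of declaring requests and limits on a pod so the scheduler can account for it, along the lines of the compute-resources doc linked above (image name and values are illustrative only):

```sh
# A pod that declares how much CPU/memory it needs; if it exceeds its memory
# limit, this container is OOM-killed instead of random processes on the node.
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch-example
spec:
  containers:
  - name: elasticsearch
    image: elasticsearch:2.3
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi
EOF
```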
Hello,
Please excuse me if this is the wrong place for an issue.
I've been having this issue for quite a while trying to bring up a Kubernetes cluster on AWS. I believe I've followed the manual very closely, and I'm starting to run out of things to look for.
I'm using an auto scaling group for nodes, which are placed in three subnets.
When I bring the cluster up, everything runs normally: I create my services and pods and all is fine. After some time, however, worker nodes start disappearing from `kubectl get nodes`. The master has never failed so far. You can find the kubelet systemd units in the gist below, along with the AWS system log of two nodes. I'd be happy to share more information if need be, but I would like to mask some information first.
https://gist.github.com/alexkappa/71f7fbdc566d5cedb318
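For reference, roughly the commands used to gather the above (the unit name `kubelet.service` is an assumption; your systemd setup may name it differently):

```sh
# Watch for nodes dropping out of the cluster:
kubectl get nodes

# On a node that has disappeared (over SSH), pull the kubelet and kernel logs:
journalctl -u kubelet.service --no-pager | tail -n 200
dmesg | tail -n 100
```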
I've looked everywhere for a common issue, and my searching has led me to:
kubernetes/kubernetes#20096
moby/moby#5618
moby/moby#20871
coreos/bugs#965
But even after upgrading the cluster to v1.2.0-beta.0 and setting `--hairpin-mode` to `promiscuous-bridge` and `none`, the problem was not fixed. I was using version v1.1.8 prior to that.
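As a sanity check (a rough sketch; the bridge name `docker0` is an assumption and may be `cbr0` depending on the setup), hairpin mode on the bridge ports can be inspected directly:

```sh
# Each veth attached to the bridge exposes its hairpin setting in sysfs;
# 1 means hairpin mode is enabled on that port.
for port in /sys/class/net/docker0/brif/*; do
  echo "$(basename "$port"): $(cat "$port/hairpin_mode")"
done
```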
Any help is greatly appreciated!