This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Kubernetes workers fail after a few hours #306

Closed
alexkappa opened this issue Mar 12, 2016 · 6 comments

Comments

@alexkappa

Hello,

Please excuse me if this is the wrong place for an issue.

I've been having this issue for quite a while trying to bring up a kubernetes cluster in AWS. I believe I've followed the manual very closely and I'm starting to run out of things to look for.

I'm using an auto scaling group for nodes, which are placed in three subnets.

When I bring the cluster up, everything runs normally: I create my services and pods and all is fine. After some time, though, worker nodes start disappearing from kubectl get nodes. The master has never failed so far.

You can find the kubelet systemd units in the gist below, along with the AWS system log of two nodes. I'd be happy to share more information if need be, but I would like to mask some information first.

https://gist.github.com/alexkappa/71f7fbdc566d5cedb318

I've looked for a common issue everywhere and my searching has led me to:

kubernetes/kubernetes#20096
moby/moby#5618
moby/moby#20871
coreos/bugs#965

But even after upgrading the cluster to v1.2.0-beta.0 and setting --hairpin-mode to both promiscuous-bridge and none, the problem was not fixed.

I was using version v1.1.8 prior to that.

Any help is greatly appreciated!

@gopinatht

@alexkappa Are you still able to SSH into the disappearing nodes? If so, do you see docker hang when you run docker ps?

@alexkappa
Author

I was monitoring it today until I observed the problem; you can find the journal log in this gist.

To be honest, I can't remember if I tried docker ps this time. I believe I tried it on another occasion and it was hanging. I can try it next time to be sure, if you like.

Towards the end of the log, I notice the oom-killer being invoked. Could it be that some of the services are getting killed? My CloudWatch metrics don't show memory being all that high (~35%).

@gopinatht

Yeah. If it is a docker ps issue similar to what I am facing, we are basically stuck, as I could not find a definitive answer as to what is causing the problem. Having said that, I am now using k8s version 1.2.0-beta.0 on CoreOS alpha version 976.0.0 and the cluster is a lot more stable (I left the hairpin mode at the default, --hairpin-mode="promiscuous-bridge").

I am not sure the problem has gone away, though (we have not stress tested it yet). The docker ps hang seems to be a symptom of a basket of issues, and I could not find any consensus on what the underlying cause is (just conjecture that the problem is in the kernel).

@alexkappa
Author

I think I figured out what was causing nodes to fail, and everything points to a container using up all the resources, causing the kernel's OOM killer to kick in and kill processes seemingly at random.

What I recommend is setting default limits per namespace to prevent containers from starving your nodes. Ever since I set limits, I've noticed two Elasticsearch pods in status OOMKilled and restarting, and the issue hasn't occurred since.
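
For anyone else hitting this, something along these lines is what I mean (a minimal sketch; the namespace, resource name, and values are only illustrative, not my actual config):

```yaml
# Illustrative LimitRange: containers in this namespace that don't declare
# their own memory request/limit inherit these defaults, so a runaway
# container gets OOM-killed inside its own cgroup instead of taking the
# whole node down.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem-limits
  namespace: default
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi
      default:
        memory: 512Mi
```

Create it with kubectl create -f limitrange.yaml; it applies to containers created in that namespace afterwards.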

Hope this helps!

@gopinatht

@alexkappa If your observation pans out, it is probably a vital breakthrough in understanding this problem (the docker ps hang). Thanks for sharing.

@aaronlevy
Contributor

This (and the docker ps hang) sounds like resource exhaustion.

We have an open issue to bump the default memory limits in the Vagrant examples (#311), but in this case it would be up to the operator to ensure that the pods being scheduled do not exhaust the available resources on the nodes. See http://kubernetes.io/docs/user-guide/compute-resources/ for more info.
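
For example, requests and limits can also be set per container directly in the pod spec (a minimal sketch; the pod name, image, and values are only illustrative):

```yaml
# Illustrative pod spec: requests let the scheduler account for the pod
# when placing it, limits cap what the container may consume on the node.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```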

I'm going to close this for now, but if we can provide additional assistance, please let us know.
