Kubernetes workers fail after a few hours #306
Comments
@alexkappa Are you still able to SSH into the disappearing nodes? If so, do you see docker hang when you run …?
I was monitoring it today until I observed the problem; you can find the journal log in this gist. I can't remember if I tried … Towards the end of the log, I notice an …
Yeah. If it is a … I am not sure the problem has gone away though (we have not stress-tested it yet). The …
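In case it helps anyone debugging the same symptom, here is a rough sketch of checking for a hung Docker daemon from a node that is still reachable over SSH (nothing here is specific to this setup; it is just the standard tooling):

```sh
# Is the daemon still responding? A hung daemon typically never returns from
# `docker ps`; the fallback message also fires on any other docker error.
timeout 10 docker ps || echo "docker did not respond within 10s"

# Daemon state and recent errors:
systemctl status docker --no-pager
journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50
```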
I think I figured out what was causing the nodes to fail, and everything points to a container using up all the resources, causing the kernel's OOM killer to kick in and kill processes randomly. What I recommend is setting default limits per namespace, to prevent containers from drowning your nodes. Ever since I set limits, I've noticed two Elasticsearch pods in status OOMKilled and restarting. The issue hasn't occurred since. Hope this helps!
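For anyone who wants to do the same, a minimal sketch of checking for OOM kills and adding default per-container limits to a namespace (the namespace and the numbers are placeholders, not a recommendation):

```sh
# Check whether the kernel OOM killer has been firing on a node:
dmesg | grep -i "killed process"

# Give every container in the namespace a default request/limit so a single
# pod cannot drown the node (values are illustrative):
kubectl create -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF
```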
@alexkappa If your observation pans out, it is probably a vital breakthrough in understanding this problem (the …)
This (and …). We have an open issue to bump the default memory limits on the Vagrant examples (#311), but in this case it would be up to the operator to ensure that the pods being scheduled do not exhaust the available resources on the nodes. See http://kubernetes.io/docs/user-guide/compute-resources/ for more info. I'm going to close this for now, but if we can provide additional assistance, please let us know.
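For completeness, a minimal sketch of declaring requests and limits on a pod so the scheduler can account for it, along the lines of the compute-resources doc linked above (image name and values are illustrative only):

```sh
# A pod that declares how much CPU/memory it needs; if it exceeds its memory
# limit, this container is OOM-killed instead of random processes on the node.
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch-example
spec:
  containers:
  - name: elasticsearch
    image: elasticsearch:2.3
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi
EOF
```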
Hello,
Please excuse me if this is the wrong place for an issue.
I've been having this issue for quite a while trying to bring up a Kubernetes cluster on AWS. I believe I've followed the manual very closely, and I'm starting to run out of things to look for.
I'm using an auto scaling group for nodes, which are placed in three subnets.
When I bring the cluster up, everything runs normally: I create my services and pods and all is fine. After some time, however, worker nodes start disappearing from `kubectl get nodes`. The master has never failed so far. You can find the kubelet systemd units in the gist below, along with the AWS system log of two nodes. I'd be happy to share more information if need be, but I would like to mask some information first.
https://gist.github.com/alexkappa/71f7fbdc566d5cedb318
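For reference, roughly the commands used to gather the above (the unit name `kubelet.service` is an assumption; your systemd setup may name it differently):

```sh
# Watch for nodes dropping out of the cluster:
kubectl get nodes

# On a node that has disappeared (over SSH), pull the kubelet and kernel logs:
journalctl -u kubelet.service --no-pager | tail -n 200
dmesg | tail -n 100
```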
I've looked everywhere for a common issue, and my searching has led me to:
kubernetes/kubernetes#20096
moby/moby#5618
moby/moby#20871
coreos/bugs#965
But even after upgrading the cluster to v1.2.0-beta.0 and setting `--hairpin-mode` to `promiscuous-bridge` and `none`, the problem was not fixed. I was using version v1.1.8 prior to that.
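As a sanity check (a rough sketch; the bridge name `docker0` is an assumption and may be `cbr0` depending on the setup), hairpin mode on the bridge ports can be inspected directly:

```sh
# Each veth attached to the bridge exposes its hairpin setting in sysfs;
# 1 means hairpin mode is enabled on that port.
for port in /sys/class/net/docker0/brif/*; do
  echo "$(basename "$port"): $(cat "$port/hairpin_mode")"
done
```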
Any help is greatly appreciated!