Nodes stuck in NotReady leaves pods in Pending state (does not autoscale) #37995
Comments
I'm seeing something similar with autoscaling, on GKE / Kubernetes 1.5.1. In my case the new autoscaled node eventually becomes non-responsive and enters a NotReady state. I can't even SSH into the node from the cloud console - it appears to hang. The serial port output shows nothing of interest. I am using PVCs, and I have a suspicion this may be related to attach/detach of PVC disks. If I reset the node in the cloud console, the cluster eventually seems to recover, and I can SSH into the node.
I get the same thing on GKE with nodes on version 1.4.7, but without autoscaling. Every couple of days, as my CI system updates the image on my deployments, I notice my new pods can't be scheduled, my old pods are gone, and 2 of my 3 nodes are NotReady.
/sig node
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Is this a request for help?: No
(I submitted a help request through Google Support who did some of the research below but stated that "This is an issue that need to be address by Kubernetes engineers.")
What keywords did you search in Kubernetes issues before filing this one? "notready" "autoscale"
#4135 discusses similar problems with out-of-disk errors, but ours were related to out-of-memory, which is configurable on nodes.
#34772 is related to a race condition with scheduling; my issue has to do with node state.
BUG REPORT:
Kubernetes version (use kubectl version):
Environment:
Kernel (e.g. uname -a): Linux report-3312827547-hony0 4.4.21+ #1 SMP Thu Nov 10 21:43:53 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

What happened:
We have nodes that stop posting their node status back to kubernetes.
This leaves the node in a NotReady state, which means that pods cannot be scheduled on it.
Our cluster is set up with two node groups, both of which are configured for autoscaling. However, because the node still exists, the autoscaler won't add a new node (in either group), and because the node is stuck in the NotReady state, Kubernetes can't schedule any pods on it.
This leaves us in a situation where we have pods that are waiting to be scheduled.
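For what it's worth, the Ready condition on an affected node looks roughly like the sketch below (reconstructed from memory rather than captured from our cluster, so treat the exact field values as an approximation):

```yaml
# Abridged node status for a node that has stopped posting status
status:
  conditions:
  - type: Ready
    status: "Unknown"                          # set by the node controller after missed heartbeats
    reason: NodeStatusUnknown
    message: Kubelet stopped posting node status.
```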
Trying to SSH into the node just spins at "Establishing connection to SSH server". I've let it try for over an hour and it won't connect. The only way I've found to resolve this is to reset the node.
In investigating with Google Support we determined that the node had reached an OOM condition that appeared to crash kubelet (or something). The solution Google Support suggested was to set memory limits on every container.
We set memory limits on most of our containers, but continue to see this issue.
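For context, here is a minimal sketch of the kind of per-container memory settings Support was suggesting; the names, image, and values are placeholders for illustration, not our actual manifests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: report-worker                   # placeholder name
spec:
  containers:
  - name: worker
    image: example/report-worker:1.0    # placeholder image
    resources:
      requests:
        memory: "256Mi"                 # what the scheduler reserves on the node
      limits:
        memory: "512Mi"                 # the container is OOM-killed if it exceeds this
```

With a limit set, the container's own OOM kill should presumably happen before the node itself runs out of memory, which I take to be the reasoning behind the suggestion.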
What you expected to happen:
Setting a memory limit on all containers feels counterproductive to me; if Kubernetes can fail when the system runs out of memory, I would expect it to protect against that (i.e. kill any container that is exceeding the available memory on the node, or something similar).
Additionally, if a node stops responding, I expect that to be a different state than a node that is starting up, so that when nodes are "NotReady" because they have stopped responding, the autoscaler will spin up new nodes to satisfy the "Pending" pod requirements.
(If you want me to split this into two issues, let me know.)
How to reproduce it (as minimally and precisely as possible):
I tried to build a test cluster. It doesn't seem to crash the nodes though, so something more complicated than my "use all the memory" script might be necessary.
Manifests for the steps below are in this gist; a rough sketch of similar manifests also follows the steps.
1. Spin up an autoscaling cluster and load a deployment with enough replicas that the cluster has to grow.
2. Wait for the cluster to add a new node and schedule the pod.
3. Then spin up a pod that will consume all the memory on a node.
4. Wait for the node to run out of memory and crash.
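Since the gist link didn't survive here, this is a rough sketch of the kind of manifests those steps use; the image names, replica count, and memory sizes are placeholders chosen for illustration, not the actual gist contents:

```yaml
# Step 1: a deployment with enough requested memory to force the autoscaler to add a node
apiVersion: extensions/v1beta1      # Deployment API group for 1.4/1.5-era clusters
kind: Deployment
metadata:
  name: filler
spec:
  replicas: 10
  template:
    metadata:
      labels:
        app: filler
    spec:
      containers:
      - name: filler
        image: nginx                # placeholder workload
        resources:
          requests:
            memory: "1Gi"           # sized so the replicas don't fit on the existing nodes
---
# Step 3: a pod with no memory limit that allocates until the node runs out
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  restartPolicy: Never
  containers:
  - name: hog
    image: python:3                 # any image with a Python interpreter would do
    command: ["python", "-c", "x = []\nwhile True: x.append(bytearray(10 * 1024 * 1024))"]
```

The important part is that the filler deployment has memory requests (so the autoscaler has to add a node) while the memory-hog pod has no limit (so it can drive the node itself out of memory).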