Node NotReady status with "Kubelet stopped posting node status error" #34314

Closed
axsuul opened this issue Oct 7, 2016 · 25 comments

@axsuul

axsuul commented Oct 7, 2016

On k8s 1.4, using kubeadm to provision the cluster:

I have the node and master on the same server. Suddenly my node is reporting a NotReady status. Running

# kubectl describe node <NODE>

returns

Name:                   operate
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubeadm.alpha.kubernetes.io/role=master
                        kubernetes.io/hostname=operate
Taints:                 <none>
CreationTimestamp:      Thu, 06 Oct 2016 23:57:52 +0000
Phase:
Conditions:
  Type                  Status          LastHeartbeatTime                       LastTransitionTime                      Reason           Message
  ----                  ------          -----------------                       ------------------                      ------           -------
  OutOfDisk             Unknown         Fri, 07 Oct 2016 08:13:50 +0000         Fri, 07 Oct 2016 08:14:30 +0000         NodeStatusUnknown Kubelet stopped posting node status.
  MemoryPressure        False           Fri, 07 Oct 2016 08:13:50 +0000         Thu, 06 Oct 2016 23:57:52 +0000         KubeletHasSufficientMemory        kubelet has sufficient memory available
  DiskPressure          False           Fri, 07 Oct 2016 08:13:50 +0000         Thu, 06 Oct 2016 23:57:52 +0000         KubeletHasNoDiskPressure  kubelet has no disk pressure
  Ready                 Unknown         Fri, 07 Oct 2016 08:13:50 +0000         Fri, 07 Oct 2016 08:14:30 +0000         NodeStatusUnknown Kubelet stopped posting node status.
Addresses:              10.138.0.2,10.138.0.2
Capacity:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   1
 memory:                                1737208Ki
 pods:                                  110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   1
 memory:                                1737208Ki
 pods:                                  110
System Info:
 Machine ID:                    af77f36e18459f0d0d262ed74e977e59
 System UUID:                   AF77F36E-1845-9F0D-0D26-2ED74E977E59
 Boot ID:                       617db356-a6da-4099-9b63-ad5f993178fd
 Kernel Version:                4.4.0-38-generic
 OS Image:                      Ubuntu 16.04.1 LTS
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.11.2
 Kubelet Version:               v1.4.0
 Kube-Proxy Version:            v1.4.0
ExternalID:                     operate
Non-terminated Pods:            (8 in total)
  Namespace                     Name                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                            ------------    ----------      --------------- -------------
  kube-system                   etcd-operate                                    200m (20%)      0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-controller-manager-operate                 200m (20%)      0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-discovery-982812725-kkarx                  0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-dns-2247936740-fse3h                       210m (21%)      210m (21%)      390Mi (22%)     390Mi (22%)
  kube-system                   kube-proxy-amd64-x3x3m                          0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-scheduler-operate                          100m (10%)      0 (0%)          0 (0%)          0 (0%)
  kube-system                   kubernetes-dashboard-1655269645-0hzho           0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   weave-net-r38tz                                 20m (2%)        0 (0%)          0 (0%)          0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  730m (73%)    210m (21%)      390Mi (22%)     390Mi (22%)

I've tried restarting the server with no success. How would I debug this? Thanks

@axsuul
Author

axsuul commented Oct 14, 2016

This issue came up again. I've tried debugging with

$ sudo journalctl -u kubelet

to view the logs. Nothing out of the ordinary. These also look fine:

$ systemctl status kubelet
$ systemctl status docker

How can I debug this?
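
For anyone hitting the same symptom, a minimal sketch of checks to narrow it down; the node name, API server address, and time window below are placeholders:

# On the master: confirm the node object and its conditions
$ kubectl get nodes -o wide
$ kubectl describe node <NODE> | grep -A 8 Conditions

# On the affected node: kubelet and container runtime health
$ systemctl status kubelet docker
$ sudo journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 100

# Verify the node can still reach the API server
$ curl -k https://<MASTER_IP>:6443/healthz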

@axsuul
Author

axsuul commented Oct 14, 2016

Ok this was related to changing the kubeadm cluster IP... I think.

@axsuul axsuul closed this as completed Oct 14, 2016
@tuananh

tuananh commented Nov 16, 2016

I had the same issue. I'm using GKE (Google Container Engine).

@wstrange
Contributor

wstrange commented Jan 4, 2017

I just ran into this on GKE 1.5.1 with alpha features turned on.

The problem appeared when the cluster auto-scaled. The first node went to status NotReady with:
Kubelet stopped posting node status

The node was non-responsive; I could not ssh into it. Restarting the node cleared the status.

@dev-e

dev-e commented Jan 23, 2017

The same problem on CoreOS, k8s 1.5.2. After recreating the /var/lib/kubelet directory and re-registering the master node, I get this repeating message in the log:

E0123 08:22:50.647822 887 kubelet_node_status.go:302] Error updating node status, will retry: Operation cannot be fulfilled on nodes "z14-0546-amis-c.vesta.ru": the object has been modified; please apply your changes to the latest version and try again

The node status becomes "NotReady", and pods created by ReplicationControllers with a NodeSelector matching this node get status "Pending" with reason "MatchNodeSelector". Rebooting does not help.

@greglearns

I just had the same problem on k8s 1.4.7 stable. Very little was running on my cluster (1 master, 2 workers) other than Deis, running on AWS, launched by Kops. Both workers had the same problems as above. AWS CloudWatch reported everything was fine on all servers.

Name:                   ip-172-20-116-89.us-west-2.compute.internal
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t2.micro
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=us-west-2
                        failure-domain.beta.kubernetes.io/zone=us-west-2c
                        kubernetes.io/hostname=ip-172-20-116-89.us-west-2.compute.internal
Taints:                 <none>
CreationTimestamp:      Tue, 24 Jan 2017 20:52:53 -0700
Phase:
Conditions:
  Type                  Status          LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------          -----------------                       ------------------                      ------                          -------
  OutOfDisk             Unknown         Fri, 27 Jan 2017 10:38:42 -0700         Fri, 27 Jan 2017 10:39:26 -0700         NodeStatusUnknown               Kubelet stopped posting node status.
  MemoryPressure        False           Fri, 27 Jan 2017 10:38:42 -0700         Tue, 24 Jan 2017 20:52:53 -0700         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False           Fri, 27 Jan 2017 10:38:42 -0700         Tue, 24 Jan 2017 20:52:53 -0700         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 Unknown         Fri, 27 Jan 2017 10:38:42 -0700         Fri, 27 Jan 2017 10:39:26 -0700         NodeStatusUnknown               Kubelet stopped posting node status.
  NetworkUnavailable    False           Sat, 28 Jan 2017 08:03:04 -0700         Sat, 28 Jan 2017 08:03:04 -0700         RouteCreated                    RouteController created a route

@dev-e

dev-e commented Feb 1, 2017

Problem solved by applying changes to the kubelet configuration (/etc/systemd/system/kubelet.service) according to the latest version of the reference page on CoreOS: https://coreos.com/kubernetes/docs/latest/deploy-master.html
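
For reference, after editing that unit the change is applied with the usual systemd reload/restart sequence (assuming a systemd-based host), for example:

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
$ systemctl status kubelet --no-pager
$ sudo journalctl -u kubelet -f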

@skylabreddy

I am also facing the same issue.

I see a problem after deploying the app: the deployment is created successfully, but 0 pods become available.

root@kubernetes:~# kubectl run kubernetes-bootcamp --image=docker.io/jocatalin/kubernetes-bootcamp:v1 --port=8080
deployment "kubernetes-bootcamp" created
root@kubernetes:~# kubectl get deployments
NAME                  DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kubernetes-bootcamp   1         1         1            0           15s
skycouch              1         1         1            0           2d
test                  1         1         1            0           3d

Can you suggest how to resolve this?

root@kubernetes:~# kubectl get nodes
NAME         STATUS     ROLES     AGE   VERSION
kubenode1    NotReady             3d    v1.8.3
kubenode2    NotReady             3d    v1.8.3
kubernetes   NotReady   master    3d    v1.8.3

Thanks
Skylab

@viveksinghggits

Ok this was related to changing the kubeadm cluster IP... I think.

@axsuul were you able to resolve the issue? Can you share the details? I also encountered the same issue, where the master and worker are on the same node (a single-node cluster).

@axsuul
Author

axsuul commented Mar 14, 2019

@viveksinghggits Sorry, I ended up moving to Docker Swarm and I don't remember the details anymore.

@SaltedEggIndomee

I'm having the same issue on EKS with Kubernetes 1.12.

Minimal steps to reproduce (a rough command sketch follows below):

  1. Create a deployment with 1 replica, on 2 nodes.
  2. Create an HPA with a 50% CPU target, minpods 1, maxpods 3.
  3. Overload the CPU on the first Pod.
  4. Watch HPA scaling with "kubectl get hpa -w".
  5. After 1 minute, see 1 node go down with NotReady status.
  6. After 30 minutes, the node is still in NotReady status, even after the HPA has scaled back down to 1 Pod.

Rebooting the EC2 instance doesn't help.
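
A rough command sketch of those steps, using a placeholder deployment name and image (not from the original report):

# 1-replica deployment on a 2-node cluster (placeholder name/image)
$ kubectl create deployment cpu-demo --image=nginx
$ kubectl scale deployment cpu-demo --replicas=1

# HPA: 50% CPU target, min 1 pod, max 3 pods
$ kubectl autoscale deployment cpu-demo --cpu-percent=50 --min=1 --max=3

# Overload the pod's CPU, then watch the HPA and the node status
$ kubectl get hpa -w
$ kubectl get nodes -w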

@ghost

ghost commented May 17, 2019

I'm having the same issue. Is the issue resolved? If yes, can anyone provide step-by-step instructions for resolving it?

@JnMik

JnMik commented Jul 4, 2019

Happens to me as well in AWS EKS.
Any hints?

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 08:26:42 -0400   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.

Can't log into the instance to inspect kubelet. Seems the instance is frozen or something

Edit: Follow up here awslabs/amazon-eks-ami#79

@ghost

ghost commented Jul 4, 2019 via email

@bobbui

bobbui commented Jul 5, 2019

Happens to me as well in AWS EKS.
Any hint ?

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 08:26:42 -0400   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Thu, 04 Jul 2019 10:12:19 -0400   Thu, 04 Jul 2019 10:13:04 -0400   NodeStatusUnknown         Kubelet stopped posting node status.

Can't log into the instance to inspect kubelet. Seems the instance is frozen or something

Edit: Follow up here awslabs/amazon-eks-ami#79

Happens to me as well; it started when I was running a stress test against the services running inside the cluster.

@mansurali901

In my case: first find any HPA that is exceeding resources; deleting that HPA worked.
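
A minimal sketch of that check (the HPA name and namespace are placeholders):

# List HPAs in all namespaces and look for ones pinned at their max replicas
$ kubectl get hpa --all-namespaces
# Delete the offending HPA
$ kubectl delete hpa <HPA_NAME> -n <NAMESPACE>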

@bagulm123

Is there any solution to this issue? I observed it when my cluster got autoscaled. The first worker node became NotReady and it's still in that status now (after 8 hours).

@truongtrevor

Same issue here, using minikube.

@truongtrevor

CreationTimestamp: Sun, 26 Jul 2020 18:41:43 +0700
Taints: node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: localhost.localdomain
AcquireTime:
RenewTime: Sun, 26 Jul 2020 19:44:19 +0700
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Sun, 26 Jul 2020 19:42:21 +0700   Sun, 26 Jul 2020 21:26:06 +0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Sun, 26 Jul 2020 19:42:21 +0700   Sun, 26 Jul 2020 21:26:06 +0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Sun, 26 Jul 2020 19:42:21 +0700   Sun, 26 Jul 2020 21:26:06 +0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Sun, 26 Jul 2020 19:42:21 +0700   Sun, 26 Jul 2020 21:26:06 +0700   NodeStatusUnknown   Kubelet stopped posting node status.

@nemo-xue

Hi, the issue is closed, but does anyone have a solution for it?

@immanuelfodor

Maybe this thread helps you; you probably need to reserve resources for host daemons using kubelet args: rancher/rancher#29997 (comment)
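
To illustrate that suggestion, the kubelet has flags for reserving capacity for system and Kubernetes daemons; the values below are placeholders to be tuned per node size:

# Example kubelet arguments (illustrative values only)
--system-reserved=cpu=500m,memory=512Mi
--kube-reserved=cpu=500m,memory=512Mi
--eviction-hard=memory.available<500Mi,nodefs.available<10%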

@nemo-xue

Thanks @immanuelfodor.
I found there were many pending CSRs.
This command helped solve my issue: "oc get csr -o name | xargs oc adm certificate approve"

@thomasresley

thomasresley commented Mar 12, 2021

The problem is likely that the memory and processing resources in the cluster don't match the workload; that is, you have exhausted the cluster's resources and need to deploy more worker nodes. Restart all the instances at once, give them some time to reboot, and restart all the Kubernetes resources on the cluster. This worked for me on AWS.

@dinesh25cs

dinesh25cs commented Feb 11, 2022

I got the same issue. We debugged it using the commands below and it really works.

KUBERNETES:
Deleting node and rejoining it to the cluster:
On MASTER:

  1. kubectl uncordon node_name
  2. kubectl delete node node_name
  3. kubeadm token create --print-join-command (prints the kubeadm join command for the cluster)
    On NODE:
  4. kubeadm reset
  5. kubeadm join 10.87.208.94:6443 --token eah77w.1yfl82ahipkdr1da --discovery-token-ca-cert-hash sha256:15e3637fa73615d30b97c162e610709384c8a395755dd6bba7982cde1a458da8
    [preflight] Running pre-flight checks

[root@cerebro05 etc]# kubeadm join 10.87.208.94:6443 --token eah77w.1yfl82ahipkdr1da --discovery-token-ca-cert-hash sha256:15e3637fa73615d30b97c162e610709384c8a395755dd6bba7982cde1a458da8
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher
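
Those preflight errors usually mean state from the previous registration is still on the node. One way to clear it before rejoining (destructive, assuming you really do want to wipe the node's Kubernetes state; the join parameters are the ones printed on the master):

$ sudo systemctl stop kubelet
$ sudo kubeadm reset -f
# If files still linger, remove them explicitly before rejoining
$ sudo rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/pki/ca.crt
$ sudo kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>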

@rajaduraicloud

Check whether swap is on or off: free -m
If swap is on, turn it off: sudo swapoff -a
Now it works!
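
Note that swapoff -a only lasts until the next reboot; recent kubelets also refuse to start with swap enabled unless run with --fail-swap-on=false. To keep swap off permanently, comment out the swap entry in /etc/fstab as well, for example:

# Comment out the swap line so it stays disabled after reboot
$ sudo sed -i '/ swap / s/^/#/' /etc/fstab
$ free -m   # verify that swap shows 0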
