Pods failing after restart of VM #8850

Closed
dimakyriakov opened this issue May 20, 2022 · 19 comments
Labels
kind/bug, lifecycle/rotten

Comments

@dimakyriakov

Environment:

  • hardware configuration:

  • OS: Ubuntu 20.04.4 LTS

  • Version of Ansible: ansible 2.10.15

  • Version of Python3: Python 3.8.10

Kubespray version (commit): 2cc5f04

Full inventory with variables:

all:
  hosts:
    node1:
      ansible_host: 192.168.2.211
      ip: 192.168.2.211
      access_ip: 192.168.2.211
    node2:
      ansible_host: 192.168.2.212
      ip: 192.168.2.212
      access_ip: 192.168.2.212
    node3:
      ansible_host: 192.168.2.213
      ip: 192.168.2.213
      access_ip: 192.168.2.213
    node4:
      ansible_host: 192.168.2.214
      ip: 192.168.2.214
      access_ip: 192.168.2.214
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/newCluster/hosts.yaml --become --become-user=root cluster.yml

Output of ansible run:

(screenshots of the ansible run output)

Anything else we need to know:
After I rebooted the VM where the master k8s node was installed, all pods fail to come up with this error: "Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice""

@dimakyriakov added the kind/bug label May 20, 2022
@yankay
Member

yankay commented May 24, 2022

Hi @dimakyriakov. I tried it on Ubuntu 20.04.4 LTS and everything is OK.
According to kubernetes/minikube#5223.
Would you please give me more information about Docker's cgroup driver config and the kubelet's cgroup driver config?
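
A minimal sketch of how both cgroup drivers can be inspected on a node; the kubelet config path below is the one kubespray normally writes and is an assumption for other setups:

# Docker side: active cgroup driver and any explicit setting
docker info --format '{{ .CgroupDriver }}'
grep -ri cgroupdriver /etc/docker/daemon.json /etc/systemd/system/docker.service.d/ 2>/dev/null
# kubelet side (assumed kubespray path; adjust if your config lives elsewhere)
grep -i cgroupDriver /etc/kubernetes/kubelet-config.yaml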

@dimakyriakov
Author

dimakyriakov commented May 25, 2022

Hello, @yankay
Piece of the k8s-cluster.yml file: (screenshot)

docker-options.conf: (screenshot)
I don't have a daemon.json by default.

Kubelet is down after the reboot: (screenshot)

kubelet.env: (screenshot)
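
For context, this particular error text is what dockerd returns when it runs with the systemd cgroup driver but is handed a cgroupfs-style cgroup parent, i.e. Docker and the kubelet/CRI disagree on the cgroup driver. A hedged sketch of aligning both on systemd follows; note that kubespray manages docker-options.conf, so if the driver is already set there it should be changed there rather than duplicated in daemon.json (Docker refuses to start when the same option is set in both places):

# only if the kubelet config says cgroupDriver: systemd while Docker reports cgroupfs
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker kubelet
# if Docker already reports systemd, it is the kubelet/CRI side that needs to move to systemd instead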

@julienlau

julienlau commented Jul 28, 2022

Same problem here on a set of Ubuntu 20.04 desktop VMs with kubespray commit c24a3a3
and Kubernetes version 1.24.3.

The install went fine with the following command on a setup with a single master node and 2 worker nodes.

ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v -e container_manager=docker

Reboot is CHAOS!

  • I was forced to apply these tweaks on the worker nodes for them to restart properly:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
# after each reboot DNS may be broken. If 192.168.122.1 was your DNS:
sudo systemd-resolve --interface enp1s0 --set-dns 192.168.122.1
# if exactly one /etc/kubernetes/$f.* variant exists, back up $f and replace it with that variant
for f in admin.conf controller-manager.conf kubelet.conf scheduler.conf ; do
  sudo [ `ls /etc/kubernetes/$f.* 2>/dev/null | wc -l` -eq 1 ] \
    && sudo cp /etc/kubernetes/$f /etc/kubernetes/bk-$f \
    && sudo cp /etc/kubernetes/$f.* /etc/kubernetes/$f
done
  • I cannot restart my single master node due to this issue with the Docker cgroup slice:
[root@k8s-master-1]> docker start k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"
Error: failed to start containers: k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
journalctl -r -u kubelet
-- Logs begin at Mon 2022-07-25 15:33:06 CEST, end at Thu 2022-07-28 10:23:22 CEST. --
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: I0728 10:23:22.157936    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.146025    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.045172    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.968052    1938 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": dial tcp 127.0.0>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967381    1938 kubelet_node_status.go:70] "Attempting to register node" node="k8s-master-1"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967241    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967069    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.966780    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.957578    1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.944303    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.842241    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.740667    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.640006    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.539524    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.438500    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.337455    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.236156    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.157571    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.135497    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.035193    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.934746    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.833561    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.732640    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.631545    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.530963    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.430640    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.329996    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299770    1938 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-k8s-master-1_kube-sy>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299729    1938 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299700    1938 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299635    1938 remote_runtime.go:212] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create a sandbox for >
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295429    1938 kuberuntime_manager.go:488] "No ready sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-controller-manager-k8>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295032    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295019    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.294948    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.285246    1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.228558    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.158092    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.127582    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"

I think it may be caused by the fact that cgroup v2 is disabled!

ll /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory

etcd is running fine.
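
A small check that can help tell which side is out of step (container name taken from the failing docker start command above): if the stored cgroup parent is a cgroupfs-style path such as /kubepods/... while Docker reports the systemd driver, that matches the "valid slice" error:

docker info --format '{{ .CgroupDriver }}'
docker inspect --format '{{ .HostConfig.CgroupParent }}' k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0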

@julienlau

julienlau commented Jul 28, 2022

Enabling cgroup v2 makes rebooting the master nodes possible... :(
Pre-req: make sure cgroup v2 is enabled on the hosts:

# if file exists cgroupv2 is OK
ll /sys/fs/cgroup/cgroup.controllers
# enable:
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo sed -i -e 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"/' /etc/default/grub
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo update-grub
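
A quick follow-up check after the reboot to confirm the change took effect (stat -fc is GNU coreutils):

# after rebooting:
stat -fc %T /sys/fs/cgroup                                   # prints "cgroup2fs" once the unified hierarchy is mounted
grep -o 'systemd.unified_cgroup_hierarchy=1' /proc/cmdline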

The Kubernetes cluster reboots, but I still have coredns pods that do not restart:

2022/07/28 12:26:10 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
[FATAL] plugin/loop: Loop (169.254.25.10:59357 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4659600271498259777.1850538492869913665."
stream closed

Logs of coredns on the worker node that did not restart show:

2022/07/28 10:52:52 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0

I still need these tweaks on all nodes:

sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
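
A small sketch for checking whether that tweak survived a reboot, using the unit names from this thread:

systemctl is-enabled cri-dockerd.service
systemctl is-active cri-dockerd.service kubelet.service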

@julienlau

julienlau commented Jul 28, 2022

The issue with coredns seems to be linked to #5835.

As @k8s-ci-robot mentioned, Ubuntu 20.04 support is merged in master. In addition to the coredns loop error, there are a couple of other things that can come up depending on your environment:
- IPVS mode not supported with the KVM kernel in Ubuntu 20.04
- Mitogen + Ubuntu 20.04 requires specifying the python interpreter path
- enable_nodelocaldns: false required in some instances

I use KVM...
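
For reference, a sketch of where those three workarounds could be expressed in a kubespray inventory; the group_vars path assumes the default sample layout and should be adapted to your inventory:

# append to the cluster group_vars (path is an assumption)
cat <<'EOF' >> inventory/local/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_mode: iptables      # avoid IPVS with the Ubuntu 20.04 KVM kernel
enable_nodelocaldns: false     # only where needed, per the list above
EOF
# for Mitogen, pin the interpreter in the inventory, e.g.:
#   ansible_python_interpreter: /usr/bin/python3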

@julienlau

enable_nodelocaldns: false does not solve the issue with the coredns crash loop.
The crash loop now occurs only on the master nodes and not on every node.

@julienlau

Using iptables instead of IPVS does not solve the coredns crash loop after reboot.

@julienlau

julienlau commented Jul 28, 2022

From the https://coredns.io/plugins/loop/#troubleshooting link it seems that disabling systemd-resolved is worth a shot...

Hurrah, it works!
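
For completeness, a minimal sketch of what "disabling systemd-resolved" usually amounts to here; the DNS address reuses the libvirt default mentioned earlier in this thread, and the deployment name assumes kubespray's default coredns deployment:

sudo systemctl disable --now systemd-resolved.service
sudo rm /etc/resolv.conf                                     # usually a symlink to the 127.0.0.53 stub
echo 'nameserver 192.168.122.1' | sudo tee /etc/resolv.conf  # point at the upstream DNS instead of the stub
kubectl -n kube-system rollout restart deployment coredns    # let coredns pick up the new resolv.conf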

@floryut
Member

floryut commented Jul 29, 2022

Great that you found a way to fix your coredns issue 👍

@julienlau

@floryut do you know if the test suite includes a simple restart of the cluster?

@floryut
Member

floryut commented Aug 1, 2022

@floryut do you know if the test suite includes a simple restart of the cluster?

They do not; it would be possible to add one though. But come to think of your issue, it is strange that you have to disable systemd-resolved. To get coredns working, pointing the resolv.conf to /run/systemd/resolve/resolv.conf was enough to fix the coredns issue on Ubuntu, AFAIK.
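
A sketch of how that alternative could be checked and applied on a node without disabling systemd-resolved; the kubelet config path is the one kubespray normally writes and is an assumption otherwise:

readlink -f /etc/resolv.conf                            # often the 127.0.0.53 stub under systemd-resolved
grep -i resolvConf /etc/kubernetes/kubelet-config.yaml
# if it points at the stub, switching it to the real upstream file and restarting the kubelet
# avoids the loop, e.g.:
#   resolvConf: /run/systemd/resolve/resolv.conf
sudo systemctl restart kubelet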

@julienlau

The issues with coredns appeared only after restarting. The initial install went fine.

@floryut
Member

floryut commented Aug 2, 2022

I'll try to spin up an Ubuntu cluster to see if I can reproduce this behavior, but as you're the first to report it, it would be strange if this were a bug in our codebase.

@jqiuyin

jqiuyin commented Oct 24, 2022

From the https://coredns.io/plugins/loop/#troubleshooting link it seems that disabling systemd-resolved is worth a shot...

Hurrah, it works!

I had a similar problem.
This seems to be related to #3979.

@lightoyou

Same problem here on Debian.

sudo systemctl disable systemd-resolved

does not solve the issue...
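
Disabling the unit on its own leaves /etc/resolv.conf pointing at the now-stopped 127.0.0.53 stub, which can keep the loop (or break node DNS entirely). A sketch of the extra steps that are usually needed, with the upstream DNS address left as a placeholder:

sudo systemctl disable --now systemd-resolved.service
readlink -f /etc/resolv.conf                  # if this is /run/systemd/resolve/stub-resolv.conf, replace it
sudo rm /etc/resolv.conf
echo 'nameserver <your-upstream-dns>' | sudo tee /etc/resolv.conf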

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 9, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 8, 2023