Pods failing after restart of VM #8850

Closed
dimakyriakov opened this issue May 20, 2022 · 19 comments
Labels
kind/bug, lifecycle/rotten

Comments

@dimakyriakov

Environment:

  • hardware configuration:

  • OS: Ubuntu 20.04.4 LTS

  • Version of Ansible: ansible 2.10.15

  • Version of Python3: Python 3.8.10

Kubespray version (commit): 2cc5f04

Full inventory with variables:

all:
  hosts:
    node1:
      ansible_host: 192.168.2.211
      ip: 192.168.2.211
      access_ip: 192.168.2.211
    node2:
      ansible_host: 192.168.2.212
      ip: 192.168.2.212
      access_ip: 192.168.2.212
    node3:
      ansible_host: 192.168.2.213
      ip: 192.168.2.213
      access_ip: 192.168.2.213
    node4:
      ansible_host: 192.168.2.214
      ip: 192.168.2.214
      access_ip: 192.168.2.214
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/newCluster/hosts.yaml --become --become-user=root cluster.yml

Output of ansible run:

(screenshots of the ansible run output)

Anything else we need to know:
After I rebooted the VM where the master k8s node was installed, all pods fail to come up with this error: "Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice""

@dimakyriakov added the kind/bug label May 20, 2022
@yankay
Member

yankay commented May 24, 2022

Hi @dimakyriakov. I tried it on Ubuntu 20.04.4 LTS and everything is OK.
According to kubernetes/minikube#5223.
Would you please give me more information about Docker's cgroup driver config and the kubelet's cgroup driver config?
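
A minimal sketch of how both cgroup drivers can be inspected on a node; the kubelet config path below is the one kubespray normally writes and is an assumption for other setups:

# Docker side: active cgroup driver and any explicit setting
docker info --format '{{ .CgroupDriver }}'
grep -ri cgroupdriver /etc/docker/daemon.json /etc/systemd/system/docker.service.d/ 2>/dev/null
# kubelet side (assumed kubespray path; adjust if your config lives elsewhere)
grep -i cgroupDriver /etc/kubernetes/kubelet-config.yaml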

@dimakyriakov
Author

dimakyriakov commented May 25, 2022

Hello, @yankay
Piece of the k8s-cluster.yml file: (screenshot)

docker-options.conf: (screenshot)
I don't have a daemon.json by default.

Kubelet is down after the reboot: (screenshot)

kubelet.env: (screenshot)
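
For context, this particular error text is what dockerd returns when it runs with the systemd cgroup driver but is handed a cgroupfs-style cgroup parent, i.e. Docker and the kubelet/CRI disagree on the cgroup driver. A hedged sketch of aligning both on systemd follows; note that kubespray manages docker-options.conf, so if the driver is already set there it should be changed there rather than duplicated in daemon.json (Docker refuses to start when the same option is set in both places):

# only if the kubelet config says cgroupDriver: systemd while Docker reports cgroupfs
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker kubelet
# if Docker already reports systemd, it is the kubelet/CRI side that needs to move to systemd instead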

@julienlau

julienlau commented Jul 28, 2022

Same problem here on a set of Ubuntu 20.04 desktop VMs with kubespray commit c24a3a3
and Kubernetes version 1.24.3.

The install went fine with the following command on a setup with a single master node and 2 worker nodes.

ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v -e container_manager=docker

Reboot is CHAOS!

  • I was forced to apply these tweaks on the worker nodes for them to restart properly:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
# after each reboot DNS may be broken. If 192.168.122.1 was your DNS:
sudo systemd-resolve --interface enp1s0 --set-dns 192.168.122.1
# if exactly one /etc/kubernetes/$f.* variant exists, back up $f and replace it with that variant
for f in admin.conf controller-manager.conf kubelet.conf scheduler.conf ; do
  sudo [ `ls /etc/kubernetes/$f.* 2>/dev/null | wc -l` -eq 1 ] \
    && sudo cp /etc/kubernetes/$f /etc/kubernetes/bk-$f \
    && sudo cp /etc/kubernetes/$f.* /etc/kubernetes/$f
done
  • I cannot restart my single master node due to this issue with the Docker cgroup slice:
[root@k8s-master-1]> docker start k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"
Error: failed to start containers: k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
journalctl -r -u kubelet
-- Logs begin at Mon 2022-07-25 15:33:06 CEST, end at Thu 2022-07-28 10:23:22 CEST. --
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: I0728 10:23:22.157936    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.146025    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.045172    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.968052    1938 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": dial tcp 127.0.0>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967381    1938 kubelet_node_status.go:70] "Attempting to register node" node="k8s-master-1"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967241    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967069    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.966780    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.957578    1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.944303    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.842241    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.740667    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.640006    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.539524    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.438500    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.337455    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.236156    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.157571    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.135497    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.035193    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.934746    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.833561    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.732640    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.631545    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.530963    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.430640    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.329996    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299770    1938 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-k8s-master-1_kube-sy>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299729    1938 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299700    1938 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299635    1938 remote_runtime.go:212] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create a sandbox for >
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295429    1938 kuberuntime_manager.go:488] "No ready sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-controller-manager-k8>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295032    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295019    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.294948    1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.285246    1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.228558    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.158092    1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.127582    1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"

I think it may be caused by the fact that cgroup v2 is disabled!

ll /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory

etcd is running fine.
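
A small check that can help tell which side is out of step (container name taken from the failing docker start command above): if the stored cgroup parent is a cgroupfs-style path such as /kubepods/... while Docker reports the systemd driver, that matches the "valid slice" error:

docker info --format '{{ .CgroupDriver }}'
docker inspect --format '{{ .HostConfig.CgroupParent }}' k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0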

@julienlau

julienlau commented Jul 28, 2022

Enabling cgroup v2 makes rebooting the master nodes possible... :(
Pre-req: make sure cgroup v2 is enabled on the hosts:

# if file exists cgroupv2 is OK
ll /sys/fs/cgroup/cgroup.controllers
# enable:
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo sed -i -e 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"/' /etc/default/grub
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo update-grub
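
A quick follow-up check after the reboot to confirm the change took effect (stat -fc is GNU coreutils):

# after rebooting:
stat -fc %T /sys/fs/cgroup                                   # prints "cgroup2fs" once the unified hierarchy is mounted
grep -o 'systemd.unified_cgroup_hierarchy=1' /proc/cmdline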

The Kubernetes cluster reboots, but I still have coredns pods that do not restart:

2022/07/28 12:26:10 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
[FATAL] plugin/loop: Loop (169.254.25.10:59357 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4659600271498259777.1850538492869913665."
stream closed

Logs of coredns on the worker node that did not restart show:

2022/07/28 10:52:52 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0

I still need these tweaks on all nodes:

sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
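
A small sketch for checking whether that tweak survived a reboot, using the unit names from this thread:

systemctl is-enabled cri-dockerd.service
systemctl is-active cri-dockerd.service kubelet.service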

@julienlau

julienlau commented Jul 28, 2022

The issue with coredns seems to be linked to #5835.

As @k8s-ci-robot mentioned, Ubuntu 20.04 support is merged in master. In addition to the coredns loop error, there are a couple of other things that can come up depending on your environment:
- IPVS mode not supported with the KVM kernel in Ubuntu 20.04
- Mitogen + Ubuntu 20.04 requires specifying the python interpreter path
- enable_nodelocaldns: false required in some instances

I use KVM...
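
For reference, a sketch of where those three workarounds could be expressed in a kubespray inventory; the group_vars path assumes the default sample layout and should be adapted to your inventory:

# append to the cluster group_vars (path is an assumption)
cat <<'EOF' >> inventory/local/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_mode: iptables      # avoid IPVS with the Ubuntu 20.04 KVM kernel
enable_nodelocaldns: false     # only where needed, per the list above
EOF
# for Mitogen, pin the interpreter in the inventory, e.g.:
#   ansible_python_interpreter: /usr/bin/python3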

@julienlau

enable_nodelocaldns: false does not solve the issue with the coredns crash loop.
The crash loop now occurs only on the master nodes and not on every node.

@julienlau

Using iptables instead of IPVS does not solve the coredns crash loop after reboot.

@julienlau

julienlau commented Jul 28, 2022

From the https://coredns.io/plugins/loop/#troubleshooting link it seems that disabling systemd-resolved is worth a shot...

Hurrah, it works!
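
For completeness, a minimal sketch of what "disabling systemd-resolved" usually amounts to here; the DNS address reuses the libvirt default mentioned earlier in this thread, and the deployment name assumes kubespray's default coredns deployment:

sudo systemctl disable --now systemd-resolved.service
sudo rm /etc/resolv.conf                                     # usually a symlink to the 127.0.0.53 stub
echo 'nameserver 192.168.122.1' | sudo tee /etc/resolv.conf  # point at the upstream DNS instead of the stub
kubectl -n kube-system rollout restart deployment coredns    # let coredns pick up the new resolv.conf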

@floryut
Member

floryut commented Jul 29, 2022

Great that you found a way to fix your coredns issue 👍

@julienlau

@floryut do you know if the test suite includes a simple restart of the cluster?

@floryut
Member

floryut commented Aug 1, 2022

@floryut do you know if the test suite includes a simple restart of the cluster?

They do not; it would be possible to add one though. But come to think of your issue, it is strange that you have to disable systemd-resolved. To get coredns working, pointing the resolv.conf to /run/systemd/resolve/resolv.conf was enough to fix the coredns issue on Ubuntu, AFAIK.
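
A sketch of how that alternative could be checked and applied on a node without disabling systemd-resolved; the kubelet config path is the one kubespray normally writes and is an assumption otherwise:

readlink -f /etc/resolv.conf                            # often the 127.0.0.53 stub under systemd-resolved
grep -i resolvConf /etc/kubernetes/kubelet-config.yaml
# if it points at the stub, switching it to the real upstream file and restarting the kubelet
# avoids the loop, e.g.:
#   resolvConf: /run/systemd/resolve/resolv.conf
sudo systemctl restart kubelet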

@julienlau

The issues with coredns appeared only after restarting. The initial install went fine.

@floryut
Member

floryut commented Aug 2, 2022

I'll try to spin up an Ubuntu cluster to see if I can reproduce this behavior, but as you're the first to report it, it would be strange if this were a bug in our codebase.

@jqiuyin

jqiuyin commented Oct 24, 2022

From the https://coredns.io/plugins/loop/#troubleshooting link it seems that disabling systemd-resolved is worth a shot...

Hurrah, it works!

I had a similar problem.
This seems to be related to #3979.

@lightoyou

Same problem here on Debian.

sudo systemctl disable systemd-resolved

does not solve the issue...
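
Disabling the unit on its own leaves /etc/resolv.conf pointing at the now-stopped 127.0.0.53 stub, which can keep the loop (or break node DNS entirely). A sketch of the extra steps that are usually needed, with the upstream DNS address left as a placeholder:

sudo systemctl disable --now systemd-resolved.service
readlink -f /etc/resolv.conf                  # if this is /run/systemd/resolve/stub-resolv.conf, replace it
sudo rm /etc/resolv.conf
echo 'nameserver <your-upstream-dns>' | sudo tee /etc/resolv.conf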

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 9, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 8, 2023