
runc hang on init when containerd set up #4481

Open

smileusd opened this issue Oct 28, 2024 · 9 comments
@smileusd

Description

I found some D-state processes on a node where containerd is set up:

5 D root      14378  13862  0  80   0 - 269979 refrig Sep30 ?       00:00:00 /usr/local/bin/runc init
5 D root      14392  13587  0  80   0 - 270107 refrig Sep30 ?       00:00:00 /usr/local/bin/runc init
0 S root     278169 276735  0  80   0 -  1007 pipe_r 00:44 pts/2    00:00:00 grep --color=auto  D 
root@hostname:~# cat /proc/14378/stack
[<0>] __refrigerator+0x4c/0x130
[<0>] unix_stream_data_wait+0x1fa/0x210
[<0>] unix_stream_read_generic+0x50d/0xa60
[<0>] unix_stream_recvmsg+0x88/0x90
[<0>] sock_recvmsg+0x70/0x80
[<0>] sock_read_iter+0x8f/0xf0
[<0>] new_sync_read+0x180/0x190
[<0>] vfs_read+0xff/0x1a0
[<0>] ksys_read+0xb1/0xe0
[<0>] __x64_sys_read+0x19/0x20
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
root@hostname:~# uptime
 01:32:12 up 28 days, 38 min,  2 users,  load average: 29.57, 31.53, 31.98
root@hostname:~# systemctl status containerd
● containerd.service - containerd container runtime
     Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-09-30 00:53:45 -07; 4 weeks 0 days ago
root@hostname:~# ps -eo pid,lstart,cmd,state |grep 14378
 14378 Mon Sep 30 00:53:38 2024 /usr/local/bin/runc init    D
root@hostname:~# stat /var/containerd/containerd.sock
  File: /var/containerd/containerd.sock
  Size: 0               Blocks: 0          IO Block: 4096   socket
Device: 10303h/66307d   Inode: 1082291752  Links: 1
Access: (0660/srw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-10-28 00:45:08.361633324 -0700
Modify: 2024-09-30 00:53:45.666038162 -0700
Change: 2024-09-30 00:53:45.666038162 -0700
 Birth: 2024-09-30 00:53:45.666038162 -0700

The runc init process was started before /var/containerd/containerd.sock changed. I think there is some race here? But I think the runc process should time out waiting and exit.

Steps to reproduce the issue

No response

Describe the results you received and expected

The runc init process hangs. Expected: no D-state processes.

What version of runc are you using?

~# runc --version
runc version 1.1.2
commit: c4f88bc9
spec: 1.0.2-dev
go: go1.17.13
libseccomp: 2.5.3

Host OS information

~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
BUILD_ID="ubuntu-240918-061134"

Host kernel information

~# uname -a
Linux tess-node-ttbts-tess134.stratus.lvs.ebay.com 5.15.0-26-generic #26 SMP Wed Sep 18 09:16:49 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@smileusd (Author)

/kind bug

@lifubang (Member) commented Oct 28, 2024

> runc version 1.1.2

Could you please update runc to v1.1.14 to check whether this situation exists or not?
https://github.com/opencontainers/runc/releases/tag/v1.1.14

@kolyshkin (Contributor)

This probably means runc was killed in the middle of container creation, and thus its child was left behind. I vaguely remember we did something about it, so yes, it makes sense to try the latest runc 1.2.0 or a newer 1.1.x release (the latest being 1.1.15 ATM).

@cyphar (Member) commented Oct 29, 2024

Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?

@kolyshkin (Contributor)

> Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?

Right! There were fixes in #3223, but those made it into v1.1.0. We might have some more fixes on top of this, though; plus, I guess, someone can freeze a cgroup mid-flight, resulting in the same stuck runc init.

@smileusd can you check if the cgroups these runc init processes are in are in a frozen state?
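
For anyone who wants to script that check, here is a rough diagnostic sketch (not part of runc; it assumes cgroups are mounted at the usual /sys/fs/cgroup locations) that reads /proc/<pid>/cgroup and prints the freezer state of the cgroup a process lives in, for both cgroup v1 and cgroup v2:

```go
// check_frozen.go: rough diagnostic helper (not runc code). Given a PID, it
// reads /proc/<pid>/cgroup and reports the freezer state of the cgroup the
// process belongs to: cgroup v1 (freezer.state) or cgroup v2 (cgroup.freeze).
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: check_frozen <pid>")
		os.Exit(1)
	}
	f, err := os.Open(filepath.Join("/proc", os.Args[1], "cgroup"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line looks like "hierarchy-ID:controller-list:cgroup-path".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		ctrls, path := parts[1], parts[2]
		switch {
		case ctrls == "": // cgroup v2 unified hierarchy
			if data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", path, "cgroup.freeze")); err == nil {
				fmt.Printf("v2 %s cgroup.freeze=%s", path, data) // "1" means frozen
			}
		case strings.Contains(ctrls, "freezer"): // cgroup v1 freezer controller
			if data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup/freezer", path, "freezer.state")); err == nil {
				fmt.Printf("v1 %s freezer.state=%s", path, data) // "FROZEN" means frozen
			}
		}
	}
}
```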

@cheese commented Nov 1, 2024

Met the same issue with runc 1.1.12 and k3s 1.29.4:

# cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfc602c7f_74d9_4696_bd15_5e3a6433e012.slice/cri-containerd-781fbd52240e017d380aa3cccf42ef379a50c32c703363d0b1e9c1fb10bf17b1.scope/freezer.state
FROZEN

@wxx213 commented Nov 11, 2024

> Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?
>
> Right! There were fixes in #3223, but those made it into v1.1.0. We might have some more fixes on top of this, though; plus, I guess, someone can freeze a cgroup mid-flight, resulting in the same stuck runc init.
>
> @smileusd can you check if the cgroups these runc init processes are in are in a frozen state?

The runc process may be killed because of a context timeout (the gRPC call timeout from kubelet) right after it has set FROZEN for the container cgroup; we hit this case under high host load even though our runc has this fix.

@wxx213 commented Nov 11, 2024

@kolyshkin runc may need to consider the cgroup FROZEN state when deleting a container.
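
As a rough illustration of that idea (and as a manual way to unstick an already-frozen runc init), here is a sketch that simply thaws a given cgroup directory so the frozen processes can run again; the path argument stands for the container's cgroup directory on the host, and this is not how runc's own delete path is implemented:

```go
// thaw.go: rough sketch (not runc code) that thaws a cgroup so frozen
// processes such as a stuck "runc init" can run and exit. It tries the
// cgroup v2 interface first and falls back to the cgroup v1 freezer.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func thaw(cgroupDir string) error {
	// cgroup v2: writing "0" to cgroup.freeze thaws the subtree.
	if err := os.WriteFile(filepath.Join(cgroupDir, "cgroup.freeze"), []byte("0"), 0o644); err == nil {
		return nil
	}
	// cgroup v1: writing "THAWED" to freezer.state thaws the cgroup.
	return os.WriteFile(filepath.Join(cgroupDir, "freezer.state"), []byte("THAWED"), 0o644)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: thaw <cgroup directory>")
		os.Exit(1)
	}
	if err := thaw(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "thaw failed:", err)
		os.Exit(1)
	}
}
```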

@jianghao65536

@kolyshkin If you want to replicate this issue, you can add a time.Sleep call before this line of code, making sure the sleep duration is longer than the context's timeout.
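
For context, a standalone sketch (plain Go, not runc or kubelet code) of the failure shape being described: the caller's context deadline expires and the child process, standing in for runc, is killed in the middle of its work, i.e. after it has frozen the cgroup but before it thaws it:

```go
// timeout_kill.go: standalone illustration of a caller's context timeout
// killing a child process mid-operation. The child here is just "sleep 5";
// in the real scenario it is runc, killed between the freeze and thaw steps,
// which leaves its descendants stuck in a frozen cgroup.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// The caller (e.g. the gRPC path from kubelet) allows the operation 1s.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	// The child takes longer than the deadline, like a runc that is delayed
	// (or artificially slowed with time.Sleep) after freezing the cgroup.
	cmd := exec.CommandContext(ctx, "sleep", "5")
	err := cmd.Run()
	fmt.Println("child exited with:", err) // typically "signal: killed"
}
```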
