
runc hang on init when containerd set up #4481

Open

smileusd opened this issue Oct 28, 2024 · 9 comments
@smileusd

Description

I found some D-state processes on a node where containerd is set up:

5 D root      14378  13862  0  80   0 - 269979 refrig Sep30 ?       00:00:00 /usr/local/bin/runc init
5 D root      14392  13587  0  80   0 - 270107 refrig Sep30 ?       00:00:00 /usr/local/bin/runc init
0 S root     278169 276735  0  80   0 -  1007 pipe_r 00:44 pts/2    00:00:00 grep --color=auto  D 
root@hostname:~# cat /proc/14378/stack
[<0>] __refrigerator+0x4c/0x130
[<0>] unix_stream_data_wait+0x1fa/0x210
[<0>] unix_stream_read_generic+0x50d/0xa60
[<0>] unix_stream_recvmsg+0x88/0x90
[<0>] sock_recvmsg+0x70/0x80
[<0>] sock_read_iter+0x8f/0xf0
[<0>] new_sync_read+0x180/0x190
[<0>] vfs_read+0xff/0x1a0
[<0>] ksys_read+0xb1/0xe0
[<0>] __x64_sys_read+0x19/0x20
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
root@hostname:~# uptime
 01:32:12 up 28 days, 38 min,  2 users,  load average: 29.57, 31.53, 31.98
root@hostname:~# systemctl status containerd
● containerd.service - containerd container runtime
     Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-09-30 00:53:45 -07; 4 weeks 0 days ago
root@hostname:~# ps -eo pid,lstart,cmd,state |grep 14378
 14378 Mon Sep 30 00:53:38 2024 /usr/local/bin/runc init    D
root@hostname:~# stat /var/containerd/containerd.sock
  File: /var/containerd/containerd.sock
  Size: 0               Blocks: 0          IO Block: 4096   socket
Device: 10303h/66307d   Inode: 1082291752  Links: 1
Access: (0660/srw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-10-28 00:45:08.361633324 -0700
Modify: 2024-09-30 00:53:45.666038162 -0700
Change: 2024-09-30 00:53:45.666038162 -0700
 Birth: 2024-09-30 00:53:45.666038162 -0700

The runc init process was started before /var/containerd/containerd.sock changed. I think there is some race here? But I think the runc process should time out waiting and exit.

Steps to reproduce the issue

No response

Describe the results you received and expected

The runc init process hangs. Expected: no D-state processes.

What version of runc are you using?

~# runc --version
runc version 1.1.2
commit: c4f88bc9
spec: 1.0.2-dev
go: go1.17.13
libseccomp: 2.5.3

Host OS information

~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
BUILD_ID="ubuntu-240918-061134"

Host kernel information

~# uname -a
Linux tess-node-ttbts-tess134.stratus.lvs.ebay.com 5.15.0-26-generic #26 SMP Wed Sep 18 09:16:49 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@smileusd (Author)

/kind bug

@lifubang (Member) commented Oct 28, 2024

> runc version 1.1.2

Could you please update runc to v1.1.14 to check whether this situation exists or not?
https://github.com/opencontainers/runc/releases/tag/v1.1.14

@kolyshkin (Contributor)

This probably means runc was killed in the middle of container creation, and thus its child was left behind. I vaguely remember we did something about it, so yes, it makes sense to try the latest runc 1.2.0 or a newer 1.1.x release (the latest being 1.1.15 ATM).

@cyphar (Member) commented Oct 29, 2024

Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?

@kolyshkin (Contributor)

> Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?

Right! There were fixes in #3223, but those made it into v1.1.0. We might have some more fixes on top of this, though; plus, I guess, someone can freeze a cgroup mid-flight, resulting in the same stuck runc init.

@smileusd can you check if the cgroups these runc init processes are in are in a frozen state?
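
For anyone who wants to script that check, here is a rough diagnostic sketch (not part of runc; it assumes cgroups are mounted at the usual /sys/fs/cgroup locations) that reads /proc/<pid>/cgroup and prints the freezer state of the cgroup a process lives in, for both cgroup v1 and cgroup v2:

```go
// check_frozen.go: rough diagnostic helper (not runc code). Given a PID, it
// reads /proc/<pid>/cgroup and reports the freezer state of the cgroup the
// process belongs to: cgroup v1 (freezer.state) or cgroup v2 (cgroup.freeze).
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: check_frozen <pid>")
		os.Exit(1)
	}
	f, err := os.Open(filepath.Join("/proc", os.Args[1], "cgroup"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line looks like "hierarchy-ID:controller-list:cgroup-path".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		ctrls, path := parts[1], parts[2]
		switch {
		case ctrls == "": // cgroup v2 unified hierarchy
			if data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", path, "cgroup.freeze")); err == nil {
				fmt.Printf("v2 %s cgroup.freeze=%s", path, data) // "1" means frozen
			}
		case strings.Contains(ctrls, "freezer"): // cgroup v1 freezer controller
			if data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup/freezer", path, "freezer.state")); err == nil {
				fmt.Printf("v1 %s freezer.state=%s", path, data) // "FROZEN" means frozen
			}
		}
	}
}
```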

@cheese commented Nov 1, 2024

Met the same issue with runc 1.1.12 and k3s 1.29.4:

# cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfc602c7f_74d9_4696_bd15_5e3a6433e012.slice/cri-containerd-781fbd52240e017d380aa3cccf42ef379a50c32c703363d0b1e9c1fb10bf17b1.scope/freezer.state
FROZEN

@wxx213 commented Nov 11, 2024

> Being stuck in __refrigerator means that the code is in a frozen cgroupv2 cgroup. I'm pretty sure we had some patches in the past 2 years that fixed this issue?
>
> Right! There were fixes in #3223, but those made it into v1.1.0. We might have some more fixes on top of this, though; plus, I guess, someone can freeze a cgroup mid-flight, resulting in the same stuck runc init.
>
> @smileusd can you check if the cgroups these runc init processes are in are in a frozen state?

The runc process may be killed because of a context timeout (the gRPC call timeout from kubelet) right after it has set FROZEN for the container cgroup; we hit this case under high host load even though our runc has this fix.

@wxx213 commented Nov 11, 2024

@kolyshkin runc may need to consider the cgroup FROZEN state when deleting a container.
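
As a rough illustration of that idea (and as a manual way to unstick an already-frozen runc init), here is a sketch that simply thaws a given cgroup directory so the frozen processes can run again; the path argument stands for the container's cgroup directory on the host, and this is not how runc's own delete path is implemented:

```go
// thaw.go: rough sketch (not runc code) that thaws a cgroup so frozen
// processes such as a stuck "runc init" can run and exit. It tries the
// cgroup v2 interface first and falls back to the cgroup v1 freezer.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func thaw(cgroupDir string) error {
	// cgroup v2: writing "0" to cgroup.freeze thaws the subtree.
	if err := os.WriteFile(filepath.Join(cgroupDir, "cgroup.freeze"), []byte("0"), 0o644); err == nil {
		return nil
	}
	// cgroup v1: writing "THAWED" to freezer.state thaws the cgroup.
	return os.WriteFile(filepath.Join(cgroupDir, "freezer.state"), []byte("THAWED"), 0o644)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: thaw <cgroup directory>")
		os.Exit(1)
	}
	if err := thaw(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "thaw failed:", err)
		os.Exit(1)
	}
}
```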

@jianghao65536

@kolyshkin If you want to replicate this issue, you can add a time.Sleep call before this line of code, making sure the sleep duration is longer than the context's timeout.
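
For context, a standalone sketch (plain Go, not runc or kubelet code) of the failure shape being described: the caller's context deadline expires and the child process, standing in for runc, is killed in the middle of its work, i.e. after it has frozen the cgroup but before it thaws it:

```go
// timeout_kill.go: standalone illustration of a caller's context timeout
// killing a child process mid-operation. The child here is just "sleep 5";
// in the real scenario it is runc, killed between the freeze and thaw steps,
// which leaves its descendants stuck in a frozen cgroup.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// The caller (e.g. the gRPC path from kubelet) allows the operation 1s.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	// The child takes longer than the deadline, like a runc that is delayed
	// (or artificially slowed with time.Sleep) after freezing the cgroup.
	cmd := exec.CommandContext(ctx, "sleep", "5")
	err := cmd.Run()
	fmt.Println("child exited with:", err) // typically "signal: killed"
}
```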
