runc run: fix mount leak #4417

kolyshkin · 2024-09-26T23:54:30Z

When preparing to mount container root, we need to make its parent mount private (i.e. disable mount propagation), otherwise the new in-container mounts are leaked to the host.

To find the parent mount, we used to read mountinfo and pick the longest entry that can be a parent of the container root directory.

Unfortunately, due to a kernel bug in all Linux kernels older than v5.8 (see [1], [2]), mountinfo sometimes can't be read in its entirety. In this case, getParentMount may occasionally return the wrong parent mount.

As a result, we do not change the mount propagation to private, and container mounts are leaked.

Alas, we cannot fix the kernel, and reading mountinfo a few times to ensure its consistency (as is done in, say, Kubernetes) does not look like a good solution for performance reasons.

Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it).
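The walk described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the names `parentMountPrivate`, `remountFunc`, `mountPrivate` and `fakeRemounter` are hypothetical, and the sketch assumes that EINVAL means "not a mount point, keep going up", that success means the mount point was found, and that any other error is reported to the caller. The remount operation is injected so the loop can be exercised without root.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"syscall"
)

// remountFunc tries to make the mount at path private.
type remountFunc func(path string) error

// mountPrivate is the real operation (Linux only, needs CAP_SYS_ADMIN).
// It is not called in the demo below.
func mountPrivate(path string) error {
	return syscall.Mount("", path, "", syscall.MS_PRIVATE, "")
}

// parentMountPrivate walks up from path, calling remount on each directory,
// and returns the first directory for which remount succeeds.
// Assumption: EINVAL from remount means "path is not a mount point".
func parentMountPrivate(path string, remount remountFunc) (string, error) {
	if !filepath.IsAbs(path) {
		return "", fmt.Errorf("path %q is not absolute", path)
	}
	path = filepath.Clean(path)
	for {
		err := remount(path)
		if err == nil {
			return path, nil // found the mount point and made it private
		}
		// Stop on any error other than EINVAL, or when there is no parent left.
		if !errors.Is(err, syscall.EINVAL) || path == "/" {
			return path, fmt.Errorf("remount-private %q: %w", path, err)
		}
		path = filepath.Dir(path)
	}
}

// fakeRemounter simulates the kernel for a hypothetical set of mount points:
// it succeeds on a mount point and returns EINVAL everywhere else.
func fakeRemounter(mountPoints ...string) remountFunc {
	set := make(map[string]bool, len(mountPoints))
	for _, m := range mountPoints {
		set[m] = true
	}
	return func(path string) error {
		if set[path] {
			return nil
		}
		return syscall.EINVAL
	}
}

func main() {
	remount := fakeRemounter("/", "/var/lib")
	got, err := parentMountPrivate("/var/lib/containers/rootfs", remount)
	fmt.Println(got, err) // prints "/var/lib <nil>"
}
```

Passing `mountPrivate` instead of the fake would perform the actual propagation change, one mount(2) call per directory level, without reading mountinfo at all.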

Fixes: #2404.

cyphar · 2024-09-27T13:46:00Z

What happens if the parent mount is a bind-mount and the container rootfs is also a bind-mount from the same filesystem? Unless I'm mistaken, this device-checking logic will skip those mounts and produce errors when we pivot_root.
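To make the concern concrete, here is a small sketch of a device-based boundary search. The revision under review is not shown in this thread, so this assumes it compared st_dev of a directory with that of its parent; `mountByDev`, `devFunc`, `statDev` and `fakeDevs` are hypothetical names, and the directory layout is invented for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"syscall"
)

// devFunc returns the device ID (st_dev) of path.
type devFunc func(path string) (uint64, error)

// statDev is the real lookup via stat(2). It is not called in the demo below.
func statDev(path string) (uint64, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return 0, err
	}
	return uint64(st.Dev), nil
}

// mountByDev walks up from path and returns the first directory whose
// parent lives on a different device, or "/" if no such directory exists.
func mountByDev(path string, dev devFunc) (string, error) {
	path = filepath.Clean(path)
	cur, err := dev(path)
	if err != nil {
		return "", err
	}
	for path != "/" {
		parent := filepath.Dir(path)
		pdev, err := dev(parent)
		if err != nil {
			return "", err
		}
		if pdev != cur {
			return path, nil // device changes here, so path looks like a mount point
		}
		path, cur = parent, pdev
	}
	return "/", nil
}

// fakeDevs serves device IDs from a fixed table instead of the filesystem.
func fakeDevs(devs map[string]uint64) devFunc {
	return func(path string) (uint64, error) {
		d, ok := devs[path]
		if !ok {
			return 0, syscall.ENOENT
		}
		return d, nil
	}
}

func main() {
	// Invented layout: /srv is a separate filesystem (device 2), and
	// /srv/ctr is a bind mount of another directory on that same
	// filesystem, so it also reports device 2.
	devs := fakeDevs(map[string]uint64{
		"/":               1,
		"/srv":            2,
		"/srv/ctr":        2,
		"/srv/ctr/rootfs": 2,
	})
	got, err := mountByDev("/srv/ctr/rootfs", devs)
	fmt.Println(got, err) // prints "/srv <nil>"
}
```

The search reports /srv and never notices the bind mount at /srv/ctr, because a bind mount from the same filesystem has the same st_dev as its surroundings.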

There are a few ways we could handle this AFAICS:

1. openat2(current, "..", RESOLVE_NO_XDEV) to find the crossing -- Linux 5.6.
2. statx(STATX_MNT_ID) to get the mount ID and use that to find the crossing -- Linux 5.8. (Sadly these mount IDs are not completely unique -- that was fixed in Linux 6.8 with STATX_MNT_ID_UNIQUE. But that should still work for us.)
3. name_to_handle_at with a dummy fhandle to get the mount ID -- available for a long time but only works on filesystems that can do NFS exports (at least until AT_HANDLE_FID in Linux 6.5).

Maybe we can try to do (1) for newer kernels and fall back to (3) for older kernels, with a final fallback to st_dev if nothing else works...

kolyshkin · 2024-10-02T01:29:53Z

I just came up with a stupid but simpler alternative.

When preparing to mount container root, we need to make its parent mount private (i.e. disable propagation), otherwise the new in-container mounts are leaked to the host.

To find the parent mount, we used to read mountinfo and pick the longest entry that can be a parent of the container root directory.

Unfortunately, due to a kernel bug in all Linux kernels older than v5.8 (see [1], [2]), mountinfo sometimes can't be read in its entirety. In this case, getParentMount may occasionally return the wrong parent mount.

As a result, we do not change the mount propagation to private, and container mounts are leaked.

Alas, we cannot fix the kernel, and reading mountinfo a few times to ensure its consistency (as is done in, say, Kubernetes) does not look like a good solution for performance reasons.

Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it).

Fixes issue 2404.

[1]: https://github.com/kolyshkin/procfs-test
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin · 2024-10-02T21:24:14Z

@cyphar WDYT?

Ah, I just found that crun does the same thing. This probably means the approach is sound (yet I'm a tad unhappy I'm not the first one who came up with this idea 😿).

kolyshkin · 2024-10-02T21:44:43Z

I also think we can make a 1.1 backport, since the issue, although rare, is quite nasty, and the fix seems to be simple.

cyphar · 2024-10-02T23:32:50Z

I checked and do_change_type (the propagation change logic) actually does a very cheap check for whether the path is a mountpoint before taking namespace_sem or mount_lock, so this approach is basically what you'd get from openat2(RESOLVE_NO_XDEV) except it works on older kernels.

kolyshkin · 2024-10-03T17:58:03Z

1.1 backport: #4425

kolyshkin mentioned this pull request Sep 26, 2024

runc has problems due to leaked mount information #2404

Closed

kolyshkin force-pushed the fix-mount-leak branch from 1263560 to b9ec48e Compare September 27, 2024 00:01

kolyshkin force-pushed the fix-mount-leak branch 2 times, most recently from dc0ca3a to f3f7aa4 Compare October 2, 2024 01:29

kolyshkin force-pushed the fix-mount-leak branch from f3f7aa4 to 08105c9 Compare October 2, 2024 01:33

kolyshkin force-pushed the fix-mount-leak branch from 08105c9 to 13a6f56 Compare October 2, 2024 20:58

kolyshkin marked this pull request as ready for review October 2, 2024 21:17

kolyshkin requested review from lifubang, thaJeztah, cyphar and AkihiroSuda October 2, 2024 21:43

cyphar approved these changes Oct 2, 2024

kolyshkin mentioned this pull request Oct 3, 2024

[1.1] runc run: fix mount leak #4425

Merged

kolyshkin added the backport/1.1-done label (a PR in main branch which has been backported to release-1.1) Oct 3, 2024

AkihiroSuda approved these changes Oct 4, 2024

AkihiroSuda merged commit db25439 into opencontainers:main Oct 4, 2024
42 checks passed

github-actions bot mentioned this pull request Oct 27, 2024

Bump runc from v1.1.13 to v1.2.0 kokyhm/kubespray#60

Open

lifubang mentioned this pull request Nov 1, 2024

rootfs: make pivot_root(2) dance handle initramfs case #4434

Draft
