libct/nsenter: become root after joining userns #4473

lifubang · 2024-10-24T14:57:39Z

Containerd pre-creates userns and netns before calling runc, which
results in the current code not working when SELinux is enabled,
resulting in the following error:

runc create failed: unable to start container process: error during
container init: error mounting "mqueue" to rootfs at "/dev/mqueue":
setxattr /path/to/rootfs/dev/mqueue: operation not permitted

The solution is to become root in the user namespace right after
we join it.

Fixes #4466.

Co-authored-by: Wei Fu [email protected]
Co-authored-by: Kir Kolyshkin [email protected]
Co-authored-by: Aleksa Sarai [email protected]
Signed-off-by: lifubang [email protected]

rata

@lifubang nice finding! I haven't checked the code yet. But here are some general comments.

This seems to be what crun does too! Here it skips the userns: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L3808. And then it joins it: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4615-L4617

However, crun is also handling a case where the container needs to join a PID namespace: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4562-L4568. I guess we need to do the same?

I'm unsure why we need to join the userns afterwards, though. Do you know why it is needed? Skimming through the man 7 user_namespaces I didn't see any mention

AkihiroSuda

Thanks

libcontainer/nsenter/nsexec.c

fuweid · 2024-10-24T21:52:17Z

libcontainer/nsenter/nsexec.c

@@ -501,6 +506,12 @@ void join_namespaces(char *nslist)
 		struct namespace_t *ns = &namespaces[i];
 		int flag = nsflag(ns->type);

+		if (flag == CLONE_NEWUSER) {


I try to use retsnoop to check and found this in v6.5 kernel

// retsnoop -e '*setxattr' -e 'inode_owner_or_capable' -e 'ns_capable' -E -n 'runc:[2:INIT]' -T 21:47:54.401485 -> 21:47:54.401526 TID/PID 69055/69055 (runc:[2:INIT]/runc:[2:INIT]): FUNCTION CALLS RESULT DURATION ---------------------------------------------------- -------- -------- → __x64_sys_setxattr → path_setxattr → setxattr → do_setxattr → vfs_setxattr → security_inode_setxattr → selinux_inode_setxattr ↔ inode_owner_or_capable [false] 1.100us ← selinux_inode_setxattr [-EPERM] 2.700us ← security_inode_setxattr [-EPERM] 8.900us ← vfs_setxattr [-EPERM] 15.300us ← do_setxattr [-EPERM] 21.400us ← setxattr [-EPERM] 24.000us ← path_setxattr [-EPERM] 29.500us ← __x64_sys_setxattr [-EPERM] 31.700us

https://github.com/torvalds/linux/blob/2dde18cd1d8fac735875f2e4987f11817cc0bc2c/security/selinux/hooks.c#L3151

Is it related to idmap here?

Maybe the net ns path created in containerd is not owned by the user ns?

the path is about rootfs/dev/mqueue instead of net ns.
The net ns is created by new pod user ns. (The pinned net ns path is created by containerd's user ns instead of pod user ns)

The patch works on my local. But it's still mystery to me.

Based on your patch, the change works on my local as well.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c index 74e15b96..350a17fe 100644 --- a/libcontainer/nsenter/nsexec.c +++ b/libcontainer/nsenter/nsexec.c @@ -505,6 +505,11 @@ void join_namespaces(char *nslist) if (setns(ns->fd, flag) < 0) bail("failed to setns into %s namespace", ns->type); + if (flag == CLONE_NEWUSER) { + if (setresuid(0, 0, 0) < 0) + bail("failed to become root in user namespace"); + } + close(ns->fd); }

Based on result of inode_owner_or_capable in linux, it looks like we use wrong fsuid to create /dev/, which isn't align with user mapping.

I am trying this in containerd/containerd#10893 and will try locally, too.

@lifubang I'm using Ubuntu 22'04 with 6.5 kernel. Would you mind to use retsnoop to show the stack trace ? Thanks

I don't verify it under rootless scenario. Sorry

Works for me locally in Fedora 40 Vagrant with SELinux enabled

[root@localhost containerd]# runc --version runc version 1.2.0+dev commit: v1.0.0-1706-g7169f2d9 spec: 1.2.0 go: go1.23.2 libseccomp: 2.5.5 [root@localhost containerd]# sestatus SELinux status: enabled SELinuxfs mount: /sys/fs/selinux SELinux root directory: /etc/selinux Loaded policy name: targeted Current mode: enforcing Mode from config file: enforcing Policy MLS status: enabled Policy deny_unknown status: allowed Memory protection checking: actual (secure) Max kernel policy version: 33 [root@localhost containerd]# CONTAINERD_RUNTIME=io.containerd.runc.v2 FOCUS=TestPodUserNS ./script/test/cri-integration.sh .... --- PASS: TestPodUserNS (5.24s) --- PASS: TestPodUserNS/volumes_permissions (1.32s) --- PASS: TestPodUserNS/fails_with_several_mappings (0.01s) --- PASS: TestPodUserNS/userns_uid_mapping (1.34s) --- PASS: TestPodUserNS/userns_gid_mapping (1.31s) --- PASS: TestPodUserNS/rootfs_permissions (1.25s) PASS

Oh, yes, it can work, I put the code in the wrong place.
So, we should change this PR to let the change as little as possible?

Changed.
Thanks all of you @fuweid, @kolyshkin, @rata, and @AkihiroSuda for test.

lifubang · 2024-10-24T22:22:31Z

nice finding!

Thanks, the thinking is from your clearly comments: #4466 (comment)

I'm unsure why we need to join the userns afterwards, though. Do you know why it is needed? Skimming through the man 7 user_namespaces I didn't see any mention

I think it just like what we do for clone new user ns, we should join namespaces before unshare new user ns, or else there may be some namespaces waiting join are not owned by the user namespace.

I guess we need to do the same?

After I see the crun code, I think it's not needed here, it's about hook run, I don't see which hook should be running when we are joining/unsharing namespaces. But maybe I'm wrong, I think if there is a bug to ref this, it will be more clear. Anyway, I think we can deal with it in a separate PR once we need it.

cyphar · 2024-10-25T01:41:39Z

Out of interest, does this patch also work?

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 74e15b96d5fe..51ee474b408a 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -444,9 +444,9 @@ void nl_free(struct nlconfig_t *config)
 	free(config->data);
 }
 
-void join_namespaces(char *nslist)
+int join_namespaces(char *nslist)
 {
-	int num = 0, i;
+	int num = 0, joined_mask = 0, i;
 	char *saveptr = NULL;
 	char *namespace = strtok_r(nslist, ",", &saveptr);
 	struct namespace_t {
@@ -505,10 +505,12 @@ void join_namespaces(char *nslist)
 		if (setns(ns->fd, flag) < 0)
 			bail("failed to setns into %s namespace", ns->type);
 
+		joined_mask |= flag;
 		close(ns->fd);
 	}
 
 	free(namespaces);
+	return joined_mask;
 }
 
 static inline int sane_kill(pid_t pid, int signum)
@@ -829,6 +831,7 @@ void nsexec(void)
 	case STAGE_CHILD:{
 			pid_t stage2_pid = -1;
 			enum sync_t s;
+			int joined_mask = 0;
 
 			/* For debugging. */
 			current_stage = STAGE_CHILD;
@@ -849,7 +852,7 @@ void nsexec(void)
 			 * using cmsg(3) but that's just annoying.
 			 */
 			if (config.namespaces)
-				join_namespaces(config.namespaces);
+				joined_mask = join_namespaces(config.namespaces);
 
 			/*
 			 * Deal with user namespaces first. They are quite special, as they
@@ -907,7 +910,13 @@ void nsexec(void)
 					if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
 						bail("failed to re-set process as non-dumpable");
 				}
+			}
 
+			/*
+			 * If we swapped user namespaces at all, become root in the
+			 * namespace proper.
+			 */
+			if ((joined_mask | config.cloneflags) & CLONE_NEWUSER) {
 				/* Become root in the namespace proper. */
 				if (setresuid(0, 0, 0) < 0)
 					bail("failed to become root in user namespace");

This makes it somewhat clearer that the bug was that we forgot to setresuid in the CLONE_NEWUSER case and unifies the logic.

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

Also we need to adjust join_namespaces in a future patch to try to join the namespaces twice (to match nsenter -- see #4390) and the joined_mask is going to useful for that. But that's not too important.

libcontainer/nsenter/nsexec.c

lifubang · 2024-10-25T01:54:11Z

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

SGTM, I'll try it later.

lifubang · 2024-10-25T02:16:07Z

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

SGTM, I'll try it later.

Unfortunately, it can't fix this issue.

kolyshkin · 2024-10-25T04:49:21Z

It would be nice to have a test case. I'll try to come up with one tomorrow. We test on Fedora (not sure about selinux though) so we may be able to repro it.

cyphar · 2024-10-25T05:03:10Z

Yeah, I think doing setresuid immediately is probably a better solution anyway... Sorry for the noise.

Containerd pre-creates userns and netns before calling runc, which results in the current code not working when SELinux is enabled, resulting in the following error: > runc create failed: unable to start container process: error during container init: error mounting "mqueue" to rootfs at "/dev/mqueue": setxattr /path/to/rootfs/dev/mqueue: operation not permitted The solution is to become root in the user namespace right after we join it. Fixes opencontainers#4466. Co-authored-by: Wei Fu <[email protected]> Co-authored-by: Kir Kolyshkin <[email protected]> Co-authored-by: Aleksa Sarai <[email protected]> Signed-off-by: lifubang <[email protected]>

Signed-off-by: lifubang <[email protected]>

lifubang · 2024-10-25T06:16:15Z

It would be nice to have a test case.

Has add one test, I can confirm that this test is similar to #4466, it will be failure without this patch.

cyphar

LGTM. Thanks @lifubang!

cyphar · 2024-10-25T07:19:52Z

I'm quite surprised we never saw this issue before -- AFAICS this bug (not setting the uid/gid after joining) has existed for a long time (in 2016, when I did the somewhat-large rewrite of nsexec.c in 2cd9c31 I copied the old logic that had this bug as well).

I guess we didn't see it because it is very specific to mqueue (mqueue has a per-namespace superblock that is allocated at CLONE_NEWIPC time, and so the credentials you have when you do unshare(CLONE_NEWIPC) influence a bunch of attributes of the mount) but I would've expected it to pop up before. Since this is only an issue with newer SELinux-heavy distros, maybe there was a config change or kernel change that caused this behaviour to become a problem? Odd. (Or alternatively, RedHat now prefers crun and so they don't test with runc anymore?)

fuweid · 2024-10-25T16:54:27Z

Thank you @lifubang !

kolyshkin · 2024-10-25T18:45:38Z

I guess we didn't see it because it is very specific to mqueue (mqueue has a per-namespace superblock that is allocated at CLONE_NEWIPC time, and so the credentials you have when you do unshare(CLONE_NEWIPC) influence a bunch of attributes of the mount) but I would've expected it to pop up before. Since this is only an issue with newer SELinux-heavy distros, maybe there was a config change or kernel change that caused this behaviour to become a problem? Odd.

For some reason, runc 1.1.15 works fine (this bug was discovered in containerd/containerd#10877 which at the time was a mere runc bump to 1.2.0, and traced to 2 recent containerd commits listed in #4466 description).

I took a look at the diff between nexec.c in release-1.1 and main, but couldn't think of what could have caused the issue.

(Or alternatively, RedHat now prefers crun and so they don't test with runc anymore?)

AFAICS, runc is still being tested and used actively

rata · 2024-11-01T14:13:00Z

@kolyshkin runc 1.1.15 "works fine" just because the userns tests are skipped. runc 1.1.x doesn't support idmap mounts, and that is required for userns support in Kubernetes. So we just skip the tests if "runc features" doesn't include support for idmap mounts.

lifubang force-pushed the fix-join-userns-selinux branch 2 times, most recently from 277bed6 to 3d40be8 Compare October 24, 2024 15:15

lifubang mentioned this pull request Oct 24, 2024

containerd's TestPodUserNS fails with runc v1.2 (succeeds with crun) on SELinux distro: setxattr /[...]/dev/mqueue: operation not permitted #4466

Closed

lifubang requested review from cyphar, kolyshkin, AkihiroSuda and rata October 24, 2024 15:39

lifubang added this to the 1.2.1 milestone Oct 24, 2024

lifubang added the backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 label Oct 24, 2024

AkihiroSuda mentioned this pull request Oct 24, 2024

update runc binary to 1.2.1 containerd/containerd#10877

Merged

rata reviewed Oct 24, 2024

View reviewed changes

AkihiroSuda added area/selinux SELinux area/userns User Namespaces labels Oct 24, 2024

AkihiroSuda approved these changes Oct 24, 2024

View reviewed changes

kolyshkin reviewed Oct 24, 2024

View reviewed changes

libcontainer/nsenter/nsexec.c Outdated Show resolved Hide resolved

fuweid reviewed Oct 24, 2024

View reviewed changes

lifubang force-pushed the fix-join-userns-selinux branch from 3d40be8 to 09d2ef1 Compare October 24, 2024 22:04

lifubang changed the title ~~Should become root in the namespace after joined user ns~~ Adjust user ns join order and become root in the ns after joined user ns Oct 24, 2024

kolyshkin mentioned this pull request Oct 24, 2024

[1.2] libct/nsenter: become root after joining userns #4474

Closed

lifubang force-pushed the fix-join-userns-selinux branch from 09d2ef1 to 1771ca7 Compare October 25, 2024 01:26

lifubang changed the title ~~Adjust user ns join order and become root in the ns after joined user ns~~ libct/nsenter: become root after joining userns Oct 25, 2024

lifubang force-pushed the fix-join-userns-selinux branch from 1771ca7 to 5774a6f Compare October 25, 2024 01:38

cyphar reviewed Oct 25, 2024

View reviewed changes

libcontainer/nsenter/nsexec.c Outdated Show resolved Hide resolved

lifubang force-pushed the fix-join-userns-selinux branch from 5774a6f to dda7066 Compare October 25, 2024 02:15

lifubang force-pushed the fix-join-userns-selinux branch from dda7066 to fc30606 Compare October 25, 2024 02:17

lifubang and others added 2 commits October 25, 2024 13:40

test join other container userns with selinux enabled

34a9285

Signed-off-by: lifubang <[email protected]>

lifubang force-pushed the fix-join-userns-selinux branch from fc30606 to 34a9285 Compare October 25, 2024 06:01

cyphar approved these changes Oct 25, 2024

View reviewed changes

cyphar merged commit 22106a4 into opencontainers:main Oct 25, 2024
42 checks passed

This was referenced Oct 25, 2024

[test] checking "userns join other container userns[selinux enabled]" #4476

Closed

[1.2] libct/nsenter: become root after joining userns #4477

Merged

kolyshkin added backport/1.2-done A PR in main branch which has been backported to release-1.2 and removed backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 labels Oct 26, 2024

lifubang deleted the fix-join-userns-selinux branch October 26, 2024 05:13

kolyshkin removed this from the 1.2.1 milestone Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libct/nsenter: become root after joining userns #4473

libct/nsenter: become root after joining userns #4473

lifubang commented Oct 24, 2024 •

edited

Loading

rata left a comment •

edited

Loading

AkihiroSuda left a comment

fuweid Oct 24, 2024 •

edited

Loading

lifubang Oct 24, 2024

fuweid Oct 24, 2024 •

edited

Loading

fuweid Oct 24, 2024

fuweid Oct 24, 2024

kolyshkin Oct 24, 2024

fuweid Oct 24, 2024 •

edited

Loading

kolyshkin Oct 25, 2024

lifubang Oct 25, 2024

lifubang Oct 25, 2024 •

edited

Loading

lifubang commented Oct 24, 2024 •

edited

Loading

cyphar commented Oct 25, 2024 •

edited

Loading

lifubang commented Oct 25, 2024

lifubang commented Oct 25, 2024

kolyshkin commented Oct 25, 2024

cyphar commented Oct 25, 2024

lifubang commented Oct 25, 2024 •

edited

Loading

cyphar left a comment

cyphar commented Oct 25, 2024 •

edited

Loading

fuweid commented Oct 25, 2024

kolyshkin commented Oct 25, 2024

rata commented Nov 1, 2024

libct/nsenter: become root after joining userns #4473

libct/nsenter: become root after joining userns #4473

Conversation

lifubang commented Oct 24, 2024 • edited Loading

rata left a comment • edited Loading

Choose a reason for hiding this comment

AkihiroSuda left a comment

Choose a reason for hiding this comment

fuweid Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

lifubang Oct 24, 2024

Choose a reason for hiding this comment

fuweid Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

fuweid Oct 24, 2024

Choose a reason for hiding this comment

fuweid Oct 24, 2024

Choose a reason for hiding this comment

kolyshkin Oct 24, 2024

Choose a reason for hiding this comment

fuweid Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

kolyshkin Oct 25, 2024

Choose a reason for hiding this comment

lifubang Oct 25, 2024

Choose a reason for hiding this comment

lifubang Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

lifubang commented Oct 24, 2024 • edited Loading

cyphar commented Oct 25, 2024 • edited Loading

lifubang commented Oct 25, 2024

lifubang commented Oct 25, 2024

kolyshkin commented Oct 25, 2024

cyphar commented Oct 25, 2024

lifubang commented Oct 25, 2024 • edited Loading

cyphar left a comment

Choose a reason for hiding this comment

cyphar commented Oct 25, 2024 • edited Loading

fuweid commented Oct 25, 2024

kolyshkin commented Oct 25, 2024

rata commented Nov 1, 2024

lifubang commented Oct 24, 2024 •

edited

Loading

rata left a comment •

edited

Loading

fuweid Oct 24, 2024 •

edited

Loading

fuweid Oct 24, 2024 •

edited

Loading

fuweid Oct 24, 2024 •

edited

Loading

lifubang Oct 25, 2024 •

edited

Loading

lifubang commented Oct 24, 2024 •

edited

Loading

cyphar commented Oct 25, 2024 •

edited

Loading

lifubang commented Oct 25, 2024 •

edited

Loading

cyphar commented Oct 25, 2024 •

edited

Loading