Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/nsenter: become root after joining userns #4473

Merged
merged 2 commits into from
Oct 25, 2024

Conversation

lifubang
Copy link
Member

@lifubang lifubang commented Oct 24, 2024

Containerd pre-creates userns and netns before calling runc, which
results in the current code not working when SELinux is enabled,
resulting in the following error:

runc create failed: unable to start container process: error during
container init: error mounting "mqueue" to rootfs at "/dev/mqueue":
setxattr /path/to/rootfs/dev/mqueue: operation not permitted

The solution is to become root in the user namespace right after
we join it.

Fixes #4466.

Co-authored-by: Wei Fu [email protected]
Co-authored-by: Kir Kolyshkin [email protected]
Co-authored-by: Aleksa Sarai [email protected]
Signed-off-by: lifubang [email protected]

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifubang nice finding! I haven't checked the code yet. But here are some general comments.

This seems to be what crun does too! Here it skips the userns: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L3808. And then it joins it: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4615-L4617

However, crun is also handling a case where the container needs to join a PID namespace: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4562-L4568. I guess we need to do the same?

I'm unsure why we need to join the userns afterwards, though. Do you know why it is needed? Skimming through the man 7 user_namespaces I didn't see any mention

@AkihiroSuda AkihiroSuda added area/selinux SELinux area/userns User Namespaces labels Oct 24, 2024
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@@ -501,6 +506,12 @@ void join_namespaces(char *nslist)
struct namespace_t *ns = &namespaces[i];
int flag = nsflag(ns->type);

if (flag == CLONE_NEWUSER) {
Copy link
Member

@fuweid fuweid Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to use retsnoop to check and found this in v6.5 kernel

//  retsnoop -e '*setxattr' -e 'inode_owner_or_capable' -e 'ns_capable' -E -n 'runc:[2:INIT]' -T


21:47:54.401485 -> 21:47:54.401526 TID/PID 69055/69055 (runc:[2:INIT]/runc:[2:INIT]):

FUNCTION CALLS                                         RESULT    DURATION
----------------------------------------------------   --------  --------
→ __x64_sys_setxattr
    → path_setxattr
        → setxattr
            → do_setxattr
                → vfs_setxattr
                    → security_inode_setxattr
                        → selinux_inode_setxattr
                            ↔ inode_owner_or_capable   [false]    1.100us
                        ← selinux_inode_setxattr       [-EPERM]   2.700us
                    ← security_inode_setxattr          [-EPERM]   8.900us
                ← vfs_setxattr                         [-EPERM]  15.300us
            ← do_setxattr                              [-EPERM]  21.400us
        ← setxattr                                     [-EPERM]  24.000us
    ← path_setxattr                                    [-EPERM]  29.500us
← __x64_sys_setxattr                                   [-EPERM]  31.700us

https://github.com/torvalds/linux/blob/2dde18cd1d8fac735875f2e4987f11817cc0bc2c/security/selinux/hooks.c#L3151

Is it related to idmap here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the net ns path created in containerd is not owned by the user ns?

Copy link
Member

@fuweid fuweid Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the path is about rootfs/dev/mqueue instead of net ns.
The net ns is created by new pod user ns. (The pinned net ns path is created by containerd's user ns instead of pod user ns)

The patch works on my local. But it's still mystery to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your patch, the change works on my local as well.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 74e15b96..350a17fe 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -505,6 +505,11 @@ void join_namespaces(char *nslist)
                if (setns(ns->fd, flag) < 0)
                        bail("failed to setns into %s namespace", ns->type);

+               if (flag == CLONE_NEWUSER) {
+                       if (setresuid(0, 0, 0) < 0)
+                               bail("failed to become root in user namespace");
+               }
+
                close(ns->fd);
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on result of inode_owner_or_capable in linux, it looks like we use wrong fsuid to create /dev/, which isn't align with user mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying this in containerd/containerd#10893 and will try locally, too.

Copy link
Member

@fuweid fuweid Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifubang I'm using Ubuntu 22'04 with 6.5 kernel. Would you mind to use retsnoop to show the stack trace ? Thanks


I don't verify it under rootless scenario. Sorry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me locally in Fedora 40 Vagrant with SELinux enabled

[root@localhost containerd]# runc --version
runc version 1.2.0+dev
commit: v1.0.0-1706-g7169f2d9
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.5
[root@localhost containerd]# sestatus 
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33
[root@localhost containerd]# CONTAINERD_RUNTIME=io.containerd.runc.v2 FOCUS=TestPodUserNS ./script/test/cri-integration.sh
....
--- PASS: TestPodUserNS (5.24s)
    --- PASS: TestPodUserNS/volumes_permissions (1.32s)
    --- PASS: TestPodUserNS/fails_with_several_mappings (0.01s)
    --- PASS: TestPodUserNS/userns_uid_mapping (1.34s)
    --- PASS: TestPodUserNS/userns_gid_mapping (1.31s)
    --- PASS: TestPodUserNS/rootfs_permissions (1.25s)
PASS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, it can work, I put the code in the wrong place.
So, we should change this PR to let the change as little as possible?

Copy link
Member Author

@lifubang lifubang Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.
Thanks all of you @fuweid, @kolyshkin, @rata, and @AkihiroSuda for test.

@lifubang lifubang changed the title Should become root in the namespace after joined user ns Adjust user ns join order and become root in the ns after joined user ns Oct 24, 2024
@lifubang
Copy link
Member Author

lifubang commented Oct 24, 2024

nice finding!

Thanks, the thinking is from your clearly comments: #4466 (comment)

I'm unsure why we need to join the userns afterwards, though. Do you know why it is needed? Skimming through the man 7 user_namespaces I didn't see any mention

I think it just like what we do for clone new user ns, we should join namespaces before unshare new user ns, or else there may be some namespaces waiting join are not owned by the user namespace.

I guess we need to do the same?

After I see the crun code, I think it's not needed here, it's about hook run, I don't see which hook should be running when we are joining/unsharing namespaces. But maybe I'm wrong, I think if there is a bug to ref this, it will be more clear. Anyway, I think we can deal with it in a separate PR once we need it.

@lifubang lifubang changed the title Adjust user ns join order and become root in the ns after joined user ns libct/nsenter: become root after joining userns Oct 25, 2024
@cyphar
Copy link
Member

cyphar commented Oct 25, 2024

Out of interest, does this patch also work?

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 74e15b96d5fe..51ee474b408a 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -444,9 +444,9 @@ void nl_free(struct nlconfig_t *config)
 	free(config->data);
 }
 
-void join_namespaces(char *nslist)
+int join_namespaces(char *nslist)
 {
-	int num = 0, i;
+	int num = 0, joined_mask = 0, i;
 	char *saveptr = NULL;
 	char *namespace = strtok_r(nslist, ",", &saveptr);
 	struct namespace_t {
@@ -505,10 +505,12 @@ void join_namespaces(char *nslist)
 		if (setns(ns->fd, flag) < 0)
 			bail("failed to setns into %s namespace", ns->type);
 
+		joined_mask |= flag;
 		close(ns->fd);
 	}
 
 	free(namespaces);
+	return joined_mask;
 }
 
 static inline int sane_kill(pid_t pid, int signum)
@@ -829,6 +831,7 @@ void nsexec(void)
 	case STAGE_CHILD:{
 			pid_t stage2_pid = -1;
 			enum sync_t s;
+			int joined_mask = 0;
 
 			/* For debugging. */
 			current_stage = STAGE_CHILD;
@@ -849,7 +852,7 @@ void nsexec(void)
 			 * using cmsg(3) but that's just annoying.
 			 */
 			if (config.namespaces)
-				join_namespaces(config.namespaces);
+				joined_mask = join_namespaces(config.namespaces);
 
 			/*
 			 * Deal with user namespaces first. They are quite special, as they
@@ -907,7 +910,13 @@ void nsexec(void)
 					if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
 						bail("failed to re-set process as non-dumpable");
 				}
+			}
 
+			/*
+			 * If we swapped user namespaces at all, become root in the
+			 * namespace proper.
+			 */
+			if ((joined_mask | config.cloneflags) & CLONE_NEWUSER) {
 				/* Become root in the namespace proper. */
 				if (setresuid(0, 0, 0) < 0)
 					bail("failed to become root in user namespace");

This makes it somewhat clearer that the bug was that we forgot to setresuid in the CLONE_NEWUSER case and unifies the logic.

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

Also we need to adjust join_namespaces in a future patch to try to join the namespaces twice (to match nsenter -- see #4390) and the joined_mask is going to useful for that. But that's not too important.

@lifubang
Copy link
Member Author

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

SGTM, I'll try it later.

@lifubang
Copy link
Member Author

I don't think we need to do setresuid immediately after joining because namespace joining is based on capabilities and you inherit the right capabilities when you join -- but maybe setresuid immediately after is clearer... Hmmm...

SGTM, I'll try it later.

Unfortunately, it can't fix this issue.

@kolyshkin
Copy link
Contributor

It would be nice to have a test case. I'll try to come up with one tomorrow. We test on Fedora (not sure about selinux though) so we may be able to repro it.

@cyphar
Copy link
Member

cyphar commented Oct 25, 2024

Yeah, I think doing setresuid immediately is probably a better solution anyway... Sorry for the noise.

lifubang and others added 2 commits October 25, 2024 13:40
Containerd pre-creates userns and netns before calling runc, which
results in the current code not working when SELinux is enabled,
resulting in the following error:

> runc create failed: unable to start container process: error during
container init: error mounting "mqueue" to rootfs at "/dev/mqueue":
setxattr /path/to/rootfs/dev/mqueue: operation not permitted

The solution is to become root in the user namespace right after
we join it.

Fixes opencontainers#4466.

Co-authored-by: Wei Fu <[email protected]>
Co-authored-by: Kir Kolyshkin <[email protected]>
Co-authored-by: Aleksa Sarai <[email protected]>
Signed-off-by: lifubang <[email protected]>
@lifubang
Copy link
Member Author

lifubang commented Oct 25, 2024

It would be nice to have a test case.

Has add one test, I can confirm that this test is similar to #4466, it will be failure without this patch.

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @lifubang!

@cyphar
Copy link
Member

cyphar commented Oct 25, 2024

I'm quite surprised we never saw this issue before -- AFAICS this bug (not setting the uid/gid after joining) has existed for a long time (in 2016, when I did the somewhat-large rewrite of nsexec.c in 2cd9c31 I copied the old logic that had this bug as well).

I guess we didn't see it because it is very specific to mqueue (mqueue has a per-namespace superblock that is allocated at CLONE_NEWIPC time, and so the credentials you have when you do unshare(CLONE_NEWIPC) influence a bunch of attributes of the mount) but I would've expected it to pop up before. Since this is only an issue with newer SELinux-heavy distros, maybe there was a config change or kernel change that caused this behaviour to become a problem? Odd. (Or alternatively, RedHat now prefers crun and so they don't test with runc anymore?)

@cyphar cyphar merged commit 22106a4 into opencontainers:main Oct 25, 2024
42 checks passed
@fuweid
Copy link
Member

fuweid commented Oct 25, 2024

Thank you @lifubang !

@kolyshkin
Copy link
Contributor

I guess we didn't see it because it is very specific to mqueue (mqueue has a per-namespace superblock that is allocated at CLONE_NEWIPC time, and so the credentials you have when you do unshare(CLONE_NEWIPC) influence a bunch of attributes of the mount) but I would've expected it to pop up before. Since this is only an issue with newer SELinux-heavy distros, maybe there was a config change or kernel change that caused this behaviour to become a problem? Odd.

For some reason, runc 1.1.15 works fine (this bug was discovered in containerd/containerd#10877 which at the time was a mere runc bump to 1.2.0, and traced to 2 recent containerd commits listed in #4466 description).

I took a look at the diff between nexec.c in release-1.1 and main, but couldn't think of what could have caused the issue.

(Or alternatively, RedHat now prefers crun and so they don't test with runc anymore?)

AFAICS, runc is still being tested and used actively

@kolyshkin kolyshkin added backport/1.2-done A PR in main branch which has been backported to release-1.2 and removed backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 labels Oct 26, 2024
@lifubang lifubang deleted the fix-join-userns-selinux branch October 26, 2024 05:13
@kolyshkin kolyshkin removed this from the 1.2.1 milestone Oct 28, 2024
@rata
Copy link
Member

rata commented Nov 1, 2024

@kolyshkin runc 1.1.15 "works fine" just because the userns tests are skipped. runc 1.1.x doesn't support idmap mounts, and that is required for userns support in Kubernetes. So we just skip the tests if "runc features" doesn't include support for idmap mounts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/selinux SELinux area/userns User Namespaces backport/1.2-done A PR in main branch which has been backported to release-1.2
Projects
None yet
6 participants