-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/nsenter: become root after joining userns #4473
Conversation
277bed6
to
3d40be8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifubang nice finding! I haven't checked the code yet. But here are some general comments.
This seems to be what crun does too! Here it skips the userns: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L3808. And then it joins it: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4615-L4617
However, crun is also handling a case where the container needs to join a PID namespace: https://github.com/containers/crun/blob/2381713dd61e012eaf15458030eae627acd3c1c2/src/libcrun/linux.c#L4562-L4568. I guess we need to do the same?
I'm unsure why we need to join the userns afterwards, though. Do you know why it is needed? Skimming through the man 7 user_namespaces I didn't see any mention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
libcontainer/nsenter/nsexec.c
Outdated
@@ -501,6 +506,12 @@ void join_namespaces(char *nslist) | |||
struct namespace_t *ns = &namespaces[i]; | |||
int flag = nsflag(ns->type); | |||
|
|||
if (flag == CLONE_NEWUSER) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to use retsnoop
to check and found this in v6.5 kernel
// retsnoop -e '*setxattr' -e 'inode_owner_or_capable' -e 'ns_capable' -E -n 'runc:[2:INIT]' -T
21:47:54.401485 -> 21:47:54.401526 TID/PID 69055/69055 (runc:[2:INIT]/runc:[2:INIT]):
FUNCTION CALLS RESULT DURATION
---------------------------------------------------- -------- --------
→ __x64_sys_setxattr
→ path_setxattr
→ setxattr
→ do_setxattr
→ vfs_setxattr
→ security_inode_setxattr
→ selinux_inode_setxattr
↔ inode_owner_or_capable [false] 1.100us
← selinux_inode_setxattr [-EPERM] 2.700us
← security_inode_setxattr [-EPERM] 8.900us
← vfs_setxattr [-EPERM] 15.300us
← do_setxattr [-EPERM] 21.400us
← setxattr [-EPERM] 24.000us
← path_setxattr [-EPERM] 29.500us
← __x64_sys_setxattr [-EPERM] 31.700us
Is it related to idmap here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the net ns path created in containerd is not owned by the user ns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the path is about rootfs/dev/mqueue instead of net ns.
The net ns is created by new pod user ns. (The pinned net ns path is created by containerd's user ns instead of pod user ns)
The patch works on my local. But it's still mystery to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your patch, the change works on my local as well.
diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 74e15b96..350a17fe 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -505,6 +505,11 @@ void join_namespaces(char *nslist)
if (setns(ns->fd, flag) < 0)
bail("failed to setns into %s namespace", ns->type);
+ if (flag == CLONE_NEWUSER) {
+ if (setresuid(0, 0, 0) < 0)
+ bail("failed to become root in user namespace");
+ }
+
close(ns->fd);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on result of inode_owner_or_capable
in linux, it looks like we use wrong fsuid to create /dev/
, which isn't align with user mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying this in containerd/containerd#10893 and will try locally, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifubang I'm using Ubuntu 22'04 with 6.5 kernel. Would you mind to use retsnoop to show the stack trace ? Thanks
I don't verify it under rootless scenario. Sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me locally in Fedora 40 Vagrant with SELinux enabled
[root@localhost containerd]# runc --version
runc version 1.2.0+dev
commit: v1.0.0-1706-g7169f2d9
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.5
[root@localhost containerd]# sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Memory protection checking: actual (secure)
Max kernel policy version: 33
[root@localhost containerd]# CONTAINERD_RUNTIME=io.containerd.runc.v2 FOCUS=TestPodUserNS ./script/test/cri-integration.sh
....
--- PASS: TestPodUserNS (5.24s)
--- PASS: TestPodUserNS/volumes_permissions (1.32s)
--- PASS: TestPodUserNS/fails_with_several_mappings (0.01s)
--- PASS: TestPodUserNS/userns_uid_mapping (1.34s)
--- PASS: TestPodUserNS/userns_gid_mapping (1.31s)
--- PASS: TestPodUserNS/rootfs_permissions (1.25s)
PASS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, it can work, I put the code in the wrong place.
So, we should change this PR to let the change as little as possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Thanks all of you @fuweid, @kolyshkin, @rata, and @AkihiroSuda for test.
3d40be8
to
09d2ef1
Compare
Thanks, the thinking is from your clearly comments: #4466 (comment)
I think it just like what we do for clone new user ns, we should join namespaces before unshare new user ns, or else there may be some namespaces waiting join are not owned by the user namespace.
After I see the crun code, I think it's not needed here, it's about hook run, I don't see which hook should be running when we are joining/unsharing namespaces. But maybe I'm wrong, I think if there is a bug to ref this, it will be more clear. Anyway, I think we can deal with it in a separate PR once we need it. |
09d2ef1
to
1771ca7
Compare
1771ca7
to
5774a6f
Compare
Out of interest, does this patch also work? diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 74e15b96d5fe..51ee474b408a 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -444,9 +444,9 @@ void nl_free(struct nlconfig_t *config)
free(config->data);
}
-void join_namespaces(char *nslist)
+int join_namespaces(char *nslist)
{
- int num = 0, i;
+ int num = 0, joined_mask = 0, i;
char *saveptr = NULL;
char *namespace = strtok_r(nslist, ",", &saveptr);
struct namespace_t {
@@ -505,10 +505,12 @@ void join_namespaces(char *nslist)
if (setns(ns->fd, flag) < 0)
bail("failed to setns into %s namespace", ns->type);
+ joined_mask |= flag;
close(ns->fd);
}
free(namespaces);
+ return joined_mask;
}
static inline int sane_kill(pid_t pid, int signum)
@@ -829,6 +831,7 @@ void nsexec(void)
case STAGE_CHILD:{
pid_t stage2_pid = -1;
enum sync_t s;
+ int joined_mask = 0;
/* For debugging. */
current_stage = STAGE_CHILD;
@@ -849,7 +852,7 @@ void nsexec(void)
* using cmsg(3) but that's just annoying.
*/
if (config.namespaces)
- join_namespaces(config.namespaces);
+ joined_mask = join_namespaces(config.namespaces);
/*
* Deal with user namespaces first. They are quite special, as they
@@ -907,7 +910,13 @@ void nsexec(void)
if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
bail("failed to re-set process as non-dumpable");
}
+ }
+ /*
+ * If we swapped user namespaces at all, become root in the
+ * namespace proper.
+ */
+ if ((joined_mask | config.cloneflags) & CLONE_NEWUSER) {
/* Become root in the namespace proper. */
if (setresuid(0, 0, 0) < 0)
bail("failed to become root in user namespace"); This makes it somewhat clearer that the bug was that we forgot to I don't think we need to do Also we need to adjust |
SGTM, I'll try it later. |
5774a6f
to
dda7066
Compare
Unfortunately, it can't fix this issue. |
dda7066
to
fc30606
Compare
It would be nice to have a test case. I'll try to come up with one tomorrow. We test on Fedora (not sure about selinux though) so we may be able to repro it. |
Yeah, I think doing |
Containerd pre-creates userns and netns before calling runc, which results in the current code not working when SELinux is enabled, resulting in the following error: > runc create failed: unable to start container process: error during container init: error mounting "mqueue" to rootfs at "/dev/mqueue": setxattr /path/to/rootfs/dev/mqueue: operation not permitted The solution is to become root in the user namespace right after we join it. Fixes opencontainers#4466. Co-authored-by: Wei Fu <[email protected]> Co-authored-by: Kir Kolyshkin <[email protected]> Co-authored-by: Aleksa Sarai <[email protected]> Signed-off-by: lifubang <[email protected]>
Signed-off-by: lifubang <[email protected]>
fc30606
to
34a9285
Compare
Has add one test, I can confirm that this test is similar to #4466, it will be failure without this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @lifubang!
I'm quite surprised we never saw this issue before -- AFAICS this bug (not setting the uid/gid after joining) has existed for a long time (in 2016, when I did the somewhat-large rewrite of I guess we didn't see it because it is very specific to |
Thank you @lifubang ! |
For some reason, runc 1.1.15 works fine (this bug was discovered in containerd/containerd#10877 which at the time was a mere runc bump to 1.2.0, and traced to 2 recent containerd commits listed in #4466 description). I took a look at the diff between nexec.c in release-1.1 and main, but couldn't think of what could have caused the issue.
AFAICS, runc is still being tested and used actively |
@kolyshkin runc 1.1.15 "works fine" just because the userns tests are skipped. runc 1.1.x doesn't support idmap mounts, and that is required for userns support in Kubernetes. So we just skip the tests if "runc features" doesn't include support for idmap mounts. |
Containerd pre-creates userns and netns before calling runc, which
results in the current code not working when SELinux is enabled,
resulting in the following error:
The solution is to become root in the user namespace right after
we join it.
Fixes #4466.
Co-authored-by: Wei Fu [email protected]
Co-authored-by: Kir Kolyshkin [email protected]
Co-authored-by: Aleksa Sarai [email protected]
Signed-off-by: lifubang [email protected]