'runc exec' errors with 'failed to setns into net namespace: Operation not permitted' #4390

Open
thundergolfer opened this issue Sep 1, 2024 · 5 comments · May be fixed by #4491 or #4492
@thundergolfer

thundergolfer commented Sep 1, 2024

Description

At modal.com we run a custom multi-tenant container runtime which can use runc or runsc (gVisor). For us runsc exec is working but we're hitting a failure on doing runc exec which I've debugged for a long time and can't root cause.

Doing runc exec ta-01J5P4BZS64CE57EXK048QMNE1 bash fails because of EPERM on attempting to enter the runc container's network namespace.

Using sudo strace -ft runc exec -cap CAP_SYS_ADMIN ta-01J5P4BZS64CE57EXK048QMNE1 bash I can see that specifically it's failing on the setns syscall like this:

[pid 1021859] 20:17:39 setns(11, CLONE_NEWNET) = -1 EPERM (Operation not permitted)

Oddly running sudo nsenter --all --target=267854 ls from the same terminal works. If I strace that command I can see that it makes the same syscalls as runc exec albeit in a different order.

17:30:23 openat(AT_FDCWD, "/proc/267854/ns/user", O_RDONLY) = 3
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/cgroup", O_RDONLY) = 4
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/ipc", O_RDONLY) = 5
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/uts", O_RDONLY) = 6
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/net", O_RDONLY) = 7
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/pid", O_RDONLY) = 8
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/mnt", O_RDONLY) = 9
17:30:23 setgroups(0, NULL)             = 0
17:30:23 setns(4, CLONE_NEWCGROUP)      = 0
17:30:23 close(4)                       = 0
17:30:23 setns(5, CLONE_NEWIPC)         = 0
17:30:23 close(5)                       = 0
17:30:23 setns(6, CLONE_NEWUTS)         = 0
17:30:23 close(6)                       = 0
17:30:23 setns(7, CLONE_NEWNET)         = 0
17:30:23 close(7)                       = 0
17:30:23 setns(8, CLONE_NEWPID)         = 0
17:30:23 close(8)                       = 0
17:30:23 setns(9, CLONE_NEWNS)          = 0
17:30:23 close(9)                       = 0
17:30:23 setns(3, CLONE_NEWUSER)        = 0
17:30:23 close(3)                       = 0
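
To compare the join order between the two traces, the setns calls can be pulled out of the strace output mechanically. The following is a minimal sketch, not part of the original report; the function name `setns_order` and the regular expression are my own, and it assumes strace lines shaped like the ones quoted above.

```python
import re

# Matches strace lines such as "17:30:23 setns(7, CLONE_NEWNET)  = 0".
SETNS_RE = re.compile(r"setns\((\d+), (CLONE_NEW[A-Z]+)\)\s*=\s*(-?\d+)")


def setns_order(strace_text):
    """Return (namespace_flag, succeeded) pairs in the order they were called."""
    calls = []
    for line in strace_text.splitlines():
        match = SETNS_RE.search(line)
        if match:
            _fd, flag, ret = match.groups()
            calls.append((flag, int(ret) == 0))
    return calls


# A few lines copied from the nsenter trace above.
sample = """\
17:30:23 setns(4, CLONE_NEWCGROUP)      = 0
17:30:23 close(4)                       = 0
17:30:23 setns(7, CLONE_NEWNET)         = 0
17:30:23 setns(3, CLONE_NEWUSER)        = 0
"""
for flag, ok in setns_order(sample):
    print(flag, "ok" if ok else "failed")
```

Running the same helper over the runc trace would show where CLONE_NEWUSER sits relative to CLONE_NEWNET in each tool.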

Things I've looked into:

  • Capabilities — I'm running with sudo so this shouldn't be a problem
  • Namespace hierarchy — looks correct
  • Seccomp — doesn't appear active
  • SELinux — is disabled
  • AppArmor — is disabled

I'm stuck on figuring out what's wrong here. My next move was going to be compiling my own runc to add debugging code into nsexec.c.

Steps to reproduce the issue

I fear this is tricky to reproduce, but I will provide details on what we're doing:

  1. From our container runtime running as root: runc --systemd-cgroup run ta-123 --bundle $BUNDLE_PATH
    a. config.json given below
  2. From a terminal on the same host: sudo runc --debug exec -c CAP_SYS_ADMIN ta-01J6NQG0GEHAQ07FTVHC4GAS64 ls

The container's network namespace is created from our container runtime with ip netns add ta-123 prior to container creation, and inside a CreateRuntime hook we use the CNI Bridge and Loopback plugins to setup lo and eth0.

config.json
{
  "annotations": {
    "org.systemd.property.IPAccounting": "true"
  },
  "hooks": {
    "createRuntime": [
      {
        "args": ["$OMITTED$"],
        "path": "/usr/bin/python3"
      }
    ],
    "postStop": [
      {
        "args": ["$OMITTED$"],
        "path": "/usr/bin/python3"
      }
    ]
  },
  "hostname": "modal",
  "linux": {
    "cgroupsPath": "modal.slice:container:ta-01J6NG5R94KSZSJAHD2XXMYXMS",
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ],
    "maskedPaths": [
      "/proc/acpi",
      "/proc/asound",
      "/proc/kcore",
      "/proc/keys",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/devices/virtual",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      },
      {
        "type": "cgroup"
      },
      {
        "path": "/run/netns/ta-01J6NG5R94KSZSJAHD2XXMYXMS",
        "type": "network"
      }
    ],
    "readonlyPaths": [
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ],
    "resources": {
      "cpu": {
        "period": 100000,
        "quota": 412500,
        "shares": 128
      },
      "memory": {
        "reservation": 134217728
      }
    },
    "sysctl": {},
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ]
  },
  "mounts": [
    {
      "destination": "/proc",
      "source": "proc",
      "type": "proc"
    },
    {
      "destination": "/dev",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ],
      "source": "tmpfs",
      "type": "tmpfs"
    },
    {
      "destination": "/dev/pts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ],
      "source": "devpts",
      "type": "devpts"
    },
    {
      "destination": "/dev/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ],
      "source": "shm",
      "type": "tmpfs"
    },
    {
      "destination": "/dev/mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ],
      "source": "mqueue",
      "type": "mqueue"
    },
    {
      "destination": "/sys",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro",
        "rbind"
      ],
      "source": "/sys",
      "type": "bind"
    },
    {
      "destination": "/sys/fs/cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ],
      "source": "cgroup",
      "type": "cgroup"
    },
    {
      "destination": "/etc/resolv.conf",
      "options": [
        "ro",
        "rbind",
        "rprivate",
        "nosuid",
        "noexec",
        "nodev"
      ],
      "source": "/opt/container-etc-resolv.conf",
      "type": "bind"
    },
    {
      "destination": "/run/modal.sock",
      "options": [
        "nosuid",
        "nodev",
        "noexec",
        "bind",
        "private"
      ],
      "source": "/run/modal-ta-01J6NG5R94KSZSJAHD2XXMYXMS-388233379.sock",
      "type": "bind"
    }
  ],
  "ociVersion": "1.0.2-dev",
  "process": {
    "args": [
      "/bin/dumb-init",
      "--",
      "python",
      "-u",
      "-R",
      "--check-hash-based-pycs",
      "never",
      "-m",
      "modal._container_entrypoint",
     "Ch10YS0wMUo2Tkc1Ujk0S1NaU0pBSEQyWFhNWVhNUxIZZnUtMGRmNTA1WFU4ZnFrZkp5a2RCQUV3bCIZYXAtd1JBaXM5UlhJWnpFcDhrRXhFT2lMeDqPAQoCZjESAWYaGW1vLVFBUnBqRDNvbDJhWFd1V0pyYUxBclIaGW1vLWNNQkhGWUJLTHZyemowN1lXeWM5NjQiGWltLWN1eWltWlBhSXR3SjFQamowTFB5OTc4AkACSgIiAKgBkE7yAQRydW5j2gIbChlpbS1jdXlpbVpQYUl0d0oxUGpqMExQeTk3wgMA+AMBagRtYWlu"
    ],
    "capabilities": {
      "ambient": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ]
    },
    "cwd": "/root",
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "SSL_CERT_DIR=/etc/ssl/certs",
      "SOURCE_DATE_EPOCH=1641013200",
      "PIP_NO_CACHE_DIR=off",
      "PYTHONHASHSEED=0",
      "PIP_ROOT_USER_ACTION=ignore",
      "CFLAGS=-g0",
      "PIP_DEFAULT_TIMEOUT=30",
      "BLIS_NUM_THREADS=1",
      "GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
      "LANG=C.UTF-8",
      "MKL_NUM_THREADS=1",
      "OMP_NUM_THREADS=1",
      "OPENBLAS_NUM_THREADS=1",
      "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "PYTHONPATH=/pkg/:/root/"
    ],
    "noNewPrivileges": true,
    "rlimits": [
      {
        "hard": 65536,
        "soft": 65536,
        "type": "RLIMIT_NOFILE"
      }
    ],
    "terminal": false,
    "user": {
      "gid": 0,
      "uid": 0
    }
  },
  "root": {
    "path": "/tmp/task-data-cTR0Dv/ta-01J6NG5R94KSZSJAHD2XXMYXMS/.tmpZue7Vk/rootfs",
    "readonly": false
  }
}
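
For readers skimming the config, a short script can list which namespaces the spec asks runc to create and which it joins by path. This is only an illustrative sketch: the helper `namespace_summary` is a hypothetical name, and the embedded JSON is a trimmed stand-in for the config above, using the `ta-123` placeholder from the reproduction steps.

```python
import json


def namespace_summary(config):
    """Split the spec's namespaces into ones runc creates and ones it joins by path."""
    created, joined = [], {}
    for ns in config.get("linux", {}).get("namespaces", []):
        if "path" in ns:
            joined[ns["type"]] = ns["path"]
        else:
            created.append(ns["type"])
    return created, joined


# Trimmed example in the shape of the config.json above (not the full file).
example = json.loads("""
{"linux": {"namespaces": [
  {"type": "user"},
  {"type": "mount"},
  {"path": "/run/netns/ta-123", "type": "network"}
]}}
""")
created, joined = namespace_summary(example)
print("created by runc:", created)
print("joined by path:", joined)
```

In this config the user namespace is created by runc while the network namespace is joined from a path prepared beforehand.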

Describe the results you received and expected

I expect that runc exec will succeed, but it fails on entering the network namespace. Full failure:

sudo runc --debug exec -c CAP_SYS_ADMIN ta-01J6NQG0GEHAQ07FTVHC4GAS64 ip
DEBU[0000] nsexec[1889889]: => nsexec container setup
DEBU[0000] nsexec[1889889]: set process as non-dumpable
DEBU[0000] nsexec-0[1889889]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1889889]: spawn stage-1
DEBU[0000] nsexec-0[1889889]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1889891]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1889891]: setns(0x10000000) into user namespace (with path /proc/1813667/ns/user)
DEBU[0000] nsexec-1[1889891]: setns(0x8000000) into ipc namespace (with path /proc/1813667/ns/ipc)
DEBU[0000] nsexec-1[1889891]: setns(0x4000000) into uts namespace (with path /proc/1813667/ns/uts)
DEBU[0000] nsexec-1[1889891]: setns(0x40000000) into net namespace (with path /proc/1813667/ns/net)
FATA[0000] nsexec-1[1889891]: failed to setns into net namespace: Operation not permitted
FATA[0000] nsexec-0[1889889]: failed to sync with stage-1: next state: Invalid argument
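
The hex values in the setns debug lines are the Linux CLONE_NEW* flags. A small lookup table, sketched here as a reading aid (the helper name `flag_name` is my own), maps them back to their symbolic names:

```python
# Values of the CLONE_NEW* constants from the Linux headers.
CLONE_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}


def flag_name(value):
    """Return the symbolic name for a single namespace flag, or its hex form."""
    return CLONE_FLAGS.get(value, hex(value))


# The four values that appear in the debug log above, in log order.
for raw in (0x10000000, 0x8000000, 0x4000000, 0x40000000):
    print(hex(raw), flag_name(raw))
```

Read this way, the log shows the user namespace being joined first and the net namespace join failing afterwards.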

What version of runc are you using?

runc --version
runc version 1.7.19
commit: v1.1.13-0-g58aa920
spec: 1.0.2-dev
go: go1.21.12
libseccomp: 2.5.1

and

./runc.amd64 --version
runc version 1.1.13
commit: v1.1.13-0-g58aa9203-dirty
spec: 1.0.2-dev
go: go1.21.11
libseccomp: 2.5.5

Host OS information

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Also reproduced on Oracle Linux.

Host kernel information

Linux ip-10-1-1-198 5.15.0-1068-aws #74~20.04.1-Ubuntu SMP Tue Aug 6 19:32:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@cyphar
Member

cyphar commented Sep 2, 2024

The order you join namespaces is important. All namespaces have an associated user namespace that is considered its "owner" and all permission checks are done based on that namespace.

runc joins the user namespace first and then all other namespaces (this is necessary for rootless containers to work -- an unprivileged user can't create/join any namespace other than user namespaces, so you need to create/join the user namespace first).

However, once you create/join a user namespace your privileges are forcefully scoped to that namespace and so you no longer have host root privileges. Since you created the network namespace outside of the container (and thus outside of the container's user namespace), once we join the user namespace we no longer have the privileges needed to join that network namespace.
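
The ownership rule can be illustrated with a toy model. This is a deliberate simplification rather than the kernel's actual check: it assumes the caller holds full capabilities in whatever user namespace it is currently in, and the names `UserNS` and `can_setns` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None


def is_same_or_ancestor(candidate, target):
    """True if `candidate` is `target` or one of its ancestors."""
    node = target
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def can_setns(current_userns, owner_userns):
    """Toy rule: the caller may join a namespace only if its current user
    namespace owns that namespace or is an ancestor of the owner."""
    return is_same_or_ancestor(current_userns, owner_userns)


host = UserNS("host")
container = UserNS("container", parent=host)

# The netns was created on the host, so the host user namespace owns it.
print("join netns from host userns:", can_setns(host, host))
print("join netns after entering container userns:", can_setns(container, host))
# Namespaces created by runc inside the container are owned by its userns.
print("join container-owned ns from host userns:", can_setns(host, container))
```

Under this model, joining the net namespace before the user namespace is allowed, while joining it afterwards is refused, which matches the EPERM in the report.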

There are three things we can do:

  • runc could probably try joining the namespaces twice -- once before joining the user namespace and then once after joining the user namespace (this is what nsenter does, it seems). This would let us join namespaces regardless of whether we only have access to it before or after the userns join. I'll write a patch for this.

  • You can try to create your network namespace inside a user namespace and then get the container to join both. The only possible issue is that you probably do some setup with the network namespace and doing that with a user namespaced network namespace may be a little more complicated (you would probably need to run your setup process as a separate privileged process that only joins the network namespace so you have the privileges necessary to set up bridge networks or whatever else you need to do).

  • Depending on what your exact needs are, you might be able to use pre-start hooks to configure the network namespace without having to create it outside of runc (runc would create the namespaces and you would be able to run your setup code before the container starts).
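
The first option (joining twice) can be sketched as follows. This is not runc's implementation; it is a simulation with an injected `setns` callable, and `join_namespaces`, `make_fake_setns`, and the `/proc/1234/...` paths are all hypothetical names chosen for illustration.

```python
def join_namespaces(paths, setns):
    """Join non-user namespaces first, then the user namespace, then retry the rest.

    `paths` maps a namespace type to its path. `setns(ns_type, path)` is an injected
    callable (hypothetical, for illustration) that raises PermissionError on refusal.
    Returns the namespace types in the order they were actually joined.
    """
    joined, deferred = [], []
    for ns_type, path in paths.items():
        if ns_type == "user":
            continue
        try:
            setns(ns_type, path)
            joined.append(ns_type)
        except PermissionError:
            deferred.append(ns_type)  # may become joinable after the userns join
    if "user" in paths:
        setns("user", paths["user"])
        joined.append("user")
    for ns_type in deferred:
        setns(ns_type, paths[ns_type])  # a second refusal is a real error
        joined.append(ns_type)
    return joined


def make_fake_setns(owners, host_privileged):
    """Build a toy setns that tracks which user namespace the caller is in."""
    state = {"userns": "host"}

    def fake_setns(ns_type, path):
        if ns_type == "user":
            state["userns"] = "container"  # simplification: this join always succeeds
            return
        if state["userns"] == "host":
            allowed = host_privileged
        else:
            allowed = owners[ns_type] == "container"
        if not allowed:
            raise PermissionError(f"setns into {ns_type} namespace refused")

    return fake_setns


# Hypothetical paths; 1234 stands in for the container's init PID.
PATHS = {
    "user": "/proc/1234/ns/user",
    "ipc": "/proc/1234/ns/ipc",
    "net": "/proc/1234/ns/net",
}

# Root on the host, network namespace owned by the host user namespace.
root_case = join_namespaces(
    PATHS, make_fake_setns({"ipc": "container", "net": "host"}, True))
# Unprivileged caller, every namespace owned by the container's user namespace.
rootless_case = join_namespaces(
    PATHS, make_fake_setns({"ipc": "container", "net": "container"}, False))
print(root_case)
print(rootless_case)
```

The two cases show why both passes are useful: the privileged caller joins everything before the user namespace, while the rootless caller can only join the others after it.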

@thundergolfer
Author

thundergolfer commented Sep 2, 2024

Thank you for your excellent comment. I think I got blocked because I wasn't testing the path dependence of joining net after user and I didn't realize that nsenter was doing that 'join twice' behavior.

I'll pursue the direction you point at in your second bullet point. I think doing so will allow me to become more comfortable with the details of namespace sharing.

@cyphar
Member

cyphar commented Sep 7, 2024

I'll send the patch in a week or two.

@rata
Member

rata commented Sep 10, 2024

@thundergolfer a fourth option is to let runc create the userns and the netns. This way, runc makes sure to create them in the right order (so it has the right ownership) and it is quite simple for you. Combining this with the fact that runc has runc create and runc start, you can configure the network after create (that creates the namespaces and all, but doesn't start the container process) and then do the start and start the container process. I don't know what you are doing exactly with the netns, but maybe this works fine for you.
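
A rough sketch of that create, configure, start sequence, driven from Python, might look like the following. It assumes `runc create --bundle`, `runc state` (whose JSON output includes a `pid` field), and `runc start`; the function names and the `configure_network` callback are hypothetical placeholders for whatever network setup is needed.

```python
import json
import subprocess


def run_cmd(args, capture=False):
    """Run a command; return stdout when captured, otherwise an empty string.

    Output is not captured by default so the container's init process does not
    hold on to a pipe that this script is waiting to read from.
    """
    result = subprocess.run(args, check=True, capture_output=capture, text=True)
    return result.stdout if capture else ""


def create_configure_start(container_id, bundle, configure_network, run=run_cmd):
    """Create the container, set up its netns from the host, then start it."""
    # 1. runc creates all namespaces (userns first) but does not run the process.
    run(["runc", "create", "--bundle", bundle, container_id])
    # 2. Look up the init PID so the new network namespace can be addressed.
    state = json.loads(run(["runc", "state", container_id], capture=True))
    configure_network(f"/proc/{state['pid']}/ns/net")
    # 3. Start the user process once networking is in place.
    run(["runc", "start", container_id])


# Example call (not executed here), with a caller-supplied setup function:
# create_configure_start("ta-123", "/path/to/bundle", my_cni_setup)
```

The `run` parameter exists only so the sequence can be exercised with a stub instead of a real runc binary.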

This is what we are doing in containerd. Although we might change it, maybe for 2.0, due to other redesigns in containerd that made it a better fit to create the userns+netns in containerd. In case you are interested in that, what we are considering in containerd now is this: containerd/containerd#10607. IOW, let containerd create the userns AND netns and specify that to runc. This would be option 2.

The "trick" we are using there is to create a new process with the CLONE_NEWUSER and CLONE_NEWNET flags (https://github.com/containerd/containerd/pull/10607/files#diff-106945d93d68e955471ccab149a1302ebb7214c1832b7df0bbd8855992ddf397R49-R55). Linux does the right thing regarding ownership (there was a bug, I think, in very old kernels, like 3.x). We then open the fd of the namespace (open("/proc/pid/ns/net"), and similarly for user) and mount it in the fs. This makes the namespace persistent (the process can crash and it won't be destroyed) and we use that path for the namespace in the config.json.
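
On the config.json side, pointing the spec at persisted namespaces amounts to adding a `path` to the user and network entries. The sketch below only rewrites the namespace list; the helper name and the `/run/userns/ta-123` path are hypothetical, and it leaves uidMappings and gidMappings alone (whether they need to change when the user namespace is joined by path is not covered here).

```python
def with_shared_namespaces(namespaces, user_path, net_path):
    """Return a copy of the spec's namespace list that joins user and network by path."""
    paths = {"user": user_path, "network": net_path}
    updated = []
    for ns in namespaces:
        entry = {"type": ns["type"]}
        if ns["type"] in paths:
            entry["path"] = paths[ns["type"]]
        elif "path" in ns:
            entry["path"] = ns["path"]  # keep any other pre-existing path as-is
        updated.append(entry)
    return updated


# Trimmed namespace list in the shape used by the config.json in this issue.
original = [{"type": "pid"}, {"type": "user"}, {"type": "network"}]
# Both paths are illustrative mount points for the persisted namespaces.
shared = with_shared_namespaces(original, "/run/userns/ta-123", "/run/netns/ta-123")
for entry in shared:
    print(entry)
```

Because both namespaces come from the same process, the network namespace is owned by the user namespace the container joins.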

@cyphar
Member

cyphar commented Sep 10, 2024

@rata That is the third option I suggested, though maybe I could've phrased it better 😅 . You can do the same thing with runc create or with hooks.

lifubang added a commit to lifubang/runc that referenced this issue Oct 29, 2024
We should first join as many namespaces as possible, except the user namespace.
Then we can join the remaining namespaces after we join/unshare the user ns. (opencontainers#4390)

Signed-off-by: lifubang <[email protected]>
lifubang added a commit to lifubang/runc that referenced this issue Oct 30, 2024
We should first join as many namespaces as possible, except the user namespace,
because some ns paths may not be owned by the user namespace we want
to join; then we can join the remaining namespaces after we join/unshare the user ns.
Please see opencontainers#4390.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang linked a pull request Oct 30, 2024 that will close this issue
@cyphar cyphar linked a pull request Oct 30, 2024 that will close this issue