
Rootless Podman exposes whole /sys/fs/cgroup/ to container while in "partial" isolation #20073

Closed · rockdrilla opened this issue Sep 20, 2023 · 3 comments · Fixed by #20086

rockdrilla commented Sep 20, 2023

Issue Description

Rootless Podman exposes whole /sys/fs/cgroup/ to container while in "partial" isolation.

$ podman run --rm --network=host docker.io/library/debian ls -l /sys/fs/cgroup
total 0
-r--r--r--  1 nobody nogroup 0 Sep 19 15:15 cgroup.controllers
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.max.depth
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.max.descendants
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.pressure
-rw-r--r--  1 nobody nogroup 0 Sep 19 15:15 cgroup.procs
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.stat
-rw-r--r--  1 nobody nogroup 0 Sep 20 22:26 cgroup.subtree_control
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.threads
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cpu.pressure
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cpu.stat
-r--r--r--  1 nobody nogroup 0 Sep 19 15:19 cpuset.cpus.effective
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cpuset.mems.effective
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 dev-hugepages.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 dev-mqueue.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 init.scope
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.cost.model
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.cost.qos
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.pressure
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.prio.class
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 io.stat
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:26 machine.slice
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.numa_stat
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.pressure
--w-------  1 nobody nogroup 0 Sep 20 21:01 memory.reclaim
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.stat
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 misc.capacity
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 misc.current
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-fs-fuse-connections.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-config.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-debug.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-tracing.mount
drwxr-xr-x 37 nobody nogroup 0 Sep 20 22:33 system.slice
drwxr-xr-x  3 nobody nogroup 0 Sep 20 22:14 user.slice

Correct behavior (achieved with --systemd=always):

$ podman run --rm --network=host --systemd=always docker.io/library/debian ls -l /sys/fs/cgroup
total 0
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.controllers
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.freeze
--w------- 1 root root 0 Sep 20 22:34 cgroup.kill
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.max.depth
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.pressure
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.procs
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.threads
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.type
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.idle
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.max.burst
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.pressure
-r--r--r-- 1 root root 0 Sep 20 22:34 cpu.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.uclamp.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.uclamp.min
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.weight
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.weight.nice
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus
-r--r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.mems
-r--r--r-- 1 root root 0 Sep 20 22:34 cpuset.mems.effective
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.bfq.weight
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.latency
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.pressure
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.prio.class
-r--r--r-- 1 root root 0 Sep 20 22:34 io.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.weight
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.current
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.events
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.events.local
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.high
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.low
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.min
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.numa_stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.oom.group
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.peak
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.pressure
--w------- 1 root root 0 Sep 20 22:34 memory.reclaim
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.stat
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.current
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.swap.high
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.swap.max
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.peak
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.zswap.current
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.zswap.max
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.current
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 pids.max
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.peak

However, /proc/self/mountinfo and /proc/self/cgroup look "sane" at first glance (but they are not):

$ podman run --rm --network=host docker.io/library/debian sh -ec 'cat /proc/self/cgroup ; echo ; grep cgroup /proc/self/mountinfo'
0::/

582 580 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
597 582 0:26 /../../../../../.. /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot

Correct behavior:

$ podman run --rm --network=host --systemd=always docker.io/library/debian sh -ec 'cat /proc/self/cgroup ; echo ; grep cgroup /proc/self/mountinfo'
0::/

584 582 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
601 584 0:79 / /sys/fs/cgroup rw,relatime - tmpfs tmpfs rw,size=4k,nr_inodes=1,uid=1000,gid=1000,inode64
602 601 0:26 / /sys/fs/cgroup rw,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
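
Note the root field (the fourth column of mountinfo): in the broken case the cgroup2 mount is rooted at /../../../../../.., i.e. above the container's cgroup namespace, while in the fixed case the topmost cgroup2 mount is rooted at /. A quick check that can be run inside any container (plain shell plus awk; this is a diagnostic sketch, not part of Podman):

$ grep -F ' - cgroup2 ' /proc/self/mountinfo \
    | awk '{ print (($4 ~ /\.\./) ? "leaky:" : "ok:"), $4, "on", $5 }'

In the failing case both entries print "leaky:"; with --systemd=always the final, topmost mount prints "ok: /".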

Steps to reproduce the issue

  1. Run a container with partial isolation (e.g. --network=host) and with systemd in "auto" mode (i.e. without specifying --systemd=always).
  2. Inspect /sys/fs/cgroup/.

Example:

# list every memory.max under /sys/fs/cgroup whose value is not the default "max"
command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"

Describe the results you received

/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/container/memory.max:2147483648
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/memory.max:2147483648

Describe the results you expected

/sys/fs/cgroup/memory.max:2147483648
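
Equivalently, once the container sees only its own cgroup, a single direct read should return the limit (a minimal check, same image as above):

$ podman run --rm -m 2G --network=host docker.io/library/debian cat /sys/fs/cgroup/memory.max

Expected output: 2147483648.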

podman info output

host:
  arch: amd64
  buildahVersion: 1.31.2
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.6+ds1-1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: unknown'
  cpuUtilization:
    idlePercent: 98.05
    systemPercent: 0.4
    userPercent: 1.55
  cpus: 12
  databaseBackend: boltdb
  distribution:
    codename: trixie
    distribution: debian
    version: unknown
  eventLogger: file
  freeLocks: 2029
  hostname: lenovatio
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.5.4-1-mobile
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 13487517696
  memTotal: 31386062848
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: golang-github-containernetworking-plugin-dnsname_1.3.1+ds1-2+b8_amd64
      path: /usr/lib/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
        CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
    package: 'golang-github-containernetworking-plugin-dnsname, containernetworking-plugins:
      /usr/lib/cni'
    path: /usr/lib/cni
  ociRuntime:
    name: crun
    package: crun_1.9-1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.9
      commit: a538ac4ea1ff319bcfe2bf81cb5c6f687e2dc9d3
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.1-1_amd64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 30h 55m 46.00s (Approximately 1.25 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  127.0.0.1:8080:
    Blocked: false
    Insecure: true
    Location: 127.0.0.1:8080
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: 127.0.0.1:8080
    PullFromMirror: ""
  127.0.0.1:8082:
    Blocked: false
    Insecure: true
    Location: 127.0.0.1:8082
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: 127.0.0.1:8082
    PullFromMirror: ""
store:
  configFile: /home/krd/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs_1.10-1_amd64
      Version: |-
        fusermount3 version: 3.14.0
        fuse-overlayfs: version 1.10
        FUSE library version 3.14.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/krd/.local/share/containers/storage
  graphRootAllocated: 485560172544
  graphRootUsed: 370000158720
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /tmp/user/1000
  imageStore:
    number: 154
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/krd/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.2
  Built: 0
  BuiltTime: Thu Jan  1 03:00:00 1970
  GitCommit: ""
  GoVersion: go1.21.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.2

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Additional information

Running rootless Podman:

$ command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

$ podman run --rm -m 2G docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

$ podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/container/memory.max:2147483648
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/memory.max:2147483648

$ podman run --rm -m 2G --network=host --systemd=always docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

Running rootful Podman:

# command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

# podman run --rm -m 2G docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

# podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

# podman run --rm -m 2G --network=host --systemd=always docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648
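
For convenience, the three cases can be driven from one throwaway script (the file name cgroup-check.sh is arbitrary; only the find pipeline comes from the commands above):

$ cat > cgroup-check.sh <<'EOF'
#!/bin/sh
# compare cgroup visibility across isolation modes
check='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'
for extra in "" "--network=host" "--network=host --systemd=always"; do
    echo "== podman run --rm -m 2G $extra =="
    # $extra is intentionally unquoted so it splits into separate flags
    podman run --rm -m 2G $extra docker.io/library/debian sh -ec "$check"
done
EOF
$ sh cgroup-check.sh
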
rockdrilla added the kind/bug label on Sep 20, 2023

rockdrilla (Author) commented:

We have already been hit by this issue, e.g. nginxinc/docker-nginx#701.

rockdrilla changed the title on Sep 20, 2023: "fragile" isolation → "partial" isolation

Luap99 commented Sep 21, 2023

@giuseppe PTAL

giuseppe added a commit to giuseppe/libpod that referenced this issue Sep 21, 2023
commit cf36470 changed the way /sys/fs/cgroup is mounted when there is
no netns, and it now honors the ro flag.  The mount was created using
a bind mount, which is a problem when using a cgroup namespace; fix
that by mounting a fresh cgroup file system.

Closes: containers#20073

Signed-off-by: Giuseppe Scrivano <[email protected]>
giuseppe (Member) commented:

thanks, opened a PR: #20086

Please be aware that it fixes only the cgroup mounted on top of /sys/fs/cgroup. The previous /sys/fs/cgroup coming from the host will still be visible in /proc/self/mountinfo. There is no way to address that: without a netns we cannot mount a fresh sysfs and are forced to bind mount it from the host, and unprivileged users can only create recursive bind mounts, so we grab /sys/fs/cgroup from the host as well.
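
This constraint can be reproduced outside Podman with util-linux unshare (a minimal sketch, assuming a cgroup v2 host and an unshare with -C/--cgroup support; this is not Podman's actual code path):

$ unshare -UrmC sh -c '
    # a fresh sysfs mount is rejected: the new userns does not own the netns
    mount -t sysfs sysfs /mnt || echo "sysfs mount denied, as expected"
    # a fresh cgroup2 mount is allowed, and it is scoped to our cgroupns
    mount -t cgroup2 cgroup2 /mnt && grep -F " - cgroup2 " /proc/self/mountinfo | tail -1
  '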

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Sep 22, 2023
github-actions bot added the "locked - please file new issue/PR" label, locked the issue as resolved, and limited conversation to collaborators on Dec 22, 2023