cgroupns: private cgroupns on cgroupv1 breaks --systemd #17736

Merged

giuseppe
Member

On cgroup v1 we need to mount only the systemd named hierarchy as writeable, so we configure the OCI runtime to mount /sys/fs/cgroup as read-only and on top of that bind mount /sys/fs/cgroup/systemd.

But when we use a private cgroupns, we cannot do that since we don't know the final cgroup path.
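The constraint can be sketched as a small decision function. This is only an illustration under stated assumptions, not Podman's actual code: the function name `plan_systemd_cgroup_mounts`, its arguments, and its output strings are all hypothetical.

```shell
# Hypothetical sketch of the decision described above; NOT Podman's real code.
# Arguments: cgroup version (1 or 2) and cgroup namespace mode (host or private).
plan_systemd_cgroup_mounts() {
    local version=$1 cgroupns=$2

    # Only the cgroup v1 case is covered by this sketch.
    if [ "$version" != 1 ]; then
        return 0
    fi

    # With a private cgroupns the final cgroup path is not known up front,
    # so there is nothing sensible to bind mount.
    if [ "$cgroupns" = private ]; then
        echo "error: systemd mode on cgroup v1 needs a non-private cgroupns" >&2
        return 1
    fi

    # Read-only cgroup tree, with the systemd named hierarchy writable on top.
    echo "ro /sys/fs/cgroup"
    echo "rw-bind /sys/fs/cgroup/systemd"
}
```

Calling `plan_systemd_cgroup_mounts 1 host` prints the two mounts, while `plan_systemd_cgroup_mounts 1 private` fails with a non-zero status.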

Closes: #17727

Does this PR introduce a user-facing change?

systemd mode refuses to work on cgroup v1 with a private cgroup namespace

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2023
@rhatdan
Member

rhatdan commented Mar 11, 2023

LGTM
@containers/podman-maintainers PTAL

@LewisGaul

It seems unfortunate to just error here; I would've thought there'd be a reasonable way to get this working as expected.

I left a comment on the issue and I'd be happy to discuss further. Perhaps there are slight gaps in my knowledge around the finer details of cgroups, but hopefully I can bring something to the table, having spent a lot of time looking into cgroups and how they're set up in a variety of scenarios across different container managers.

@giuseppe
Member Author

Unfortunate, but there is no easy way to fix it: we'd need changes both in Podman and in the OCI runtime.

Given that cgroup v1 is going to be deprecated at some point (already planning that for crun: containers/crun#1149), there is not much sense in trying to solve it IMO. The best fix is to move to cgroup v2.

@LewisGaul

RE: cgroups v1 support, as I understand it RHEL 8 goes EOL in 2029 and uses cgroups v1 by default? Our intention is to support cgroups v1 probably for as long as it's reasonable for users to have a host that's configured with v1 (e.g. RHEL 8 as long as it's supported).

@giuseppe
Member Author

Sure, it is supported until RHEL EOL, but that is about security fixes. This one is more of a new feature that affects two components, and implementing it correctly would also require changes to the OCI runtime specs.

With cgroup v2 you also get proper cgroup support for rootless, which is not possible with cgroup v1, where rootless containers are not really contained since you cannot apply any limit to them. Especially for rootless, there is no point in using cgroup v1 unless you are stuck with an old system.
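For anyone unsure which mode a host is in, a rough check is to look at the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes the usual layouts (`cgroup2fs` on a unified v2 host, `tmpfs` on a v1 or hybrid host); the function names are hypothetical.

```shell
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
# Assumption: cgroup2fs means unified v2, tmpfs means v1 (legacy or hybrid).
cgroup_version_from_fstype() {
    case $1 in
        cgroup2fs) echo v2 ;;
        tmpfs)     echo v1 ;;
        *)         echo unknown ;;
    esac
}

# Report the cgroup version of the running host (GNU coreutils stat).
host_cgroup_version() {
    cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

Running `host_cgroup_version` on the machine in question prints `v1`, `v2`, or `unknown`.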

@giuseppe giuseppe force-pushed the no-private-cgroupns-systemd branch from dd02698 to 8726988 Compare March 11, 2023 19:18
@giuseppe
Member Author

I made a change to the patch.

Now if you specify a /sys/fs/cgroup/systemd mount, Podman won't override it and will let you use it.
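Conceptually, the check is "does any user-supplied mount already target /sys/fs/cgroup/systemd?". A minimal sketch of that test over `-v SRC:DST[:OPTS]` style specs follows; the helper name `has_systemd_cgroup_mount` is hypothetical and this is not how Podman implements it internally.

```shell
# Hypothetical helper: succeed if any "SRC:DST[:OPTS]" volume spec
# targets /sys/fs/cgroup/systemd inside the container.
has_systemd_cgroup_mount() {
    local spec dst
    for spec in "$@"; do
        # Drop the source (everything up to the first colon) ...
        dst=${spec#*:}
        # ... then drop any trailing ":options".
        dst=${dst%%:*}
        # Ignore a single trailing slash so ".../systemd/" also matches.
        dst=${dst%/}
        if [ "$dst" = /sys/fs/cgroup/systemd ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `has_systemd_cgroup_mount /some/host/dir:/sys/fs/cgroup/systemd` succeeds, while `has_systemd_cgroup_mount /a:/data` does not.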

If you handle the cgroup yourself, you can use systemd with a new cgroup namespace:

$ systemd-run --scope --user bash
$ cat /proc/self/cgroup 
12:devices:/system.slice/sshd.service
11:pids:/user.slice/user-1000.slice/session-3.scope
10:perf_event:/
9:memory:/user.slice/user-1000.slice/session-3.scope
8:blkio:/system.slice/sshd.service
7:freezer:/
6:hugetlb:/
5:net_cls,net_prio:/
4:rdma:/
3:cpu,cpuacct:/
2:cpuset:/
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/run-rbbdf8246e7e34b8d894d94f8d4dab057.scope
0::/
$ mkdir /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user@1000.service/run-rbbdf8246e7e34b8d894d94f8d4dab057.scope/container
$ podman run --rm -ti -v /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user@1000.service/run-rbbdf8246e7e34b8d894d94f8d4dab057.scope/container:/sys/fs/cgroup/systemd --cgroupns private --systemd=always image-with-systemd /usr/sbin/init
systemd 251.13-5.fc37 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.                                                                                                                                                                                                                                                                                                
Detected architecture x86-64.
                                                                                                                                                                              
Welcome to Fedora Linux 37 (Container Image)!               
                                                                  
Not running with unified cgroup hierarchy, disabling cgroup BPF features.
Queued start job for default target graphical.target.
[  OK  ] Created slice system-getty.slice - Slice /system/getty.
[  OK  ] Created slice system-modprobe.slice - Slice /system/modprobe.
[  OK  ] Created slice user.slice - User and Session Slice.
[  OK  ] Started systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch.
[  OK  ] Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target local-fs.target - Local File Systems.
[  OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Reached target paths.target - Path Units.
[  OK  ] Reached target remote-fs.target - Remote File Systems.
[  OK  ] Reached target slices.target - Slice Units.
[  OK  ] Reached target swap.target - Swaps.                                                            
[  OK  ] Listening on systemd-initctl.socket - initctl Compatibility Named Pipe.
[  OK  ] Listening on systemd-journald-dev-log.socket - Journal Socket (/dev/log).                                                                           
[  OK  ] Listening on systemd-journald.socket - Journal Socket.                                                                                                               
[  OK  ] Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
[  OK  ] Listening on systemd-userdbd.socket - User Database Manager Socket.                                                                                                                                                                                                                                                   
         Starting ldconfig.service - Rebuild Dynamic Linker Cache...
systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
(This warning is only shown for the first unit using IP firewalling.)
         Starting systemd-journald.service - Journal Service...
         Starting systemd-network-generator.service - Generate network units from Kernel command line...
         Starting systemd-sysusers.service - Create System Users...
[  OK  ] Finished systemd-network-generator.service - Generate network units from Kernel command line.
[  OK  ] Reached target network-pre.target - Preparation for Network.
[  OK  ] Started systemd-journald.service - Journal Service.
         Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
[  OK  ] Finished systemd-sysusers.service - Create System Users.                                                                                                                                                                                                                                                             
[  OK  ] Finished systemd-journal-flush.service - Flush Journal to Persistent Storage.
         Starting systemd-tmpfiles-setup.service - Create Volatile Files and Directories...                                                                                                                                                                                                                                    
[  OK  ] Finished ldconfig.service - Rebuild Dynamic Linker Cache.
[  OK  ] Finished systemd-tmpfiles-setup.service - Create Volatile Files and Directories.
         Starting systemd-journal-catalog-update.service - Rebuild Journal Catalog...
         Starting systemd-resolved.service - Network Name Resolution...
         Starting systemd-update-utmp.service - Record System Boot/Shutdown in UTMP...
         Starting systemd-userdbd.service - User Database Manager...
[  OK  ] Finished systemd-update-utmp.service - Record System Boot/Shutdown in UTMP.
[  OK  ] Started systemd-userdbd.service - User Database Manager.
[  OK  ] Started systemd-resolved.service - Network Name Resolution.
[  OK  ] Reached target nss-lookup.target - Host and Network Name Lookups.
[  OK  ] Finished systemd-journal-catalog-update.service - Rebuild Journal Catalog.
         Starting systemd-update-done.service - Update is Completed...                                     
[  OK  ] Finished systemd-update-done.service - Update is Completed.                                
[  OK  ] Reached target sysinit.target - System Initialization.
[  OK  ] Started dnf-makecache.timer - dnf makecache --timer.     
[  OK  ] Started systemd-tmpfiles-clean.timer - Daily Cleanup of Temporary Directories.
[  OK  ] Reached target timers.target - Timer Units.           
[  OK  ] Listening on dbus.socket - D-Bus System Message Bus Socket.
[  OK  ] Reached target sockets.target - Socket Units.
[  OK  ] Reached target basic.target - Basic System.                            
         Starting systemd-logind.service - User Login Management...               
         Starting systemd-user-sessions.service - Permit User Sessions...
         Starting dbus-broker.service - D-Bus System Message Bus...                     
[  OK  ] Finished systemd-user-sessions.service - Permit User Sessions.     
[  OK  ] Started console-getty.service - Console Getty.             
[  OK  ] Reached target getty.target - Login Prompts.                                                                  
[  OK  ] Started dbus-broker.service - D-Bus System Message Bus.     
[  OK  ] Started systemd-logind.service - User Login Management.
[  OK  ] Reached target multi-user.target - Multi-User System.                                          
[  OK  ] Reached target graphical.target - Graphical Interface.    
         Starting systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP...            
[  OK  ] Finished systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP.
                                                            
Fedora Linux 37 (Container Image)                                                       
Kernel 4.18.0-425.13.1.el8_7.x86_64 on an x86_64 (console)       
                                                                                      
9ebdff257048 login:                                                                        
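The manual steps in the session above (read the name=systemd path from /proc/self/cgroup, then create a child cgroup under it) could be scripted roughly as follows. This is a sketch under assumptions: the function name `systemd_cgroup_path` is hypothetical, and the host is assumed to mount the named hierarchy at /sys/fs/cgroup/systemd as in the session.

```shell
# Print the cgroup path of the name=systemd hierarchy from a
# /proc/<pid>/cgroup style file (default: the current process).
systemd_cgroup_path() {
    local file=${1:-/proc/self/cgroup}
    # Each line is "ID:controllers:path"; strip the first two fields and
    # keep the rest verbatim, in case the path itself contains a colon.
    awk -F: '$2 == "name=systemd" { sub(/^[^:]*:[^:]*:/, ""); print; exit }' "$file"
}

# Usage sketch, mirroring the session above (run inside the systemd-run scope):
#   cg=/sys/fs/cgroup/systemd$(systemd_cgroup_path)/container
#   mkdir "$cg"
#   podman run --rm -ti -v "$cg:/sys/fs/cgroup/systemd" \
#       --cgroupns private --systemd=always image-with-systemd /usr/sbin/init
```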

@giuseppe giuseppe force-pushed the no-private-cgroupns-systemd branch from 8726988 to c47b780 Compare March 11, 2023 19:44
@TomSweeneyRedHat
Member

LGTM

@test "podman --systemd fails on cgroup v1 with a private cgroupns" {
skip_if_cgroupsv2

run_podman 126 run --systemd=always --cgroupns=private $IMAGE true
Member

Please add assert "$output" =~ "blah blah blah blah". 126 can happen for many reasons, it is important to verify that it's happening here for the expected reason.

@umohnani8
Member

LGTM

@mheon
Member

mheon commented Mar 13, 2023 via email

@giuseppe giuseppe force-pushed the no-private-cgroupns-systemd branch from c47b780 to 97f6c5b Compare March 14, 2023 10:39
the error is already clear.

Signed-off-by: Giuseppe Scrivano <[email protected]>
On cgroup v1 we need to mount only the systemd named hierarchy as
writeable, so we configure the OCI runtime to mount /sys/fs/cgroup as
read-only and on top of that bind mount /sys/fs/cgroup/systemd.

But when we use a private cgroupns, we cannot do that since we don't
know the final cgroup path.

Also, do not override the mount if there is already one for
/sys/fs/cgroup/systemd.

Closes: containers#17727

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe giuseppe force-pushed the no-private-cgroupns-systemd branch from 97f6c5b to 2d1f4a8 Compare March 14, 2023 11:35
@edsantiago
Member

/lgtm
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 14, 2023
@giuseppe
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2023
@openshift-merge-robot openshift-merge-robot merged commit 08cd180 into containers:main Mar 14, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Development

Successfully merging this pull request may close these issues.

Systemd in container hits critical errors when using private cgroup namespace on CentOS 8
8 participants