Running privileged systemd container in namespaced OpenShift pod fails #21008
I get the same result even if I amend the |
You have processes running in the root cgroup of the container; on cgroup v2 that prevents creating sub-cgroups and moving processes into them. One thing you could try in the container is:
and make sure no other processes are running in the root cgroup (i.e. |
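The suggestion above follows from the cgroup v2 "no internal processes" rule: a cgroup that still has member processes in its root cannot have controllers enabled for sub-cgroups. A rough sketch of what "move processes out of the root cgroup" could look like (the `init` child-cgroup name is an arbitrary choice, and the cgroup root is parameterized so the logic can be exercised safely):

```shell
# Sketch: on cgroup v2, move every process visible in the container's
# root cgroup into a child cgroup, so sub-cgroups can then be created
# and controllers enabled ("no internal processes" rule).
move_to_child_cgroup() {
  root="${1:-/sys/fs/cgroup}"
  mkdir -p "$root/init" || return 1
  # cgroup.procs accepts one PID per write; writes can fail for PIDs
  # that are outside the current PID namespace, hence the "|| true".
  while read -r pid; do
    echo "$pid" >> "$root/init/cgroup.procs" 2>/dev/null || true
  done < "$root/cgroup.procs"
}

# Inside the container one would then run:
#   move_to_child_cgroup /sys/fs/cgroup
```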
Thanks @giuseppe for those hints. When I added
to the shell command, I got
So the |
Thanks; the 0s mean that there are already processes in the cgroup, but they are not part of the current PID namespace, so you cannot access or reference them, yet they are still there :/ That looks like a Kubernetes/CRI-O error. There should not be any process in your cgroup except the ones from your container. Not sure how the
What is the output of
|
Is that expected? Shouldn't
I've used
and got
|
Weird that a privileged container has access to the host cgroup but only in read-only mode :/ Then you may need to set up a volume from the host
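A host volume of the kind suggested here might look like this fragment of the Pod spec (purely illustrative; the volume and container names are assumptions):

```yaml
spec:
  containers:
  - name: podman
    volumeMounts:
    - name: host-cgroup
      mountPath: /sys/fs/cgroup
  volumes:
  - name: host-cgroup
    hostPath:
      path: /sys/fs/cgroup
      type: Directory
```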
|
That sadly yields
and there is nothing cgroup-related in dmesg -- it ends with
|
I realized I might have misunderstood where you wanted me to make these changes. I made them in the shell script in the "outer" container, created by OpenShift / CRI-O, before running that When I put them into that
So in the container created by |
I'm probably also looking for some guidance on whether the |
I don't have a working example, as I've never tried this combination yet. I was not aware of the |
The problem is, I don't see any change in behaviour when I use that (so perhaps it is the default already?). The @giuseppe If I got you access to an OpenShift cluster set up for this, would you be willing to investigate the behaviour directly? |
Yes, that will probably help, so I can investigate what is going on. Does tomorrow morning (Europe time) work for you? |
Yes. I'll ping you. Thank you! |
After the investigation, it turned out the root cause of these failures is that CRI-O doesn't delegate the cgroup to any user in the user namespace. So even if the cgroup file system is mounted as writable, no user in the user namespace can write to it. I've opened an issue with CRI-O requesting this feature: cri-o/cri-o#7623 |
With @rata's help in cri-o/cri-o#7623, I was able to make some progress with the investigation on OpenShift. It turns out that the To use that
and the The
then needs to be replaced with
Keeping that To help with debugging, I added With these changes, I'm able to run podman in OpenShift user namespaced, and the However, since we now run the Pod's container unprivileged, running
|
@adelton in the unprivileged case, does it work if you mount an emptyDir (tmpfs) on the pod at /var/lib/containers/storage? Overlayfs inside overlayfs is problematic; that way it should be avoided. But I'm not sure if that would be enough. |
Hi, yes, we experienced something similar in our environment (not using OpenShift, but using CRI-O as the runtime). To confirm the overlayfs issue you can check dmesg, as the kernel will log something like:
There are two options to fix this: as @rata says, you can use an emptyDir (or another volume) mounted on the container storage location, or you can make fuse-overlayfs work (add |
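The emptyDir option could be sketched as the following Pod-spec fragment (the volume and container names are assumptions), which keeps podman's overlay storage off the pod's own overlayfs root:

```yaml
spec:
  containers:
  - name: podman
    volumeMounts:
    - name: containers-storage
      mountPath: /var/lib/containers/storage
  volumes:
  - name: containers-storage
    emptyDir: {}
```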
Adding
does not seem to change the outcome -- I still get the Adding
and
|
Defining a PVC
changes the
(ext4 instead of xfs), the
(the permissions are now 0755 instead of 0777 in case of the
So the question is, with a volume, what is the recommended way to get CRI-O (?) to chown the volume to match the user-namespaced uid 0 in the Pod's container, in my case
? |
(Aside from the main issue: we do successfully use overlayfs backed by xfs; per the Docker docs you need to format with Setting (This is because CRI-O's annotation-based
@rata No, Adding for example
changes the ownership on the PVC-backed volume to
and that in turn brings the behaviour on par with the
error. |
With that
and adding
to the command list before running that
So I guess the question now is -- if I seem to have a root in a user namespace, with (For the record, adding |
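To check whether the shell really is uid 0 inside a user namespace, and which capabilities it effectively holds, the standard procfs files can be inspected (nothing podman- or CRI-O-specific here):

```shell
# /proc/self/uid_map shows the user-namespace UID mapping: in the host's
# initial namespace it reads "0 0 4294967295"; in a mapped namespace the
# second field is the host-side starting UID (e.g. 300000 as seen above).
cat /proc/self/uid_map

# CapEff is the effective capability bitmask of the current process;
# a value of all f's (on the relevant bits) means "all capabilities".
grep CapEff /proc/self/status
```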
Removing
which gets easily fixed by adding The next hurdle is then
|
This is likely due to having a masked /proc and /sys. ProcMountType Unmasked (set on the pod and the alpha feature gate) will allow the mount. Although since you have CAP_SYS_ADMIN, you can also just umount the masking paths. |
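In Pod-spec terms, the first option looks roughly like this fragment (container name assumed; it requires the `ProcMountType` feature gate to be enabled on the cluster):

```yaml
spec:
  containers:
  - name: podman
    securityContext:
      procMount: Unmasked
```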
Thanks for the hints. My understanding is that alpha features are not available on OpenShift. When I added
to the list of commands, I got
Is that |
@adelton are you using crun or runc? I think the sysfs error will go away with crun, as it falls back to bind-mounting /sys when it can't mount a fresh sysfs. |
The message says
so I would assume
|
For the record, when I add
|
Breakthrough: with
and adding I need to go back and re-reproduce to make sure that the setup is really as confined as I wanted it to be. |
So on a fresh OpenShift cluster
I've verified that the following steps work:
Run
shows that we run in a user namespace and
shows the systemd in podman in the container in the Pod works. To sum up, the changes needed in the Pod were
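A hypothetical sketch of the kind of Pod additions discussed in this thread (the annotation value, capability list, and volume names are assumptions, not the exact changes):

```yaml
metadata:
  annotations:
    # CRI-O's user-namespace annotation; the mapping size is an example
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
  - name: podman
    securityContext:
      # unprivileged, with selected capabilities instead
      privileged: false
      capabilities:
        add: ["SYS_ADMIN"]
      procMount: Unmasked
    volumeMounts:
    - name: containers-storage
      mountPath: /var/lib/containers/storage
  volumes:
  - name: containers-storage
    emptyDir: {}
```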
Based on the recommendation in cri-o/cri-o#7623, I tried to avoid using On the other hand, the |
Issue Description
I am trying to get Kind (with podman) running in OpenShift rootless pods: https://github.com/adelton/kind-in-pod
I have minimized the problem to running a privileged podman container with `--cgroupns=private`, run in a privileged OpenShift Pod. It passes when run as uid 0 but fails when run user namespaced.
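A minimal sketch of such a Pod (the image names and command are assumptions, not the exact `test-podman.yaml`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-podman
spec:
  containers:
  - name: podman
    image: quay.io/podman/stable  # assumed podman-in-container image
    securityContext:
      privileged: true
    command:
    - podman
    - run
    - --rm
    - --privileged
    - --cgroupns=private
    - registry.access.redhat.com/ubi9/ubi-init  # assumed systemd image
```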
Steps to reproduce the issue
1. Have an OpenShift cluster with a regular user account (`user`) and an admin account (`admin`) with the cluster-admin role.
2. `oc new-project test-1`
3. Mark `test-1` as privileged: `oc adm policy add-scc-to-user privileged -z default -n test-1`
4. Create `test-podman.yaml`.
5. `oc apply -f test-podman.yaml`
6. `oc logs -f pod/test-podman` shows systemd is running in that Pod, started by the podman in the OpenShift Pod.
7. `oc delete -f test-podman.yaml`
8. Add `annotations` and `securityContext`:
9. `oc apply -f test-podman.yaml`
10. `oc logs -f pod/test-podman` reports about the Pod.

Describe the results you received
The output shows that the OpenShift Pod now runs user namespaced (uid 0 in the Pod is uid 300000 on the worker host) ... but then systemd fails even if `/sys/fs/cgroup` is shown mounted read-write both in the OpenShift Pod and in the podman Pod:

Describe the results you expected
No error, systemd running in that user namespace (created by OpenShift's CRI-O).
podman info output
Podman in a container: Yes
Privileged Or Rootless: Privileged
Upstream Latest Release: Yes
Additional environment details
Note that the "Privileged Or Rootless" selection that the bug report form forces me to make does not really make sense -- I am trying to run it privileged and rootless at the same time; the user namespacing is just done by CRI-O rather than by podman.
Additional information
Deterministic.