Make the node image respect rootless detection from the docker/podman info #2492
Changes from all commits
53ea2ae
9a95af7
761109c
620b7d3
95f7540
382a570
e55e1dd
cd02096
8d100f6
54489fe
d0686ec
@@ -24,41 +24,34 @@ set -o pipefail
userns=""
if grep -Eqv "0[[:space:]]+0[[:space:]]+4294967295" /proc/self/uid_map; then
  userns="1"
  echo 'INFO: running in a user namespace (experimental)'
fi

rootless=""
if [[ -n "${KIND_ROOTLESS-}" ]]; then
  rootless=1
fi

validate_userns() {
  if [[ -z "${userns}" ]]; then
    return
  fi
  echo 'INFO: running in a user namespace (experimental)' >&2

  local nofile_hard
  nofile_hard="$(ulimit -Hn)"
  local nofile_hard_expected="64000"
  if [[ "${nofile_hard}" -lt "${nofile_hard_expected}" ]]; then
    echo "WARN: UserNS: expected RLIMIT_NOFILE to be at least ${nofile_hard_expected}, got ${nofile_hard}" >&2
  fi

  if [[ ! -f "/sys/fs/cgroup/cgroup.controllers" ]]; then
    echo "ERROR: UserNS: cgroup v2 needs to be enabled" >&2
    exit 1
  fi
  for f in cpu memory pids; do
    if ! grep -qw $f /sys/fs/cgroup/cgroup.controllers; then
      echo "ERROR: UserNS: $f controller needs to be delegated" >&2
      exit 1
    fi
  done
}

configure_containerd() {
  local snapshotter=${KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER:-}
  if [[ -n "$userns" ]]; then
    # userns (rootless) configs

    # Adjust oomScoreAdj
    sed -i 's/restrict_oom_score_adj = false/restrict_oom_score_adj = true/' /etc/containerd/config.toml
This is still about userns, not really about rootless. When you are inside a userns you have a boundary on

And you can't use real overlayfs snapshotter in userns, so fuse-overlayfs should still be chosen by default

That's not true either. With sysbox, we can. This is not coupled with being in userns, this is coupled with being in docker daemon rootless mode. In fact, if

About this I can't really tell if it's specific to user namespace or being in rootless mode. @rodnymolina any advice? In either case, it does not trigger any issue that I have noticed. That's why I left it being run even if not in rootless mode.

The Ubuntu kernel has been patched to allow mounting overlayfs in userns for a long time. If you aren't using the Ubuntu kernel, that seems rather coupled with sysbox. I'd rather add

Not necessarily; Sysbox creates containers with user-ns, but allows you to mount overlayfs in them (even if kernel < 5.12). How does it do it? It intercepts the mount syscall via seccomp-notify, vets the syscall, and performs the mount on behalf of the container. And this is the crux of this patch: running in a user-ns does not necessarily mean you are restricted in the ways you may think. If you are running in a user-ns created simply with unshare(), those restrictions do apply. But if you are running in a user-ns inside a container, those restrictions may not apply, as the underlying runtime may have set up the container in such a way that the restrictions are lifted (as Sysbox does). This is why @felipecrs wants to tie the entrypoint checks to the name "rootless" rather than "userns". In this case "rootless" means running inside a userns but not inside a container (e.g., unshare()).

True, but So the right thing would be to check whether the runtime is sysbox, rather than whether the daemon is running in rootless mode.

(Alternatively we can just test whether

Thanks @AkihiroSuda.
I suggest we avoid that, because there may be other runtimes that enable mounting of overlayfs inside a container that uses the userns (not sure if LXD allows this, for example). I think the change should be runtime agnostic.
I like this idea better because that way the change is more generic: it enables the entrypoint to set things up correctly according to the limitations of the environment in which it runs. For example, if it runs in a userns created via unshare() on kernels < 5.12, it will use fuse-overlayfs. But on kernels >= 5.12, it will use overlayfs directly. Similarly, if it runs in a container set up by a runtime that has enabled overlayfs mounting inside the user-ns (e.g., Sysbox), things will also work fine.
Is there a way to account for that in the entrypoint checking logic?
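
One runtime-agnostic way to approach the question above might be to probe for the capability directly: try a throwaway overlayfs mount and choose the snapshotter from the result. The sketch below is only an illustration of that idea, not code from this PR. The function names `can_mount_overlayfs` and `choose_snapshotter` are hypothetical, and the probe assumes the scratch directory lives on a filesystem that overlayfs accepts as an upper layer (if the scratch directory is itself on overlayfs, the probe can fail even where overlayfs mounts are otherwise allowed).

```shell
# Hypothetical helper (not part of this PR): returns 0 if an overlayfs mount
# succeeds in a scratch directory, non-zero otherwise.
can_mount_overlayfs() {
  local probe_dir
  local probe_rc
  probe_rc=1
  probe_dir="$(mktemp -d)" || return 1
  mkdir -p "${probe_dir}/lower" "${probe_dir}/upper" "${probe_dir}/work" "${probe_dir}/merged"
  # Attempt the mount; any failure (EPERM, missing module, ...) counts as "no".
  if mount -t overlay overlay \
    -o "lowerdir=${probe_dir}/lower,upperdir=${probe_dir}/upper,workdir=${probe_dir}/work" \
    "${probe_dir}/merged" 2>/dev/null; then
    umount "${probe_dir}/merged" && probe_rc=0
  fi
  rm -rf "${probe_dir}"
  return "${probe_rc}"
}

# Hypothetical helper: an explicitly requested snapshotter always wins;
# otherwise use overlayfs when the probe succeeds and fuse-overlayfs when not.
choose_snapshotter() {
  local requested
  requested="${1:-}"
  if [ -n "${requested}" ]; then
    printf '%s\n' "${requested}"
  elif can_mount_overlayfs; then
    printf '%s\n' "overlayfs"
  else
    printf '%s\n' "fuse-overlayfs"
  fi
}
```

In the entrypoint this could be called as `snapshotter="$(choose_snapshotter "${KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER:-}")"`, which would cover unshare() on older kernels, kernels >= 5.12, and runtimes such as Sysbox without naming any of them.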

  fi
  if [[ -n "$rootless" ]]; then
    # Use fuse-overlayfs by default: https://github.com/kubernetes-sigs/kind/issues/2275
    snapshotter="fuse-overlayfs"
  else

@@ -102,15 +95,15 @@ fix_mount() {
     sync
   fi

-  if [[ -z "${userns}" ]]; then
-    echo 'INFO: remounting /sys read-only'
-    # systemd-in-a-container should have read only /sys
-    # https://systemd.io/CONTAINER_INTERFACE/
-    # however, we need other things from `docker run --privileged` ...
-    # and this flag also happens to make /sys rw, amongst other things
-    #
-    # This step is skipped when running inside UserNS, because it fails with EACCES.
-    mount -o remount,ro /sys
+  echo 'INFO: remounting /sys read-only'
+  # systemd-in-a-container should have read only /sys
+  # https://systemd.io/CONTAINER_INTERFACE/
+  # however, we need other things from `docker run --privileged` ...
+  # and this flag also happens to make /sys rw, amongst other things
+  #
+  # This step is ignored when running inside UserNS, because it may fail with EACCES.
+  if ! mount -o remount,ro /sys && [[ -n "$userns" ]]; then
+    echo 'INFO: UserNS: ignoring mount fail' >&2
   fi

   echo 'INFO: making mounts shared' >&2

In order to keep the newer images working with kind 0.11.0 and 0.11.1, I can:
- KIND_ROOTLESS, even when it's false.
- If KIND_ROOTLESS is present, respect it (be it false or true).
- If KIND_ROOTLESS is not present, fall back to the old detection method.

Please let me know if you would like me to implement this safeguard.
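
The safeguard proposed above could look roughly like the following. This is a hedged sketch rather than the PR's implementation: the function name `detect_rootless`, the optional path argument (added only so the fallback can be exercised against a sample file), and the literal values `true` and `false` for KIND_ROOTLESS are assumptions. The fallback reuses the uid_map check that already appears in the entrypoint.

```shell
# Hypothetical sketch of the proposed safeguard; the function name and the
# "true"/"false" values are assumptions, not taken from the PR.
# Prints "1" when rootless handling should be enabled, and nothing otherwise.
detect_rootless() {
  local uid_map_file
  uid_map_file="${1:-/proc/self/uid_map}"

  # ${KIND_ROOTLESS+set} is non-empty whenever the variable is set at all,
  # even if its value is empty or "false".
  if [ -n "${KIND_ROOTLESS+set}" ]; then
    # KIND_ROOTLESS is present: respect it, be it "true" or "false".
    if [ "${KIND_ROOTLESS}" = "true" ]; then
      echo "1"
    fi
    return 0
  fi

  # KIND_ROOTLESS is not present: fall back to the old detection method,
  # which treats any uid_map other than the full identity mapping as a userns.
  if grep -Eqv "0[[:space:]]+0[[:space:]]+4294967295" "${uid_map_file}"; then
    echo "1"
  fi
}
```

The entrypoint could then use `rootless="$(detect_rootless)"`, so that a launcher which does not set KIND_ROOTLESS keeps getting the previous behaviour.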