Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the node image respect rootless detection from the docker/podman info #2492

Closed
wants to merge 11 commits into from
41 changes: 17 additions & 24 deletions images/base/files/usr/local/bin/entrypoint
Original file line number Diff line number Diff line change
Expand Up @@ -24,41 +24,34 @@ set -o pipefail
userns=""
if grep -Eqv "0[[:space:]]+0[[:space:]]+4294967295" /proc/self/uid_map; then
userns="1"
echo 'INFO: running in a user namespace (experimental)'
fi

rootless=""
if [[ -n "${KIND_ROOTLESS-}" ]]; then
rootless=1
Comment on lines +29 to +31
Copy link
Contributor Author

@felipecrs felipecrs Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make the newer images to keep working with kind 0.11.0 and 0.11.1, I can:

  1. Always pass KIND_ROOTLESS, even when it's false.
  2. If KIND_ROOTLESS is present, respect it (be it false or true).
  3. If KIND_ROOTLESS is not present, fallback to the old detection method.

Please let me know if you would like me to implement this safeguard.

fi

validate_userns() {
if [[ -z "${userns}" ]]; then
return
fi
echo 'INFO: running in a user namespace (experimental)' >&2

local nofile_hard
nofile_hard="$(ulimit -Hn)"
local nofile_hard_expected="64000"
if [[ "${nofile_hard}" -lt "${nofile_hard_expected}" ]]; then
echo "WARN: UserNS: expected RLIMIT_NOFILE to be at least ${nofile_hard_expected}, got ${nofile_hard}" >&2
fi

if [[ ! -f "/sys/fs/cgroup/cgroup.controllers" ]]; then
echo "ERROR: UserNS: cgroup v2 needs to be enabled" >&2
exit 1
fi
for f in cpu memory pids; do
if ! grep -qw $f /sys/fs/cgroup/cgroup.controllers; then
echo "ERROR: UserNS: $f controller needs to be delegated" >&2
exit 1
fi
done
}

configure_containerd() {
local snapshotter=${KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER:-}
if [[ -n "$userns" ]]; then
# userns (rootless) configs

# Adjust oomScoreAdj
sed -i 's/restrict_oom_score_adj = false/restrict_oom_score_adj = true/' /etc/containerd/config.toml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still about userns, not really about rootless.

When you are inside a userns you have a boundary on oom_score_adj.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you can't use real overlayfs snapshotter in userns, so fuse-overlayfs should be still chosen by default
(Unless kernel >= 5.12)

Copy link
Contributor Author

@felipecrs felipecrs Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you can't use real overlayfs snapshotter in userns, so fuse-overlayfs should be still chosen by default
(Unless kernel >= 5.12)

That's not true either. With sysbox, we can. This is not coupled with being in userns, this is coupled with being in docker daemon rootless mode.

In fact, if fuse-overlayfs is chosen, it does not work: #2492 (comment)

Copy link
Contributor Author

@felipecrs felipecrs Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you are inside a userns you have a boundary on oom_score_adj.

About this I can't really tell if it's specific to user namespace or being in rootless mode. @rodnymolina any advice?

In either case, it does not trigger any issue that I have noticed. That's why I left it being ran even if not in rootless mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true either. With sysbox, we can. This is not coupled with being in userns, this is coupled with being in docker daemon rootless mode.

Ubuntu kernel has been patched to allow mounting overlayfs in userns for a long time.

If you aren't using Ubuntu kernel, that seems rather coupled with sysbox.

I'd rather add KIND_SYSBOX env var for sysbox-specific logic.
#2492 (comment)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you can't use real overlayfs snapshotter in userns

Not necessarily; Sysbox creates containers with user-ns, but allows you to mount overlayfs in them (even if kernel < 5.12). How does it do it? It intercepts the mount syscall via seccomp-notify, vets the syscall, and performs the mount on behalf of the container.

And this is the crux of this patch: running in a user-ns does not necessarily mean you are restricted in the ways you may think. If you are running in a user-ns created simply with unshare(), those restrictions do apply. But if you are running in a user-ns inside a container, those restrictions may not apply as the underlying runtime may have setup the container in such a way that the restrictions are lifted (as Sysbox does).

This is why @felipecrs wants to tie the entrypoint checks to the name "rootless" rather than "userns". In this case "rootless" means running inside a userns but not inside a container (e.g., unshare()).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but !rootless doesn't mean mount is intercepted to allow mounting overlay inside userns.

So the right thing would be to check whether the runtime is sysbox, not about checking whether the daemon is running in rootless mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Alternatively we can just test whether overlay is mountable in the entrypoint script, but I wouldn't recommend that, as Debian (not Ubuntu) implementation of permit_mounts_in_userns is known to be broken 😥 moby/moby#42302 )

Copy link

@ctalledo ctalledo Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AkihiroSuda.

So the right thing would be to check whether the runtime is sysbox,

I suggest we avoid that, because there may be other runtimes that enable mounting of overlayfs inside a container that uses the userns (not sure if LXD allows this for example). I think the change should be runtime agnostic.

Alternatively we can just test whether overlay is mountable in the entrypoint script,

I like this idea better because that way the change is more generic: it enables the entrypoint to setup things correctly according to the limitations of the environment in which it runs.

For example, if it runs in a userns created via unshare() on kernels < 5.12, it will use fuse-overlayfs. But on kernels >= 5.12, it will use overlayfs directly. Similarly, if it runs in a container setup by a runtime that has enables overlayfs mounting inside the user-ns (e.g., Sysbox), things will also work fine.

Debian (not Ubuntu) implementation of permit_mounts_in_userns is known to be broken

Is there a way to account for that in the entrypoint checking logic?


fi
if [[ -n "$rootless" ]]; then
# Use fuse-overlayfs by default: https://github.com/kubernetes-sigs/kind/issues/2275
snapshotter="fuse-overlayfs"
else
Expand Down Expand Up @@ -102,15 +95,15 @@ fix_mount() {
sync
fi

if [[ -z "${userns}" ]]; then
echo 'INFO: remounting /sys read-only'
# systemd-in-a-container should have read only /sys
# https://systemd.io/CONTAINER_INTERFACE/
# however, we need other things from `docker run --privileged` ...
# and this flag also happens to make /sys rw, amongst other things
#
# This step is skipped when running inside UserNS, because it fails with EACCES.
mount -o remount,ro /sys
echo 'INFO: remounting /sys read-only'
# systemd-in-a-container should have read only /sys
# https://systemd.io/CONTAINER_INTERFACE/
# however, we need other things from `docker run --privileged` ...
# and this flag also happens to make /sys rw, amongst other things
#
# This step is ignored when running inside UserNS, because it may fail with EACCES.
if ! mount -o remount,ro /sys && [[ -n "$userns" ]]; then
echo 'INFO: UserNS: ignoring mount fail' >&2
fi

echo 'INFO: making mounts shared' >&2
Expand Down
9 changes: 9 additions & 0 deletions pkg/cluster/internal/providers/docker/provision.go
Original file line number Diff line number Diff line change
Expand Up @@ -243,6 +243,15 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args...,
)

// let the container know that it's running in rootless mode
info, err := info()
if err != nil {
return nil, err
}
if info.Rootless {
args = append(args, "-e", "KIND_ROOTLESS=1")
}

// convert mounts and port mappings to container run args
args = append(args, generateMountBindings(node.ExtraMounts...)...)
mappingArgs, err := generatePortMappings(clusterIPFamily, node.ExtraPortMappings...)
Expand Down
2 changes: 1 addition & 1 deletion pkg/cluster/internal/providers/podman/provider.go
Original file line number Diff line number Diff line change
Expand Up @@ -407,7 +407,7 @@ func info(logger log.Logger) (*providers.ProviderInfo, error) {
SupportsPidsLimit: true, // not guaranteed to be correct
SupportsCPUShares: true, // not guaranteed to be correct
}
if info.Rootless {
if logger != nil && info.Rootless {
logger.Warn("Cgroup controller detection is not implemented for Podman. " +
"If you see cgroup-related errors, you might need to set systemd property \"Delegate=yes\", see https://kind.sigs.k8s.io/docs/user/rootless/")
}
Expand Down
9 changes: 9 additions & 0 deletions pkg/cluster/internal/providers/podman/provision.go
Original file line number Diff line number Diff line change
Expand Up @@ -206,6 +206,15 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args...,
)

// let the container know that it's running in rootless mode
info, err := info(nil)
if err != nil {
return nil, err
}
if info.Rootless {
args = append(args, "-e", "KIND_ROOTLESS=1")
}

// convert mounts and port mappings to container run args
args = append(args, generateMountBindings(node.ExtraMounts...)...)
mappingArgs, err := generatePortMappings(clusterIPFamily, node.ExtraPortMappings...)
Expand Down