Make use of systemd delegate cgroup when possible. #18211

Open
shoenig opened this issue Aug 15, 2023 · 2 comments
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/cgroups cgroups issues theme/client type/enhancement

Comments

@shoenig
Member

shoenig commented Aug 15, 2023

Per https://systemd.io/CGROUP_DELEGATION/ (at the bottom)

🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e. cgroups you haven’t set Delegate= in. Specifically: 🔥 don’t create your own cgroups below the root cgroup 🔥.

Currently Nomad does exactly this: it creates the nomad.slice cgroup under the root cgroup regardless of whether systemd is in use. We should modify our Linux packaging to set Delegate= in the systemd unit file so that we are in line with the expected usage of systemd.

However, we'll need to continue supporting the mode of operation we have today: not all Linux operating systems use systemd (and thus have no delegate mechanism), and not all users use our Linux packaging. We'll also want to update our production documentation to make recommendations for such users.
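
To illustrate the two modes described above, here is a minimal Go sketch of how a client might pick the parent cgroup for its tasks. It is not Nomad's implementation. The helper names `ownCgroupV2` and `parentCgroup` are hypothetical, the mount point `/sys/fs/cgroup` is assumed, and the sketch does not verify that `Delegate=` is actually set on the unit; it only reads the `0::<path>` entry that cgroups v2 exposes in `/proc/self/cgroup`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cgroupRoot is the conventional cgroups v2 mount point (an assumption).
const cgroupRoot = "/sys/fs/cgroup"

// ownCgroupV2 extracts the unified-hierarchy path of the current process
// from the contents of /proc/self/cgroup. The v2 entry has the form "0::<path>".
func ownCgroupV2(procCgroup string) (string, bool) {
	for _, line := range strings.Split(procCgroup, "\n") {
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), true
		}
	}
	return "", false
}

// parentCgroup is a hypothetical helper. If the process lives in a non-root
// cgroup (for example a systemd service unit, assumed here to be delegated),
// it returns that cgroup. Otherwise it falls back to the nomad.slice
// directory under the root cgroup, which is today's behavior.
func parentCgroup(procCgroup string) string {
	own, ok := ownCgroupV2(procCgroup)
	if !ok || own == "/" {
		return filepath.Join(cgroupRoot, "nomad.slice")
	}
	return filepath.Join(cgroupRoot, own)
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup:", err)
		return
	}
	fmt.Println("parent cgroup:", parentCgroup(string(data)))
}
```

A real implementation would also have to confirm that the cgroup is writable by the agent and respect the cgroups v2 rule that processes live only in leaf cgroups.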

@shoenig shoenig added type/enhancement theme/client stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Aug 15, 2023
@shoenig shoenig changed the title Make use of systemd delegeate cgroup when possible. Make use of systemd delegate cgroup when possible. Aug 15, 2023
@shoenig shoenig added the theme/cgroups cgroups issues label Aug 15, 2023
@tgross
Member

tgross commented Aug 7, 2024

Another challenge with setting delegation is that we have a nomad.slice and slices aren't supported for delegation; we'd need to create a scope below nomad.slice and delegate that.

@tgross tgross added the hcc/jira label Aug 8, 2024
@tgross
Member

tgross commented Aug 8, 2024

This was the bit from that document:

Let’s stress one thing: delegation is available on scope and service units only. It’s expressly not available on slice units. Why? Because slice units are our inner nodes of the cgroup trees and we freely attach services and scopes to them. If we’d allow delegation on slice units then this would mean that both systemd and your own manager would create/delete cgroups below the slice unit and that conflicts with the single-writer rule.

In an experiment I'm hacking on, it turns out this doesn't really matter because we're not creating a "slice unit"; we're just creating our own cgroup directory that happens to be called "slice". If we did want to create a slice unit in the package (which we will if we want to allow less-privileged Nomad agents), we'd instead want to have 2 Delegate= fields pointing to the shared.slice and reserved.slice not-really-slices below that. I've smoke-tested this so far, but it needs further investigation.
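
As a rough illustration of the scope/service rule quoted above, the Go sketch below classifies a cgroup path by the suffix of its last element. The function name `delegatableUnit` and the example paths are hypothetical. The check looks at names only: as noted above, a directory that merely ends in ".slice" is not necessarily a systemd slice unit, so this is a heuristic and not a query of systemd's actual unit state.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// delegatableUnit reports whether the last element of a cgroup path is named
// like a systemd service or scope unit, the only unit types on which
// delegation is available. It inspects the name only (a heuristic).
func delegatableUnit(cgroupPath string) bool {
	leaf := path.Base(cgroupPath)
	return strings.HasSuffix(leaf, ".service") || strings.HasSuffix(leaf, ".scope")
}

func main() {
	// Example paths are illustrative, not taken from a real system.
	for _, p := range []string{
		"/system.slice/nomad.service",
		"/nomad.slice",
		"/nomad.slice/nomad.scope",
	} {
		fmt.Printf("%-32s delegatable by name: %v\n", p, delegatableUnit(p))
	}
}
```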

tgross added a commit that referenced this issue Aug 14, 2024
Nomad clients manage a cpuset cgroup for each task to reserve or share CPU
cores. But Docker owns its own cgroups, and attempting to set a parent cgroup
that Nomad manages runs into conflicts with how runc manages cgroups via
systemd. Therefore Nomad must run as root in order for cpuset management to ever
be compatible with Docker.

However, some users running in unsupported configurations felt that the changes
we made in Nomad 1.7.0 to ensure Nomad was running correctly represented a
regression. This changeset disables cpuset management for non-root Nomad
clients. When running Nomad as non-root, the driver will no longer reconcile
cpusets with Nomad and `resources.cores` will behave incorrectly (but the driver
will still run).

Although this is one small step along the way to supporting a rootless Nomad
client, running Nomad as non-root is still unsupported. This PR is insufficient
by itself to have a secure and properly-working rootless Nomad client.

Ref: #18211
Ref: #13669
Ref: https://hashicorp.atlassian.net/browse/NET-10652
Ref: https://github.com/opencontainers/runc/blob/main/docs/systemd.md
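
The gating described in this commit message can be sketched in Go roughly as follows. This is an assumption about the shape of the check, not the code from the changeset, and `manageCpusets` is a hypothetical name; the only real call used is `os.Geteuid`.

```go
package main

import (
	"fmt"
	"os"
)

// manageCpusets is a hypothetical helper: cpuset reconciliation is enabled
// only when the client runs with an effective UID of 0 (root).
func manageCpusets(euid int) bool {
	return euid == 0
}

func main() {
	if manageCpusets(os.Geteuid()) {
		fmt.Println("running as root: cpuset management enabled")
		return
	}
	// As the commit message notes, resources.cores will not behave
	// correctly in this mode, but the driver still runs.
	fmt.Println("running as non-root: cpuset management disabled")
}
```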
tgross added a commit that referenced this issue Aug 14, 2024
During Nomad client initialization with cgroups v2, we assert that the required
cgroup controllers are available in the root `cgroup.subtree_control` file by
idempotently writing to the file. But if Nomad is running with delegated
cgroups, this will fail file permissions checks even if the subtree control file
already has the controllers we need.

Update the initialization to first check if the controllers are missing before
attempting to write to them. This allows cgroup delegation so long as the
cluster administrator has pre-created a Nomad-owned cgroup tree and set the
`Delegate` option in a systemd override. If not, initialization fails in the
existing way.

Although this is one small step along the way to supporting a rootless Nomad
client, running Nomad as non-root is still unsupported. I've intentionally not
documented setting up cgroup delegation in this PR, as this PR is insufficient
by itself to have a secure and properly-working rootless Nomad client.

Ref: #18211
Ref: #13669
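
The check-before-write approach described in this commit message can be sketched in Go as below. It is a simplified illustration, not the code from the changeset: the helper names `missingControllers` and `enableRequest` are hypothetical, the list of required controllers is an assumption, and `main` only reports what it would write instead of writing.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingControllers returns the required controllers that are not listed in
// the contents of a cgroup.subtree_control file (space-separated names).
func missingControllers(subtreeControl string, required []string) []string {
	enabled := make(map[string]bool)
	for _, name := range strings.Fields(subtreeControl) {
		enabled[name] = true
	}
	var missing []string
	for _, name := range required {
		if !enabled[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

// enableRequest formats controllers in the "+name" form that
// cgroup.subtree_control accepts on write.
func enableRequest(missing []string) string {
	parts := make([]string, len(missing))
	for i, name := range missing {
		parts[i] = "+" + name
	}
	return strings.Join(parts, " ")
}

func main() {
	const file = "/sys/fs/cgroup/cgroup.subtree_control"
	// Assumed controller list, for illustration only.
	required := []string{"cpuset", "cpu", "io", "memory", "pids"}

	data, err := os.ReadFile(file)
	if err != nil {
		fmt.Println("cannot read subtree control file:", err)
		return
	}
	missing := missingControllers(string(data), required)
	if len(missing) == 0 {
		// Nothing to write, so a read-only (delegated) setup passes.
		fmt.Println("all required controllers already enabled")
		return
	}
	fmt.Printf("would write %q to %s\n", enableRequest(missing), file)
}
```

Reading first means an agent without write permission on the root subtree control file still initializes when the administrator has already enabled the controllers.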
tgross added a commit that referenced this issue Aug 14, 2024
…23803)
