cpuset support with cgroupsv2 #11289

notnoop · 2021-10-08T15:34:12Z

The Reserved CPUs feature in Nomad 1.1 relies on cpuset cgroup capabilities, and uses a technique that is no longer supported in cgroup-v2.

Reserved CPUs are dedicated to their tasks: other tasks may not access them. Nomad achieves this by using the cpuset cgroup controller. Nomad creates a /nomad/shared cgroup for all non-reserving tasks, which starts unconstrained. If a task with reserved CPUs is placed on the host, Nomad removes the reserved core from this shared cgroup, which prevents already-running tasks from running on that core.

Thus, non-reserving tasks belong to at least two cgroups: the /nomad/shared cpuset cgroup as well as the cgroup that's created by the task driver. For example, a docker task process has the following cgroup assignment (note the cpuset value):

cat /proc/11050/cgroup
12:devices:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
11:pids:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
10:net_cls,net_prio:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
9:freezer:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
8:rdma:/
7:perf_event:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
6:cpuset:/nomad/shared
5:cpu,cpuacct:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
4:memory:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
3:blkio:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
2:hugetlb:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
1:name=systemd:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
0::/system.slice/containerd.service
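
To make the split membership easier to see, here is a minimal sketch that parses output like the listing above into a controller-to-path mapping. The function name `parse_proc_cgroup` is made up for illustration, and the container ID in the sample is shortened for readability.

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path.

    Each line has the form `hierarchy-ID:controller-list:cgroup-path`.
    The cgroup v2 entry has an empty controller list and is stored under "".
    """
    membership = {}
    for line in text.strip().splitlines():
        # Split on the first two colons only; the path is kept intact.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Shortened sample in the same format as /proc/<pid>/cgroup.
sample = """\
6:cpuset:/nomad/shared
5:cpu,cpuacct:/docker/10234c078fce
0::/system.slice/containerd.service
"""

membership = parse_proc_cgroup(sample)
print(membership["cpuset"])  # the Nomad-managed cpuset cgroup
print(membership["cpu"])     # the driver-managed cgroup
```

The cpuset controller points at a different path than every other v1 controller, which is exactly the arrangement a single unified hierarchy cannot express.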

This technique relies on cgroup-v1 hierarchy flexibility. cgroup-v2 no longer permits a process to belong to multiple cgroups! We will need to rethink this mechanism.

Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers against
different hierarchies was intended to allow great flexibility for
application design. In practice, though, the flexibility turned
out to be less useful than expected, and in many cases added
complexity. Therefore, in cgroups v2, all available controllers
are mounted against a single hierarchy.

From https://man7.org/linux/man-pages/man7/cgroups.7.html

Oddly enough, Nomad uses the same pattern for reserving tasks and creates a separate cpuset cgroup for them, instead of setting the cpuset in the driver-managed cgroup. I'm unsure what the motivation was; if it's deemed unnecessary, we can have the drivers set the cpuset settings appropriately.


tgross · 2021-11-03T14:56:22Z

I think this can be solved in a way that still works with Docker by leaning on the --cgroup-parent flag, which we can set in the HostConfig. This field has been available in the Docker API since API v1.18 (at least... that's as far back as their docs go), and that's been available in Docker since Docker 1.10 shipped in early 2016.

This way we don't need to move the process into the new cgroup and can instead create it under the correct hierarchy and only change the cgroup controllers.

We set up the /nomad cgroup (rooted under /sys/fs/cgroup/) with no processes in it (this might be named differently if the cgroup_parent name is set on the client).
We set up the /nomad/shared cgroup with cpuset set to all CPUs.
We set up the /nomad/reserved cgroup with no cpuset set.
When a task without cores is created, it's created in the /nomad/shared/<alloc ID>/<task name> cgroup without any cpuset (cgroup_parent = /nomad/shared).
When a task with cores is created, it's created in the /nomad/reserved/<alloc ID>/<task name> cgroup with the appropriate cpuset (cgroup_parent = /nomad/reserved), and we update the /nomad/shared cgroup to remove that CPU.

So for example, assume we have 2 tasks with cpu = 1024 and memory = 1024 on an 8-core machine. The cgroup hierarchy looks like this:

/nomad
├── /nomad/shared: cpuset:0-7
│   ├── /nomad/shared/alloc_id_1/taskname: cpu:1024,mem:1024 (inherits cpuset:0-7)
│   └── /nomad/shared/alloc_id_2/taskname: cpu:1024,mem:1024 (inherits cpuset:0-7)
└── /nomad/reserved:
    (no children)

Then we add a new task with cores = 1 and memory = 1024. The new cgroup hierarchy looks like this:

/nomad
├── /nomad/shared: cpuset:1-7
│   ├── /nomad/shared/alloc_id_1/taskname: cpu:1024,mem:1024 (inherits cpuset:1-7)
│   └── /nomad/shared/alloc_id_2/taskname: cpu:1024,mem:1024 (inherits cpuset:1-7)
└── /nomad/reserved:
    └── /nomad/reserved/alloc_id_3: cpuset:0,mem:1024
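
The bookkeeping behind this example can be sketched roughly as follows. This only models the accounting, not any real cgroup writes. `CpusetPlan` and `format_cpuset` are hypothetical names, and picking the lowest-numbered free cores is an assumption made so the result matches the example above.

```python
def format_cpuset(cores):
    """Render core IDs in cpuset list syntax, e.g. {1, 2, 3, 5} -> "1-3,5"."""
    ranges = []
    for core in sorted(cores):
        if ranges and core == ranges[-1][1] + 1:
            ranges[-1][1] = core          # extend the current run
        else:
            ranges.append([core, core])   # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


class CpusetPlan:
    """Tracks which cores remain in the shared pool (hypothetical helper)."""

    def __init__(self, total_cores):
        self.shared = set(range(total_cores))
        self.reserved = {}  # task cgroup path -> set of reserved cores

    def place_shared(self, alloc_id, task):
        # Shared tasks set no cpuset of their own; they inherit the parent's.
        return f"/nomad/shared/{alloc_id}/{task}"

    def place_reserved(self, alloc_id, task, count):
        if count > len(self.shared):
            raise ValueError("not enough unreserved cores")
        # Assumption: take the lowest-numbered free cores.
        cores = set(sorted(self.shared)[:count])
        self.shared -= cores
        path = f"/nomad/reserved/{alloc_id}/{task}"
        self.reserved[path] = cores
        return path, format_cpuset(cores)


plan = CpusetPlan(total_cores=8)
plan.place_shared("alloc_id_1", "taskname")
plan.place_shared("alloc_id_2", "taskname")
path, cpuset = plan.place_reserved("alloc_id_3", "taskname", count=1)
print(path, cpuset)                # reserved task gets its own cpuset
print(format_cpuset(plan.shared))  # remaining shared pool
```

Only the /nomad/shared value needs rewriting when a reservation is made; the shared tasks pick up the change through inheritance.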

As far as I can tell this will work just fine for both cgroups v1 and v2, but I do want to do some hands-on testing to be sure. It looks like this approach was mentioned in a comment on the original design doc NMD-098 (internal doc) but didn't get followed up on. 🤷

There are some backwards compatibility concerns here for upgrading existing clients that have running cgroup v1 containers. In the design doc we said:

During inplace client upgrades to 1.1 tasks will be running outside the desired cgroup. As part of the upgrade path, the Nomad client will create the necessary cpuset cgroups and move the task pids into them so as to not disrupt any reserved core workloads that are placed on the client in the future.

It looks like this works by unconditionally writing the PID to the new cgroup in setCPUSetCgroup. For this new hierarchy we'd need to:

write a new cgroup at the expected path if it doesn't exist
set the values to match all the current cgroups
move the PID into that cgroup
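
A rough sketch of those three steps, written against a plain directory so it can run without root. `migrate_pid` is a hypothetical helper and not the logic in setCPUSetCgroup. On a real cgroup mount the kernel creates the control files when the directory is made, and the writes need the right privileges.

```python
import os
import tempfile


def migrate_pid(cgroup_root, relative_path, settings, pid):
    """Create the cgroup if needed, apply settings, then move the PID in."""
    target = os.path.join(cgroup_root, relative_path.lstrip("/"))

    # Step 1: write a new cgroup at the expected path if it doesn't exist.
    os.makedirs(target, exist_ok=True)

    # Step 2: set the values to match the task's current cgroups.
    for filename, value in settings.items():
        with open(os.path.join(target, filename), "w") as f:
            f.write(value)

    # Step 3: move the PID by writing it to cgroup.procs.
    with open(os.path.join(target, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return target


# Demonstration against a temporary directory standing in for the mount.
with tempfile.TemporaryDirectory() as fake_root:
    target = migrate_pid(
        fake_root,
        "/nomad/shared/alloc_id_1-taskname",
        {"cpuset.cpus": "1-7"},
        pid=11050,
    )
    with open(os.path.join(target, "cpuset.cpus")) as f:
        applied_cpus = f.read()
    with open(os.path.join(target, "cgroup.procs")) as f:
        moved_pid = f.read()
    print(applied_cpus, moved_pid)
```

The ordering matters: the settings are applied before the PID is moved, so the process never runs in a cgroup with unset values.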

tgross · 2021-11-03T15:26:14Z

Well it turns out we can't have an intermediate hierarchy on the cgroup like /nomad/shared/alloc_id_1/taskname if we want to pick up the cpuset from /nomad/shared. ref cgroup v2 docs:

Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled.

So at the very least we'll end up with a flat per-task cgroup hierarchy, which is what we have already for everything other than the cpuset controller. Not a big deal but we'll have to adjust the logic in client/lib/cgutil/cpuset_manager_linux.go to use a flat namespace. (It's not clear to me yet if this applies to v1, but in any case we'd want to use the most-restricted option.)

This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on /sys/fs/cgroup. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing cpuset cgroups. Previously, Nomad made use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, and so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at /sys/fs/cgroup. This means that on systems running in hybrid mode, with cgroups2 mounted at /sys/fs/cgroup/unified (as is typical), Nomad will continue to use the v1 logic and should operate as before. Systems that do not support cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless otherwise configured in the client config), and create cgroups for tasks using the naming convention <allocID>-<task>.scope. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version, which will be set to "v1" or "v2" to indicate the cgroups regime in use by Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak". In cgroups v2, the PIDs are started in the cgroup they will always live in, and thus the cause of the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289 Fixes #11705 #11773 #11933
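
As an illustration of the detection and naming rules described in this PR text, here is a small sketch. It is not Nomad's implementation: checking for a cgroup.controllers file at the mount root is a common heuristic assumed here, and both function names are hypothetical.

```python
import os
import tempfile


def detect_cgroups_version(mount_root="/sys/fs/cgroup"):
    """Return "v2" if mount_root looks like a cgroup2 mount, else "v1".

    Assumption: a cgroup2 root exposes cgroup.controllers, while a v1 or
    hybrid layout does not have that file at the top level.
    """
    if os.path.exists(os.path.join(mount_root, "cgroup.controllers")):
        return "v2"
    return "v1"


def task_scope_path(alloc_id, task, parent="nomad.slice"):
    """Build the flat per-task cgroup name: <parent>/<allocID>-<task>.scope."""
    return f"{parent}/{alloc_id}-{task}.scope"


# Demonstration using a temporary directory as a stand-in mount point.
with tempfile.TemporaryDirectory() as fake_mount:
    before = detect_cgroups_version(fake_mount)
    open(os.path.join(fake_mount, "cgroup.controllers"), "w").close()
    after = detect_cgroups_version(fake_mount)

print(before, after)
print(task_scope_path("alloc_id_1", "taskname"))
```

Because every task gets one flat scope under the parent, there is no intermediate directory between the parent and the task cgroup, which sidesteps the immediate-children restriction quoted earlier in the thread.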

github-actions · 2022-10-10T02:44:09Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

notnoop added type/bug theme/cgroups cgroups issues labels Oct 8, 2021

notnoop self-assigned this Oct 8, 2021

tgross assigned tgross and unassigned notnoop Nov 2, 2021

tgross mentioned this issue Jan 14, 2022

Nomad CPU pinning is moving the container after it's created. #11705

Closed

tgross removed their assignment Jan 26, 2022

shoenig mentioned this issue Jan 26, 2022

client: setting empty cpu cgroup is broken #11933

Closed

mikenomitch assigned shoenig Mar 1, 2022

mikenomitch added this to the 1.3.0 milestone Mar 1, 2022

tgross mentioned this issue Mar 7, 2022

cores = 2 in resources not working on VM on Ubuntu 21.10 #11773

Closed

shoenig mentioned this issue Mar 10, 2022

build: first pass at running CI on GitHub Actions #12255

Closed

shoenig mentioned this issue Mar 21, 2022

client: enable cpuset support for cgroups.v2 #12274

Merged

shoenig mentioned this issue Mar 22, 2022

Implement raw_exec cgroups v2 support #12348

Closed

shoenig closed this as completed in #12274 Mar 24, 2022

shoenig mentioned this issue Apr 5, 2022

Memory stats and freezer management with cgroupv2 #10251

Closed

github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022
