
cpuset support with cgroupsv2 #11289

Closed
notnoop opened this issue Oct 8, 2021 · 3 comments · Fixed by #12274

Contributor

notnoop commented Oct 8, 2021

The Nomad 1.1 Reserved CPUs feature relies on cpuset cgroup capabilities and uses a technique that is no longer supported in cgroup-v2.

Reserved CPUs are dedicated to their tasks: other tasks may not access them. Nomad achieves this with the cpuset cgroups controller. Nomad creates a /nomad/shared cgroup for all non-reserving tasks, which starts unconstrained. If a task that reserves CPUs is placed on the host, Nomad removes the reserved cores from this shared cgroup, which stops already-running tasks from being scheduled on those cores.

Thus, non-reserving tasks belong to at least two cgroups: the nomad/shared cpuset cgroup as well as the cgroup created by the task driver. For example, a docker task process has the following cgroup assignment (note the cpuset value):

cat /proc/11050/cgroup
12:devices:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
11:pids:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
10:net_cls,net_prio:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
9:freezer:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
8:rdma:/
7:perf_event:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
6:cpuset:/nomad/shared
5:cpu,cpuacct:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
4:memory:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
3:blkio:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
2:hugetlb:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
1:name=systemd:/docker/10234c078fce6da0cb3e58c9da71a01852f0241bd5ce6b0e9fd19ec296e1aea0
0::/system.slice/containerd.service

This technique relies on cgroup-v1 hierarchy flexibility. cgroup-v2 no longer permits a process to belong to multiple cgroups! We will need to rethink this mechanism.
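To make the split membership concrete, here is a minimal sketch that parses `/proc/<pid>/cgroup` content and lists which cgroup each controller points at. The function name `parse_proc_cgroup` is hypothetical, and the sample is a shortened version of the output above (the container ID is abbreviated for readability).

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path from /proc/<pid>/cgroup content."""
    membership = {}
    for line in text.strip().splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        _hierarchy_id, controllers, path = line.split(":", 2)
        # An empty controller list marks the cgroup v2 (unified) entry.
        names = controllers.split(",") if controllers else ["(v2 unified)"]
        for name in names:
            membership[name] = path
    return membership


# Shortened sample; the docker container ID is abbreviated here.
sample = """\
6:cpuset:/nomad/shared
5:cpu,cpuacct:/docker/10234c07
0::/system.slice/containerd.service
"""

membership = parse_proc_cgroup(sample)
distinct_paths = sorted(set(membership.values()))
for name, path in membership.items():
    print(f"{name:15} {path}")
print("distinct cgroups:", len(distinct_paths))
```

Under v1 the cpuset controller can point at /nomad/shared while the other controllers point at the driver's cgroup. On a pure v2 host only the single `0::` line would be present, which is why the pattern breaks.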

Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers against
different hierarchies was intended to allow great flexibility for
application design. In practice, though, the flexibility turned
out to be less useful than expected, and in many cases added
complexity. Therefore, in cgroups v2, all available controllers
are mounted against a single hierarchy.

From https://man7.org/linux/man-pages/man7/cgroups.7.html

Oddly enough, Nomad uses the same pattern for reserving tasks: it creates a separate cpuset cgroup instead of setting the cpuset in the driver-managed cgroup. I'm unsure what the motivation was; if it's deemed unnecessary, we can have the drivers set the cpuset settings appropriately.

Resources

@notnoop notnoop added type/bug theme/cgroups cgroups issues labels Oct 8, 2021
@notnoop notnoop self-assigned this Oct 8, 2021
@tgross tgross assigned tgross and unassigned notnoop Nov 2, 2021
Member

tgross commented Nov 3, 2021

I think this can be solved in a way that still works with Docker by leaning on the --cgroup-parent flag that we can set in the HostConfig. This field is available in the Docker API since API v1.18 (at least... that's as far back as their docs go), and that's been available in Docker since Docker 1.10 shipped in early 2016.

This way we don't need to move the process into the new cgroup and can instead create it under the correct hierarchy and only change the cgroup controllers.

  • We set up the /nomad cgroup (rooted under /sys/fs/cgroup/) with no processes in it (this might be named differently if the cgroup_parent name is set on the client).
  • We set up the /nomad/shared cgroup with cpuset set to all CPUs.
  • We set up the /nomad/reserved cgroup with no cpuset set.
  • When a task without cores is created it's created in the /nomad/shared/<alloc ID>/<task name> cgroup without any cpuset (cgroup_parent = /nomad/shared).
  • When a task with cores is created, it's created in the /nomad/reserved/<alloc ID>/<task name> cgroup with the appropriate cpuset (cgroup_parent = /nomad/reserved). And we update the /nomad/shared cgroup to remove that cpu.

So for example, assume we have 2 tasks with cpu = 1024 and memory = 1024 on an 8-core machine. The cgroup hierarchy looks like this:

/nomad
├── /nomad/shared: cpuset:0-7
│   ├── /nomad/shared/alloc_id_1/taskname: cpu:1024,mem:1024 (inherits cpuset:0-7)
│   └── /nomad/shared/alloc_id_2/taskname: cpu:1024,mem:1024 (inherits cpuset:0-7)
└── /nomad/reserved:
    (no children)

Then we add a new task with cores = 1 and memory = 1024. The new cgroup hierarchy looks like this:

/nomad
├── /nomad/shared: cpuset:1-7
│   ├── /nomad/shared/alloc_id_1/taskname: cpu:1024,mem:1024 (inherits cpuset:1-7)
│   └── /nomad/shared/alloc_id_2/taskname: cpu:1024,mem:1024 (inherits cpuset:1-7)
└── /nomad/reserved:
    └── /nomad/reserved/alloc_id_3: cpuset:0,mem:1024
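The cpuset arithmetic in this example can be sketched as follows. This is only an illustration of the bookkeeping, not Nomad's actual code: `format_cpuset` and `shared_cpuset` are hypothetical helper names, and the reservation map keyed by allocation ID is an assumption.

```python
def format_cpuset(cpus):
    """Render CPU ids in the kernel's list format, e.g. {1, 2, 3, 5} -> "1-3,5"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current contiguous range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def shared_cpuset(all_cpus, reservations):
    """CPUs left for /nomad/shared once every reserved core is removed."""
    reserved = set()
    for cores in reservations.values():
        reserved |= set(cores)
    return set(all_cpus) - reserved


all_cpus = range(8)  # the 8-core machine from the example

before = format_cpuset(shared_cpuset(all_cpus, {}))
after = format_cpuset(shared_cpuset(all_cpus, {"alloc_id_3": [0]}))
print(before)  # 0-7
print(after)   # 1-7
```

The resulting string is what would be written to the shared cgroup's cpuset file whenever a reserving task is added or removed.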

As far as I can tell this will work just fine for both cgroups v1 and v2, but I do want to do some hands-on testing to be sure. It looks like this approach was mentioned in a comment on the original design doc NMD-098 (internal doc) but didn't get followed up on. 🤷

There are some backwards compatibility concerns here for upgrading existing clients that have running cgroup v1 containers. In the design doc we said:

During inplace client upgrades to 1.1 tasks will be running outside the desired cgroup. As part of the upgrade path, the Nomad client will create the necessary cpuset cgroups and move the task pids into them so as to not disrupt any reserved core workloads that are placed on the client in the future.

It looks like this works by unconditionally writing the PID to the new cgroup in setCPUSetCgroup. For this new hierarchy we'd need to:

  • write a new cgroup at the expected path if it doesn't exist
  • set the values to match all the current cgroups
  • move the PID into that cgroup
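Those three steps could look roughly like the sketch below. It assumes the usual cgroupfs behavior (creating a directory creates the cgroup, and writing a PID to `cgroup.procs` moves the process), and the helper name `migrate_pid`, the relative path, and the settings map are all hypothetical. The demo runs against a temporary directory, where these are only ordinary files, so it is safe to run anywhere.

```python
import os
import tempfile


def migrate_pid(cgroup_root, rel_path, settings, pid):
    """Create the cgroup if needed, copy settings into it, then move the PID."""
    path = os.path.join(cgroup_root, rel_path)

    # Step 1: write a new cgroup at the expected path if it doesn't exist.
    os.makedirs(path, exist_ok=True)

    # Step 2: set the values to match the task's current cgroups.
    for filename, value in settings.items():
        with open(os.path.join(path, filename), "w") as f:
            f.write(value)

    # Step 3: move the PID into that cgroup.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path


# Demo against a throwaway directory instead of the real /sys/fs/cgroup.
with tempfile.TemporaryDirectory() as root:
    path = migrate_pid(root, "nomad/reserved/alloc_id_3", {"cpuset.cpus": "0"}, 11050)
    print(sorted(os.listdir(path)))
```

Doing the settings write before the PID move means the process never runs in a cgroup with unset limits.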

Member

tgross commented Nov 3, 2021

Well, it turns out we can't have an intermediate hierarchy on the cgroup like /nomad/shared/alloc_id_1/taskname if we want to pick up the cpuset from /nomad/shared. From the cgroup v2 docs:

Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled.

So at the very least we'll end up with a flat per-task cgroup hierarchy, which is what we have already for everything other than the cpuset controller. Not a big deal but we'll have to adjust the logic in client/lib/cgutil/cpuset_manager_linux.go to use a flat namespace. (It's not clear to me yet if this applies to v1, but in any case we'd want to use the most-restricted option.)

@tgross tgross removed their assignment Jan 26, 2022
@mikenomitch mikenomitch added this to the 1.3.0 milestone Mar 1, 2022
shoenig added a commit that referenced this issue Mar 21, 2022
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroup. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Previously, Nomad made use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup hierarchy inheritance via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroup. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroup/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client config), and create cgroups for tasks using
the naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to "v1" or "v2" to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
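
As a rough illustration of the detection and naming rules described in the commit message, here is a minimal sketch. It is not the code from the PR: the function names are hypothetical, and the check for a `cgroup.controllers` file at the mount root is an assumption (that file exists at the root of a cgroup v2 hierarchy, but Nomad may detect v2 differently). The conventional mount point /sys/fs/cgroup is used as the default.

```python
import os
import tempfile


def detect_cgroups_version(mount="/sys/fs/cgroup"):
    """Return "v2" if the mount looks like a unified hierarchy, else "v1".

    Assumption: cgroup.controllers at the mount root indicates cgroup v2.
    In hybrid mode that file lives under <mount>/unified instead, so the
    result is "v1", matching the behavior described above.
    """
    marker = os.path.join(mount, "cgroup.controllers")
    return "v2" if os.path.isfile(marker) else "v1"


def task_scope_path(alloc_id, task, parent="nomad.slice", mount="/sys/fs/cgroup"):
    """Build the per-task cgroup path using the <allocID>-<task>.scope convention."""
    return f"{mount}/{parent}/{alloc_id}-{task}.scope"


# Demo with fake mount points so the result does not depend on the host.
with tempfile.TemporaryDirectory() as unified, tempfile.TemporaryDirectory() as hybrid:
    open(os.path.join(unified, "cgroup.controllers"), "w").close()
    os.makedirs(os.path.join(hybrid, "unified"))
    open(os.path.join(hybrid, "unified", "cgroup.controllers"), "w").close()
    unified_version = detect_cgroups_version(unified)
    hybrid_version = detect_cgroups_version(hybrid)

print(unified_version, hybrid_version)
print(task_scope_path("alloc_id_1", "taskname"))
```

Because every task gets exactly one `.scope` cgroup under the parent, the hierarchy stays flat and each task's cpuset can be written directly to its own cgroup.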
shoenig added a commit that referenced this issue Mar 23, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022