raw_exec: make raw exec driver work with cgroups v2 #12419
@@ -78,7 +78,8 @@ jobs:
         run: |
           make bootstrap
           make generate-all
-          make test-nomad-module
+          sudo sed -i 's!Defaults!#Defaults!g' /etc/sudoers
+          sudo -E env "PATH=$PATH" make test-nomad-module
TestFS_Log checks the alloc becomes complete, which it doesn't because it gets blocked by the missing cgroup for cpuset unless running as root.
This PR adds support for the raw_exec driver on systems with only cgroups v2. The raw exec driver is able to use cgroups to manage processes. This happens only on Linux, when exec_driver is enabled, and the no_cgroups option is not set. The driver uses the freezer controller to freeze processes of a task, issue a sigkill, then unfreeze. Previously the implementation assumed cgroups v1, and now it also supports cgroups v2. There is a bit of refactoring in this PR, but the fundamental design remains the same. Closes #12351 #12348
LGTM @shoenig. I've left some comments but they're mostly more like questions. 😀
// CgroupParent for this node (linux only)
CgroupParent string
Should this show up in the api.Node struct as well?
good catch! fixed
nomad/core_sched_test.go
Outdated
@@ -2244,7 +2244,7 @@ func TestCoreScheduler_CSIPluginGC(t *testing.T) {
}

func TestCoreScheduler_CSIVolumeClaimGC(t *testing.T) {
	ci.Parallel(t)
	ci.SkipSlow(t, "flaky on GHA; #12358")
Ah yup this is out of date, removed!
type containment struct {
	// non-linux executors currently do not create resources to be cleaned up
}
I'm having a little trouble following what we're doing with the Containment interface. We don't implement it for the non-Linux case (here), so isn't there only ever 1 implementation?
Good point, the only implementation applies to Linux and the creation / usage is all gated in build-tagged files. The implementation in containment_default.go isn't even complete, much less used; let's just get rid of it.
// kill the processes in cgroup
for _, pid := range pids {
Do we not expect that all the PIDs are going to be in the same process group? That is, could we kill via -pid here the same way we do for non-Linux use cases, or is this so we can avoid killing the executor even though we moved it into a different cgroup?
This is honestly just copying the original implementation, but I believe this defends against sub-sub-sub processes that get daemonized. So like
Executor -> A -> B -> C
If B dies, C gets owned by PID 1, not A, and is no longer part of the process group ... right? 🤔
Children always get the parent's process group when forked, so the chain would be preserved even in that case (I don't think it gets changed after the fact on wait). But this implementation works, so let's keep it.
This is pretty in the weeds but I guess it does protect against callers of setsid or setpgid
That's a good point. 👍
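To illustrate the setsid point above, here is a minimal sketch (not code from this PR) showing that a child which calls setsid leaves its parent's process group, so a kill aimed at the group would miss it while cgroup membership would not. It assumes Linux and a `sleep` binary on PATH; `childPgid` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// childPgid starts a short-lived child (assumption: `sleep` is on PATH) and
// returns the process group ID of this process and of the child. With
// setsid=true the child calls setsid(2) before exec, which places it in a new
// session and a new process group.
func childPgid(setsid bool) (parent, child int, err error) {
	cmd := exec.Command("sleep", "5")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: setsid}
	if err = cmd.Start(); err != nil {
		return 0, 0, err
	}
	// Always kill and reap the child so the example leaves nothing behind.
	defer func() {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()

	if parent, err = syscall.Getpgid(os.Getpid()); err != nil {
		return 0, 0, err
	}
	if child, err = syscall.Getpgid(cmd.Process.Pid); err != nil {
		return 0, 0, err
	}
	return parent, child, nil
}

func main() {
	for _, setsid := range []bool{false, true} {
		parent, child, err := childPgid(setsid)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("setsid=%v same process group: %v\n", setsid, parent == child)
	}
}
```

A signal sent to the negative process group ID only reaches processes still in that group, which is why iterating over the PIDs listed in the cgroup is the more defensive choice.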
// todo(shoenig) - Setting AllocID causes test to fail - with or without
// cgroups, and with or without chroot. It has to do with MkAllocDir
// creating the directory structure, but the actual root cause is still
// TBD. The symptom is that any command you try to execute will result
// in "permission denied" coming from os/exec. This test is imperfect,
// the actual feature of running commands as another user works fine.
// AllocID: uuid.Generate()
If we leave AllocID out, then AllocDir.SharedDir and AllocDir.AllocDir both end up directly under the tempdir (which is also task.AllocDir), instead of under the allocation ID one level down.
That is, instead of /tmp/nomad_driver_harness-/alloc/:alloc_id and /tmp/nomad_driver_harness-/alloc we get /tmp/nomad_driver_harness-/alloc and /tmp/nomad_driver_harness-, and maybe the latter has the wrong permissions now?
// use the task environment variables for determining the cgroup path -
// not ideal but plumbing the values directly requires grpc protobuf changes
parent := lookup(e.commandCfg.Env, taskenv.CgroupParent)
allocID := lookup(e.commandCfg.Env, taskenv.AllocID)
task := lookup(e.commandCfg.Env, taskenv.TaskName)
And those changes would be platform-specific, which would be sad to have to plumb into the protobufs if we can avoid it. It's probably worth considering whether we want to have a "platform variables" field or something in the future so we're not trying to smuggle things in thru env vars that we might not necessarily want to expose to the workload.
"platform variables" field
That's a clever idea I'm definitely going to forget, so I created #12468
Final CI run on V2 https://github.com/shoenig/nomad/actions/runs/2098586202
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver may use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a SIGKILL, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2. Because the driver manages cgroups
directly and not via containerd, we must account for the v2 changes ourselves.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Unlike the v1 implementation, the v2 implementation respects the client cgroup_parent
configuration option. This value is plumbed through to the driver via a task
environment variable, since it (like allocID / taskName) is not otherwise available
from the task config protobuf. A new environment variable NOMAD_PARENT_CGROUP
is now part of the task runtime when supported.
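As a rough sketch of how a cgroup path could be derived from those task environment variables: the helper names (`lookup`, `scopePath`), the `KEY=VALUE` slice shape of the environment, and the `<allocID>.<task>.scope` naming below are assumptions made for illustration, not confirmed details of this PR. Only NOMAD_PARENT_CGROUP comes from the description above; NOMAD_ALLOC_ID and NOMAD_TASK_NAME are the usual Nomad task variables.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

const (
	envCgroupParent = "NOMAD_PARENT_CGROUP" // named in the PR description
	envAllocID      = "NOMAD_ALLOC_ID"
	envTaskName     = "NOMAD_TASK_NAME"
)

// lookup returns the value of name from a list of "KEY=VALUE" entries, or ""
// when the variable is absent.
func lookup(env []string, name string) string {
	prefix := name + "="
	for _, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			return strings.TrimPrefix(kv, prefix)
		}
	}
	return ""
}

// scopePath builds <root>/<parent>/<allocID>.<task>.scope from the task
// environment. The scope naming scheme is an assumption of this sketch.
func scopePath(root string, env []string) (string, error) {
	values := make(map[string]string, 3)
	for _, name := range []string{envCgroupParent, envAllocID, envTaskName} {
		v := lookup(env, name)
		if v == "" {
			return "", fmt.Errorf("%s is not set", name)
		}
		values[name] = v
	}
	scope := fmt.Sprintf("%s.%s.scope", values[envAllocID], values[envTaskName])
	return filepath.Join(root, values[envCgroupParent], scope), nil
}

func main() {
	env := []string{
		"NOMAD_PARENT_CGROUP=nomad.slice",
		"NOMAD_ALLOC_ID=abc123",
		"NOMAD_TASK_NAME=web",
	}
	path, err := scopePath("/sys/fs/cgroup", env)
	fmt.Println(path, err)
}
```

Matching on the full `NAME=` prefix avoids confusing a variable with another whose name merely starts with the same characters.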
A few irrelevant tests have been stubbed out with SkipSlow because they are very flaky
on GHA, which skips slow tests. Circle is still configured to run slow tests.
Closes #12351 #12348
Partial #10251 (leaving open for the memory stats)
Reviewers:
Neither GHA nor Circle yet offers machine types using only cgroups v2. Instead of running everything manually I forked Nomad into shoenig/nomad and set up a temporary self-hosted GHA runner running ubuntu-21.10. The run based on c8d2402 (this PR) is HERE. The show-system job indicates the cgroup v2 mount point on /sys/fs/cgroup, which activates Nomad's cgroup v2 code paths and tests. Which you can't see the output of unless you're me 😞