Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gh-6980] Client: clean up old allocs before running new ones using the exec task driver. #20500

Merged
merged 24 commits into from
May 14, 2024

Conversation

Juanadelacuesta
Copy link
Member

@Juanadelacuesta Juanadelacuesta commented Apr 30, 2024

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This PR introduces two mechanisms to avoid this behaviour:

  • Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
  • Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.

It fixes #6980

@Juanadelacuesta Juanadelacuesta force-pushed the gh-b-6980 branch 3 times, most recently from 8a6f933 to aaea742 Compare May 1, 2024 11:34
@Juanadelacuesta Juanadelacuesta force-pushed the gh-b-6980 branch 2 times, most recently from 0ebb10b to 603c40a Compare May 2, 2024 15:07
@Juanadelacuesta Juanadelacuesta changed the title func: set nomad as subreaper, clean processes in cgroup and wait for … [gh-6980] Client: clean up old allocs before running new ones using the exec task driver. May 8, 2024
@Juanadelacuesta Juanadelacuesta marked this pull request as ready for review May 8, 2024 16:48
client/lib/cgroupslib/editor.go Outdated Show resolved Hide resolved
.changelog/20500.txt Outdated Show resolved Hide resolved
go.mod Outdated
@@ -172,7 +172,7 @@ require (
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/checkpoint-restore/go-criu/v5 v5.3.0 // indirect
github.com/cheggaaa/pb/v3 v3.0.5 // indirect
github.com/cilium/ebpf v0.9.1 // indirect
github.com/cilium/ebpf v0.11.0 // indirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a go.mod change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested a few things before, this is probably a left over of one of those attempts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we'll want to back this change out before merging.

drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
drivers/shared/executor/executor_linux.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking great @Juanadelacuesta. I've left one question about the signals we're masking, and then there's a couple remaining bits to mop up like the go.mod and then I think this will be ready to ship.

signal.Notify(l.sigChan, syscall.SIGHUP, syscall.SIGQUIT, syscall.SIGTERM, syscall.SIGINT, syscall.SIGSEGV)
for {
signal := <-l.sigChan
if signal == syscall.SIGTERM || signal == syscall.SIGKILL || signal == syscall.SIGINT {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGKILL isn't one of the signals we've masked above, and we can't catch it anyways. So this block really is only translating SIGTERM to SIGINT, right?

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work on this one!

@Juanadelacuesta Juanadelacuesta merged commit 169818b into main May 14, 2024
19 checks passed
@Juanadelacuesta Juanadelacuesta deleted the gh-b-6980 branch May 14, 2024 07:51
Juanadelacuesta added a commit that referenced this pull request May 14, 2024
…he `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Juanadelacuesta added a commit that referenced this pull request May 14, 2024
…he `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
Juanadelacuesta added a commit that referenced this pull request May 15, 2024
…he `exec` task driver. (#20500) (#20584)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.

Co-authored-by: Juana De La Cuesta <[email protected]>
Juanadelacuesta added a commit that referenced this pull request May 15, 2024
…ones using the `exec` task driver. into release/1.6.x (#20583)

* [gh-6980] Client: clean up old allocs before running new ones  using the `exec` task driver. (#20500)

Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.

* fix: add the missing cgroups functions

---------

Co-authored-by: Juana De La Cuesta <[email protected]>
Co-authored-by: Juanadelacuesta <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.6.x backport to 1.6.x release line backport/1.7.x backport to 1.7.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rogue Processes
2 participants