task runner: fix goroutine leak in prestart hook #11741
The task runner prestart hooks take a `joincontext` so they have the option to exit early if either of two contexts is canceled: from killing the task or from client shutdown. Some tasks exit without being shut down by the server, so neither of the joined contexts ever gets canceled and we leak the `joincontext` (48 bytes) and its internal goroutine. This primarily impacts batch jobs and any task that fails or completes early, such as non-sidecar prestart lifecycle tasks. Cancel the `joincontext` after the prestart call exits to fix the leak.
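The leak described above can be reproduced in miniature with only the standard library. This is a rough sketch, not the Nomad code: `join` is a hypothetical stand-in for `joincontext.Join` (assumed here to start one goroutine that waits on both parents), and `runPrestart` and `leaked` are illustrative helpers that do not exist in the repository.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// join is a hypothetical stand-in for joincontext.Join. The returned context
// is canceled when either parent is canceled or when the cancel func is called.
func join(a, b context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		// This goroutine stays alive until one of the three contexts is done.
		select {
		case <-a.Done():
		case <-b.Done():
		case <-ctx.Done():
		}
		cancel()
	}()
	return ctx, cancel
}

// runPrestart simulates one prestart call. With release=false the cancel func
// is never called, which mirrors the behavior before the patch.
func runPrestart(killCtx, shutdownCtx context.Context, release bool) {
	joinedCtx, joinedCancel := join(killCtx, shutdownCtx)
	if release {
		defer joinedCancel()
	}
	_ = joinedCtx // the hooks would receive joinedCtx here
}

// leaked runs n simulated prestart calls whose parent contexts are not
// canceled and reports how many extra goroutines are still alive afterwards.
func leaked(n int, release bool) int {
	killCtx, killCancel := context.WithCancel(context.Background())
	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())

	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		runPrestart(killCtx, shutdownCtx, release)
	}
	time.Sleep(200 * time.Millisecond) // give released goroutines time to exit
	extra := runtime.NumGoroutine() - before

	// Clean up so repeated measurements start from the same baseline.
	killCancel()
	shutdownCancel()
	time.Sleep(200 * time.Millisecond)
	return extra
}

func main() {
	fmt.Println("extra goroutines without cancel:", leaked(50, false))
	fmt.Println("extra goroutines with cancel:   ", leaked(50, true))
}
```

In this sketch the goroutine count grows by one per simulated prestart call when the cancel func is dropped, and returns to the baseline when it is deferred.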
Note for reviewers: #11547 (comment) has the analysis that the patch fixes the leak. As far as correctness goes, in addition to the usual automated testing, I took the following two jobs and made sure everything looked as expected under operations like

service job

job "example" {
  datacenters = ["dc1"]

  group "web" {
    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "setup" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "cp local/index.html /alloc/index.html"]
      }
      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }
      resources {
        cpu    = 128
        memory = 128
      }
    }

    task "sidecar" {
      lifecycle {
        hook    = "prestart"
        sidecar = true
      }
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "echo 'sidecar running'; sleep 600"]
      }
      resources {
        cpu    = 128
        memory = 128
      }
    }

    task "http" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/alloc"]
        ports   = ["www"]
      }
      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}
batch job

job "example" {
  type        = "batch"
  datacenters = ["dc1"]

  parameterized {
    payload = "required"
  }

  group "group" {
    task "setup" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "cp local/index.html /alloc/index.html"]
      }
      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }
      resources {
        cpu    = 128
        memory = 128
      }
    }

    task "task" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "cat local/payload.txt; cat alloc/content.txt; sleep 1"]
      }
      dispatch_payload {
        file = "local/payload.txt"
      }
      resources {
        cpu    = 64
        memory = 64
      }
    }
  }
}
Great work in the investigation! I checked the pre-start hooks themselves and none of them actually listen for a cancellation, so this shouldn't have any unexpected side-effects.
  // to be canceled by either killCtx or shutdownCtx
- joinedCtx, _ := joincontext.Join(tr.killCtx, tr.shutdownCtx)
+ joinedCtx, joinedCancel := joincontext.Join(tr.killCtx, tr.shutdownCtx)
+ defer joinedCancel()
Is a new `joinedCtx` per hook necessary? We can reuse the same one for all prestart hooks.
Also, I haven't touched this code in so long - would this accidentally risk stopping goroutines that were launched by prestart hooks if they rely on the passed context? If we use a single `joinedCtx`, we can save it and cancel it in the exited/stop hooks.
> Is a new `joinedCtx` per hook necessary? We can reuse the same one for all prestart hooks.

Oh, that's a good point. We could move this up to the top of the loop (but we can't pass it into this `prestart` method from the caller in `task_runner.go` because then we don't get to use the `defer`). Will do that.

> would this accidentally risk stopping goroutines that were launched by prestart hooks if they rely on the passed context?

Yes, but I went through and verified we're not doing that currently (as @lgfa29 has noted, most of the prestart hooks don't even use the context). I feel fairly confident saying that saving the context in a goroutine in a prestart hook is not something we should be doing at all, given how we treat prestart hooks. I can add a note to warn future developers about it, at least.
Fixes #11547