-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nomad client high memory usage #11547
Comments
Thanks for the report @bubejur Does this happen in clients or servers? Or does it affect multiple agents? And just to give us an idea of scale, would you be able to share rough estimates of you cluster size? Like number of servers, clients, jobs etc. Thanks! |
It happens on clients. Affect multiple agents/not all. I think cluster size is not a reason. |
we already done something with that with @tgross |
but memory leak is still exists |
old issue is: #9858 |
We are seeing similar behaviour throughout all our workers, the same OOM effect but now the progression is slower ( ~ 3 weeks ). We are investigating this and we will post more relevant data as we have. |
In order to try and help, i'm providing output of
Please let us know what other info we might collect and share if it helps. |
Another one bit the dust today.
systemctl restart nomad fixes the problem temporary. We re-enabled the daily cron restart. |
This looks like it's #9858 but that was supposed to have been fixed in 1.1.6, so it looks like we might have missed something. We may need to try to build custom binaries again with the patch we did in release-1.1.5...b-logmon-inspect and then dig into the memory profiles again. Some options we have here:
|
Thanks @tgross for taking it on. I wanted to note that issue #9858 and #11261 focused on the logmon tasks. @bubejur The screenshot above highlights the nomad agent processes rather than logmon; can you confirm if you are still seeing high logmon usage as well? I suspect that another memory leak exists in the agent side and and that the pprof endpoint data will be sufficient without needing a custom binary. |
Thanks for pointing that out @notnoop. I want to back up a bit and get more specific about the symptoms and diagnostics, because the information provided is looking at two different things and there's not a lot of context for what's been given. (1) The (2) The The first question this raises for me is that this agent is just a client and not also a server, right? A production client should not also be a server, because the server needs to use a lot of memory to hold the state store and that makes that memory unavailable for workloads. Assuming that's correct, then @notnoop's suggestion to get pprof data is the next move. Set
You can then bundle those files up into a tarball and send it to [email protected] with this issue number in the subject line. Alternately, you can run Once that's done I'll acknowledge receipt on this issue and post back once I've done an analysis. Thanks! |
@bubejur we've asked for two things: A question we need the answer to: Is the Nomad agent running as both a server and a client? If so, then I would expect the memory usage to be very high as the server holds all the state. This is not a recommended production practice. If the answer to that question is no... What to do next Set
Those endpoints will cause a file to be downloaded. You can then bundle those files up into a tarball and send it to [email protected] with this issue number in the subject line. |
Ok. Done, waiting for high memory usage! |
@bubejur we're still looking for the answer from you about the first question:
|
yes, you're right
this is server side |
Sorry, maybe I'm not communicating well. What I'm asking is if the agent that's using a lot of memory is configured for both |
@tgross
client side:
|
also sent email [email protected] with tarball! |
Thanks @bubejur. I took a look at that heap profile with My suspicion is that we're either leaking a goroutine or that we're leaking these join context objects in the prestart tasks. It might help if you ran the same profiles on the same machine again now, because that would tell us which section was growing. |
Oh... thanks, i have only two nodes with those tasks. I stopped one of them and all tasks're now running only at 1st node. Also i restarted nomad agent on it. Gonna tell you more this evening. |
@bubejur what I was really looking for there was growth of the same process memory over time, so if you could get the heap dump from the agent, and then get it again from the same process some time later (say 4 or 5 hours), that'd be useful. |
@tgross sent you 1st part of files. Also will send ypu 2nd tomorrow morning for example. |
Morning! Sent you 2nd part of files |
@tgross - I sent an e-mail containing dumps before OOM and dumps just as the server was going into swap. Hope they help ! |
Thanks. I've been swamped looking into another issue the last few days but I hope to look at those dumps soon. |
Ok, looks pretty clear that we're leaking those join contexts. Those values keep climbing and all the associated goroutines are parked. I'll see if I can work up a reproduction that doesn't involve having to run a cluster for an extended period of time and get back to you folks once I've done so. |
Dropping some more detail on my comment above. The goroutine dump from most recent shows that we've got 1532687 goroutines and 1529664 of them are in the joincontext select found at
The heap space likewise shows that we've got a bunch of memory being used by the runtime (to be expected b/c of all those goroutines) and 236+MB of space from allocating new joincontexts at
I've been digging thru the code and I'm still trying to work up a reproduction, but I feel pretty confident that joincontext is the object being leaked at this point. |
Ok, I've got a working reproduction here and it looks like we needed to make sure we're canceling the join context even though I'd have expected the parent contexts to have done that. I've got a patch below that fixes the leak (also on branch We also have this exact same pattern at I'm getting towards the end of my day here, but I'll pick this back up again tomorrow. Thanks for your patience @anastazya and @bubejur! $ git diff client/allocrunner/taskrunner/task_runner_hooks.go
diff --git a/client/allocrunner/taskrunner/task_runner_hooks.go b/client/allocrunner/taskrunner/task_runner_hooks.go
index c9ff34441..72a509e05 100644
--- a/client/allocrunner/taskrunner/task_runner_hooks.go
+++ b/client/allocrunner/taskrunner/task_runner_hooks.go
@@ -237,7 +237,9 @@ func (tr *TaskRunner) prestart() error {
// Run the prestart hook
// use a joint context to allow any blocking pre-start hooks
// to be canceled by either killCtx or shutdownCtx
- joinedCtx, _ := joincontext.Join(tr.killCtx, tr.shutdownCtx)
+ joinedCtx, joinedCancel := joincontext.Join(tr.killCtx, tr.shutdownCtx)
+ defer joinedCancel()
+
var resp interfaces.TaskPrestartResponse
if err := pre.Prestart(joinedCtx, &req, &resp); err != nil {
tr.emitHookError(err, name) Reproduction I took this batch job: jobspecjob "example" {
type = "batch"
datacenters = ["dc1"]
parameterized {
payload = "required"
}
group "group" {
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "/bin/sh"
args = ["-c", "cat local/payload.txt; sleep 1"]
}
dispatch_payload {
file = "local/payload.txt"
}
resources {
cpu = 64
memory = 64
}
}
}
} Run this job and dispatch it a few hundred times by letting this run for a while:
Apply the patch and run again. After waiting until all the tasks have complete, there are no more joincontexts floating around once all the tasks have completed:
|
Going to add an addendum (mostly for my own notes) that I suspect the reason we're seeing this so easily with batch jobs is that they exit on their own, rather than being shutdown by the client. Service jobs run until they crash (not frequent) or until the server stops them, which calls one of the two contexts that make up the joined context. So I suspect we're not closing out the contexts at the end of the task runner run like we should be; I'll test that theory tomorrow. |
@tgross that's looks great! Thank's a lot for your time! |
A small program that closely matches how Nomad's taskrunner is set up demonstrates the problem: main.gopackage main
import (
"context"
"fmt"
"log"
"net/http"
_ "net/http/pprof"
"os"
"os/signal"
"runtime"
"syscall"
"time"
"github.com/LK4D4/joincontext"
)
const numTaskRunners = 100
// you can play around with batchTime vs pauseTime to take
// a goroutine dump at various stages to see the behavior
const batchTime = 30
const pauseTime = 5
func main() {
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
shutdownCtx, shutdownFn := context.WithCancel(context.Background())
go trap(shutdownFn)
serviceRunners := []*TaskRunner{}
fmt.Println("starting runners")
for i := 0; i < numTaskRunners; i++ {
r := NewTaskRunner(shutdownCtx, "service")
serviceRunners = append(serviceRunners, r)
go r.Run()
}
for i := 0; i < numTaskRunners; i++ {
r := NewTaskRunner(shutdownCtx, "batch")
go r.Run()
}
// make sure we have at least one runner going so that we can
// expect to see exactly 1 goroutine for joincontext when we're
// done
waiter := NewTaskRunner(shutdownCtx, "service")
go waiter.Run()
time.Sleep(pauseTime * time.Second)
fmt.Println("stopping runners")
for _, r := range serviceRunners {
r.Stop()
}
runtime.GC()
<-shutdownCtx.Done()
}
func trap(shutdownFn context.CancelFunc) {
c := make(chan os.Signal, 1)
signal.Notify(c, syscall.SIGINT)
<-c
shutdownFn()
}
type TaskRunner struct {
taskType string
preStartHooksRun int
shutdownCtx context.Context
killCtx context.Context
killCancelFn context.CancelFunc
waitCh chan struct{}
}
func NewTaskRunner(shutdownCtx context.Context, taskType string) *TaskRunner {
ctx, cancel := context.WithCancel(context.Background())
return &TaskRunner{
taskType: taskType,
shutdownCtx: shutdownCtx,
killCtx: ctx,
killCancelFn: cancel,
waitCh: make(chan struct{}),
}
}
func (tr *TaskRunner) Run() {
joinedCtx, joinCancel := joincontext.Join(tr.killCtx, tr.shutdownCtx)
// NOTE: leave this out and we'll end up with 100
// leaked join contexts from the batch job
defer joinCancel()
for i := 0; i < 10; i++ {
tr.prestartHook(joinedCtx)
}
go tr.runTask()
}
func (tr *TaskRunner) prestartHook(ctx context.Context) {
tr.preStartHooksRun++
}
// runTask simulates the task plugin interface
func (tr *TaskRunner) runTask() {
if tr.taskType == "batch" {
go func() {
<-time.After(batchTime * time.Second)
close(tr.waitCh)
}()
}
select {
case <-tr.killCtx.Done():
case <-tr.shutdownCtx.Done():
case <-tr.waitCh:
}
}
func (tr *TaskRunner) Stop() {
tr.killCancelFn()
}
func (tr *TaskRunner) Kill(ctx context.Context) error {
tr.killCancelFn()
select {
case <-tr.waitCh:
case <-ctx.Done():
return ctx.Err()
}
return nil
} |
#11741 is up with that patch and should go out with the next scheduled release (shortly after the new year), but it'd be a safe patch to backport to your own build if you want to land a fix in your environment before that. |
Thanks a lot! Will be waiting for new release! |
@notnoop @tgross hi guys! I made an update for 1.2.4 but got another issue with consul templating:
Nomad side:
|
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.1.6 (b83d623)
Operating system and Environment details
Centos7
Issue
High memory usage by nomad processes
Reproduction steps
It's not frequently issue, but you can try to reproduce
Expected Result
not so much memory consumption
Actual Result
it's currenlty too big
Job file (if appropriate)
Nomad Client logs (if appropriate)
The text was updated successfully, but these errors were encountered: