high memory usage in logmon #9858
Hi @anastazya! The […]
I can give the stats for another host in the same state, because I restarted Nomad on the host that I took the screenshot of.
Ok, so the first thing I worried about was leaking Nomad processes. You've got 93 […]. If we do a bit of math we can see that ~7GB of RSS are dedicated to […]. A couple things that would be useful to help debug this:
It might also be useful to see if there's a relationship between the etime of those processes and their memory usage.

I've done some very quick tests here and I don't see any obvious leaks on the executor or logmon for a process that's doing frequent logging. But you may be encountering this issue more seriously because of the large number of tasks on the host. It looks like you're running a task-per-[…].

(Also, I hope you don't mind, but I edited your last response to wrap that ps dump in a code block.)
Nomad v1.0.0 (cfca640)

I will compile the rest of the details today, as this is somewhat of a production environment.
@anastazya I did some quick rummaging through those numbers and it looks like there's no relationship between etime and memory (if anything, the executors seem to free up memory a tiny bit the older they are). But the logmon processes are grabbing a lot more RSS than I might expect, and there are two outlier logmon processes that are each eating up 1+GB of memory. The primary way I'd expect to see this is if the logmon processes have queued up a lot of messages that aren't yet committed to disk, although I'm surprised that wouldn't result in dropped messages first... I need to dig into that code. In any case, it might be worth looking at the host's disk IO utilization. With that clue, I'll see if I can set up a reproduction that'll cause logmon's memory to expand unexpectedly.
Well, that was totally off-base... we have a fifo with blocking reads with a small buffer, so that's just not something that would happen without also blocking the log writer of your application. I've been trying to get a reproduction here by having an […]

The total RSS does seem proportional to log volume though: if I tune the throughput of the logs by sending small web request logs instead of literally running […]

With more information about your log load, I might be able to narrow things down. Do you have a logs block configured in your jobs?
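To make the backpressure argument above concrete, here is a minimal, hypothetical Go sketch (not Nomad's actual logmon code): copying from a fifo through a small fixed buffer means the copier never holds more than one buffer's worth of log data in memory, so a slow disk slows the writing application down through the fifo rather than inflating the copier's RSS. The file paths are assumptions for illustration.

```go
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	// The fifo and output paths are illustrative; logmon's real plumbing differs.
	fifo, err := os.Open("/tmp/example.fifo") // blocks until a writer connects
	if err != nil {
		log.Fatal(err)
	}
	defer fifo.Close()

	out, err := os.Create("/tmp/example.log")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// A small, fixed-size buffer: the copy never holds more than 32KB of
	// log data in memory, so a slow disk backpressures the writing
	// application through the fifo instead of growing this process's heap.
	buf := make([]byte, 32*1024)
	if _, err := io.CopyBuffer(out, fifo, buf); err != nil {
		log.Fatal(err)
	}
}
```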
The logs config across all jobs is like this: […]
I will get back with the log volume; it needs investigation.
That's very interesting... I'd expect that to result in very rapid log rotation if the volume of logs is reasonably high, which might be triggering buffering while the rotation is happening. That gives me something else to look at, at least, but the log volume would be interesting to see for sure. May I ask why such a low value for these?
In the beginning we thought to log only the bare minimum and minimise the storage impact and IOPS usage. But if you say this has a great impact on log rotation, we will change those values. Any recommendation?
That entirely depends on the log volume. If you're only sending a trickle of logs it doesn't matter. But if you're sending >1MB of logs a second then you're rotating the logs continuously. You almost certainly want the logs to be rotating less often than the time it takes to actually rotate them. We don't compress the logs, so it should be quick unless the disk is getting hammered (which of course it might be if you're writing that many logs!). It's definitely worth comparing the log volume to the overall IOPS usage.
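As a back-of-envelope illustration of why small limits plus high volume mean continuous rotation, here is a quick calculation in Go. The max_file_size, max_files, and throughput numbers are assumptions for the sketch, not values from this cluster.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers only - not the reporter's actual settings.
	const maxFileSizeMB = 1.0 // assumed small max_file_size in the logs block
	const maxFiles = 2.0      // assumed small max_files in the logs block
	const logRateMBps = 1.5   // assumed sustained log volume for one task, MB/s

	secondsPerRotation := maxFileSizeMB / logRateMBps
	retentionSeconds := maxFileSizeMB * maxFiles / logRateMBps

	fmt.Printf("a rotation every %.1fs, ~%.1fs of logs retained\n",
		secondsPerRotation, retentionSeconds)
	// With numbers like these the task rotates its log file roughly every
	// second, which is when buffering during rotation starts to matter.
}
```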
I will be doing in-depth tests next week and will write here if I find something log related. As far as I can see, the logging is not that aggressive.
I've encountered a scenario where Nomad consumed a lot of memory (over 5GB). I eventually found out it happened because of tasks that keep restarting forever due to insufficient resources or bad arguments. Edit: […]
I can assure you that all the tasks run well and don't restart with errors.
I looked at it a bit more; the issue I described refers only to when the Nomad agent itself uses high memory.
@Nickleby9 let's open that as a new issue then, so as not to confuse this issue.
I still need to come back to review this, but I just saw that go1.16 dropped with the following improvement (ref https://golang.org/doc/go1.16#runtime): on Linux, the runtime now releases memory to the operating system promptly (using MADV_DONTNEED) rather than lazily (MADV_FREE), so RSS more accurately reflects the physical memory a Go process is actually using.
I suspect this is going to help us out in Nomad quite a bit because logmon is in the allocation resource boundary. I'm not sure if we'll ship that in the upcoming Nomad 1.0.4, but certainly for Nomad 1.1.
@tgross sorry, any updates? I can see this problem is still not solved in the 1.1 version of Nomad?
Hi @bubejur, we just shipped Nomad 1.1 a couple of days ago, so I'm sure @anastazya hasn't had a chance to try it out in their environment to verify whether it helps.
Okay. As I can see it: […]
Ok, please open a new issue if you're having problems with log monitoring of your tasks rather than piling on this one.
pmap -xp nomad agent process
ps output taken Wed Jul 7 12:50:52 UTC 2021
pmap -xp nomad agent process 1h later
ps output taken Wed Jul 7 14:20:47 UTC 2021
pmap -xp nomad agent process 2h later
ps output taken Wed Jul 7 16:22:56 UTC 2021
@notnoop how you doing?
@bubejur Hey. This ticket is on my list for this week to investigate and follow up. Thank you for your patience.
Nomad v1.1.2 (60638a0)

After updating to v1.1.2, we still see the same issue with nomad logmon. The only fix seems to be restarting nomad (systemctl restart nomad).
I implemented a cron job across all our nomad clients that executes "systemctl restart nomad && swapoff -a && swapon -a" every 24 hours, at random minutes (not all at once). So far I have managed to keep memory usage under the OOM limit, but this method might have side effects that we are not yet aware of.
@anastazya I guess it might not be a production solution :-)
Desperate times call for desperate measures. I have far too many nodes and tasks to just wait. I needed a fast solution to this OOM problem, because tasks that fail when the OOM manifests lead to data loss, and that's a huge issue for my setup.
Got the same on 1.1.4 :-(
@bubejur Thanks for your patience. I was OOO for a while and am catching up now. I have built a binary that instruments logmon to emit memory heap/allocs profiles in the […]. When you notice high memory usage of a logmon process, you can send it a […]. With the pprof files, it'll be easier to identify the cause of high memory usage and test hypotheses for fixing it. Thank you so much again for your patience.
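For context on how such an instrumented binary can work, here is a minimal Go sketch that dumps heap and allocs pprof profiles when the process receives a signal. The choice of SIGUSR1, the temp-directory file naming, and the function names are assumptions for illustration, not the actual patched logmon.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

// dumpProfiles writes heap and allocs pprof profiles to uniquely named
// files under the system temp directory.
func dumpProfiles() {
	ts := time.Now().Unix()
	for _, name := range []string{"heap", "allocs"} {
		path := fmt.Sprintf("%s/logmon-pprof-%s-%d.prof", os.TempDir(), name, ts)
		f, err := os.Create(path)
		if err != nil {
			continue // best effort: skip profiles we cannot write
		}
		runtime.GC() // refresh heap statistics before dumping
		pprof.Lookup(name).WriteTo(f, 0)
		f.Close()
	}
}

func main() {
	// Dump profiles whenever the process receives SIGUSR1 (an assumed
	// signal choice for this sketch).
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGUSR1)
	go func() {
		for range sigCh {
			dumpProfiles()
		}
	}()

	select {} // stand-in for the process's real work loop
}
```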
I will be freeing up some time in the next few days and will run that binary against some of our servers to get some info.
[nomad-logmon-pprof-allocs-1963225446.txt](https://github.com/hashicorp/nomad/files/7280540/nomad-logmon-pprof-allocs-1963225446.txt)
Thanks @bubejur! The profiles you included highlight a memory leak! There were 7,395 instances of buffered writers, each 64KB (accounting for 473.28MB). I was able to reproduce the memory leak and have the fix in #11261. Can you confirm that you have some tasks that are getting restarted or signaled frequently? That may explain it. I'll be curious if there is another cause or contributing factor.
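To make the profile's numbers concrete, here is a hypothetical Go sketch of the class of leak that a pile of 64KB buffered writers points at: a new bufio.Writer is created on every task restart or signal, and the old ones stay referenced. This is only an illustration; the actual leak and its fix live in #11261.

```go
package main

import (
	"bufio"
	"io"
)

// logSink stands in for a component that wraps its output in a 64KB
// buffered writer per task; the type and field names are illustrative,
// not Nomad's.
type logSink struct {
	writers []*bufio.Writer // retained references keep every 64KB buffer alive
}

// onTaskRestart simulates attaching a fresh buffered writer each time a
// task restarts or is signaled. Because old writers are appended and never
// released, every restart pins another 64KB buffer, which is the growth
// pattern a heap profile full of bufio.NewWriterSize allocations would show.
func (s *logSink) onTaskRestart(dst io.Writer) *bufio.Writer {
	w := bufio.NewWriterSize(dst, 64*1024)
	s.writers = append(s.writers, w) // leak: nothing ever trims this slice
	return w
}

func main() {
	s := &logSink{}
	for i := 0; i < 7395; i++ { // ~7,395 writers, roughly the 473MB from the profile
		s.onTaskRestart(io.Discard)
	}
}
```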
@notnoop yes, they're like cron tasks, but they are working as a "service". Also, can you tell me how I can fix an error like: […]
Great. The fix went out in yesterday's 1.1.6 release - give it a try and let us know of any questions. Sadly, the "transport is closing" error messages are a nuisance and not actionable - we should handle that case more gracefully. It's tracked in #10814.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I have a cluster of 20 nodes, all running "raw-exec" tasks in PHP.
At random intervals after a while I get a lot of OOMs.
I find the server with 100% swap usage, looking like this: [screenshot]
If I restart the nomad agent, it all goes back to normal for a while.
I also get this in the nomad log: "2021-01-20T17:52:28.522+0200 [INFO] client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (89) is over the limit (50)"" <--- that message is extremely ambiguous, as everything runs normally and nomad was just restarted.