Tens of thousands of open file descriptors to a single nomad alloc logs directory #2024
Comments
Hey, could you post the open file descriptors in a gist? A couple of questions:
@sheldonkwok Can you attach the logs and the open file descriptors? Does this happen regularly? The memory profile will help a lot!
Using the docker driver. Debug is on now so I'll have to catch it when it's in the bad state before it reboots.
Bad news. Getting
repeated many, many times.
Got anything in the client or server logs relevant to the issue? E.g. repeated actions?
A lot of
We encountered this issue again after hitting the logging API for one of the allocs. It's crashing multiple production Nomad clients.
@sheldonkwok Can you show
Nothing too crazy, it seems.
Does this reproduce every time you log, or just occasionally? What is the log command you are running? Are the file descriptors open on the Nomad client or the executor?
We're not sure if it's every time or just occasionally. The descriptors are open on the client.
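One way to double-check that, and to see what the descriptors actually point at, is to group a process's /proc/&lt;pid&gt;/fd entries by link target, which is roughly what grouping lsof output does. A minimal Linux-only sketch in Go, purely illustrative; pass the Nomad client or executor PID as the first argument:

```go
// Summarize a process's open file descriptors by link target (Linux only).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fdsummary <pid>")
		os.Exit(1)
	}
	fdDir := filepath.Join("/proc", os.Args[1], "fd")

	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Count descriptors per link target (file, directory, socket, pipe, ...).
	counts := map[string]int{}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue // descriptor went away while we were iterating
		}
		counts[target]++
	}

	// Print targets sorted by how many descriptors reference them.
	targets := make([]string, 0, len(counts))
	for t := range counts {
		targets = append(targets, t)
	}
	sort.Slice(targets, func(i, j int) bool { return counts[targets[i]] > counts[targets[j]] })
	for _, t := range targets {
		fmt.Printf("%6d  %s\n", counts[t], t)
	}
}
```

A directory like .../alloc/logs showing up tens of thousands of times in that output would match what lsof reported here.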
@sheldonkwok I cannot reproduce this on 0.4.1, 0.5.0, or master. It would be great if you could get a vagrant box or some reproduction steps so I can debug this. As of now it is a bit hard since you can't even attach a profiler.
We're still working on some reproduction steps because nothing we do immediately crashes the system and we're alerted of issues later.
Managed to get the client into a state where the memory is low but the number of file descriptors open is still high. Here's the mem profile.
Could you get the profile too and send both the heap and profile output?
Thanks,
Alex
…On Dec 1, 2016, 3:16 PM -0800, Sheldon Kwok wrote:
Managed to get the client into a state where the memory is low but the number of file descriptors open is still high. Here's the mem profile.
http://pastebin.com/XbPH9qhw
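For anyone gathering these, a rough sketch of pulling both the heap and the CPU profile from the agent, assuming the agent runs with enable_debug so the standard net/http/pprof endpoints are served on its HTTP port (the 127.0.0.1:4646 address below is just the default and may need adjusting):

```go
// Fetch heap and 30s CPU profiles from a Nomad agent's (assumed) pprof endpoints.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads a single URL to a local file.
func fetch(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://127.0.0.1:4646/debug/pprof"
	if err := fetch(base+"/heap", "heap.prof"); err != nil {
		fmt.Fprintln(os.Stderr, "heap profile:", err)
	}
	// The CPU profile blocks for the sampling window (30 seconds here).
	if err := fetch(base+"/profile?seconds=30", "cpu.prof"); err != nil {
		fmt.Fprintln(os.Stderr, "cpu profile:", err)
	}
}
```

Both files can then be opened with go tool pprof.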
Tree: http://pastebin.com/Up3kLmvi Does this help?
We upgraded to Nomad 0.5 and we're still experiencing this issue when we make multiple requests to an allocation's log route simultaneously. Our use case is that many of our devs want to see logs for their service at the same time.
@sheldonkwok Can you get me this: How are you retrieving the logs? Programmatically?
Will do next time it happens.
My current thought is you may never be closing the connection, which is causing this. The goroutines should help correlate the number of descriptors to the number of times it's been called (goroutines handling the request).
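For context, here is a bare-bones programmatic follower against what I believe is the allocation logs route, /v1/client/fs/logs/&lt;alloc-id&gt;; the task name and query parameters are illustrative and worth checking against the HTTP API docs. The relevant detail is the deferred Body.Close: a caller that follows the stream and never closes the response keeps the connection, and the descriptors behind it, open on the client for as long as it lives.

```go
// Follow one allocation's log stream and make sure the connection is closed.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: follow <alloc-id>")
		os.Exit(1)
	}
	// "web" is a placeholder task name; adjust task/type/address for your job.
	url := fmt.Sprintf(
		"http://127.0.0.1:4646/v1/client/fs/logs/%s?task=web&type=stderr&follow=true",
		os.Args[1],
	)

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Without this Close (or an equivalent cancellation), the streaming
	// connection, and the descriptors backing it on the client, would stay
	// open for as long as the process lives.
	defer resp.Body.Close()

	// Stream the raw response body to stdout until the server ends the stream.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```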
I was considering that but the number of file descriptors jumped from dozens to the tens of thousands pretty quickly when we only made a handful of simultaneous requests.
I reproduced this behavior with
We unfortunately had to stop using the nomad logs/logs route because of this issue. We are instead ingesting the logs through Filebeat/Logstash/Elasticsearch. It doesn't handle streaming as well, though. Will continue to try to reproduce.
@leonoff Awesome! Thank you!
The file descriptor issue just crept back onto our system today running Nomad 0.5.2. Will try to reproduce. Just letting everyone know.
This has still been happening a lot for us lately. Today there was an actual panic - I opened a separate ticket here, but there was heavy utilization of logs during the time of the crash. I still have a feeling there's an edge case in here somewhere...
@justinwalz Do you have a list of the file descriptors that were open?
I got to the machine after the nomad process had already died, but I will get a dump next time it occurs live.
Hi, it occurred again. Nomad is using up a ton of RAM, and has a high count of open file descriptors.
Truncated output from
To add more context, I've attached the client logs. There are some errors about the logs API and too many open files included. For what it's worth, Nomad healed itself - both memory usage back under control and file descriptors back to normal without needing to restart.
@justinwalz Okay re-opening this. Do you have any reproduction steps?
We're still running into this issue every week. It makes the server unresponsive and forces reallocations for all of the jobs on it. This issue happens across all of our jobs intermittently when we attempt to fetch logs. It's difficult to reproduce but seems guaranteed to pop up with enough logging requests over time on a job. The shared logging config is
@sheldonkwok I will make time to play with this soon. I was never able to reproduce it, however. Are you running both the server and client on one machine?
Thanks Alex. We are running 5 servers on their own machines and the clients are run separately as well.
Hey, just tried the following and couldn't reproduce. Ran this job:
The job produces 10 MB of logs every few seconds and I streamed the logs of all 20 allocations to 3 different machines. I did this for 30+ minutes and transferred many GBs of logs from the client. @sheldonkwok I did look at your logs and it looks like the streaming connection is held for 50+ minutes but it is still streaming stdout.0/stderr.0? Are the apps just not logging much? I am going to try that as well and will report back.
Just tailed logs for over an hour and saw nothing as well. @sheldonkwok, any more detail on the logging pattern would be great.
Hey @dadgar, I really appreciate the efforts you've spent trying to reproduce it! A normal logging pattern that we see is multiple people tailing the logs of multiple services at the same time for many hours. Some of these services may be the same service as well.
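In case it helps with reproduction, a hypothetical harness that mimics that pattern by holding several concurrent followers open per allocation against one client, reusing the same assumed logs route as the sketch above; alloc IDs are passed on the command line and the task name is again illustrative:

```go
// Hold many concurrent log-follow connections open against one Nomad client.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
)

// follow keeps a single streaming log connection open until the server
// closes it, discarding the data; the point is only to hold the stream alive.
func follow(allocID string, wg *sync.WaitGroup) {
	defer wg.Done()
	url := fmt.Sprintf(
		"http://127.0.0.1:4646/v1/client/fs/logs/%s?task=web&type=stdout&follow=true",
		allocID,
	)
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintf(os.Stderr, "%s: %v\n", allocID, err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
}

func main() {
	var wg sync.WaitGroup
	for _, id := range os.Args[1:] { // alloc IDs passed on the command line
		for i := 0; i < 5; i++ { // five concurrent followers per allocation
			wg.Add(1)
			go follow(id, &wg)
		}
	}
	wg.Wait()
}
```

While it runs, lsof -p $(pidof nomad) | wc -l on the client should roughly track the number of followers; a count that keeps climbing after the followers exit would point at the leak.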
@sheldonkwok Hm, unfortunately I did try that. So I had 3 nodes, each running ~7 allocations. I was then following the logs of all 20 allocations on 3 different machines for over an hour and a half. I repeated this for two different workloads:
Neither reproduced. Further, when I stopped the logging, not a single extra FD was used (as in, there was no leaking that would eventually cause a problem).
Unfortunately, we cannot reproduce it effectively either. It generally happens once a week. We have thousands of allocations being created and running daily over hundreds of machines. In addition, users can be tailing thousands of them at once. Thanks again @dadgar
@sheldonkwok Do you retain Nomad log files? Would you be willing to run a debug build that adds some logging that may help get to the bottom of this?
Sure, what do we set and how much extra logging does it generate?
@sheldonkwok You will set
@sheldonkwok I'm doing some cleanup of old issues and came across this one. The Nomad logging infrastructure has undergone many changes since this was originally opened. If you're still experiencing any problems, would you please open a new issue? Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.4.1
Operating system and Environment details
Linux ip-10-201-5-129 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Issue
Nomad has tens of thousands of open file descriptors to an alloc log directory.
nomad 2143 root *530r DIR 202,80 4096 8454154 /var/lib/ssi/nomad/alloc/14e62a40-8598-2fed-405e-ca237bc940c6/alloc/logs
Something similar to that repeated ~60000 times.
lsof -p 2143 | wc -l
returns ~60000. I stopped the alloc but the descriptors are still there.
In addition, the nomad process is approaching 55 GB of memory used.