Close open files when reaching "too many open files" to avoid other errors #12068
Comments
Hi @Mekk, thanks for your proposal. It'd be good to rule out a file descriptor leak here; do you actually have so many files being harvested by filebeat?
Oh, I see more info in #12069 🙂
Indeed (I hit the implicit 1024-file limit; my harvester limit was higher). I really do have that many files (apps that use a separate log per process, with pids and dates in the log names). BTW: restarting doesn't help. After a restart filebeat opens 1024 files (or whatever the limit is) and starts reporting DNS lookup failures. Tested.
Regarding workarounds: I didn't test it, but I bet an alternative (to raising the file limit) is to lower the harvester_limit's (so that their sum is noticeably below the descriptor limit). And to put things in context: in general this is my configuration error (I didn't guarantee that the sum of my harvester_limit's is noticeably lower than the system-imposed descriptor limit). In my case I simply didn't know that my descriptor limit was 1024 (I expected 65536, as limits.conf and ulimit declared). (To a degree this issue is about self-regulation instead of manual harvester_limit regulation; if it is implemented, harvester_limit may turn out to be less necessary and function mostly as a cap for less important log groups.)
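As an aside on that last point: the descriptor limit a process actually runs with can differ from what limits.conf or an interactive ulimit session shows (systemd units carry their own LimitNOFILE). A minimal Go check of the effective limit via getrlimit, purely for illustration and not filebeat code:

```go
// Illustration only (not filebeat code): print the soft and hard
// RLIMIT_NOFILE values the current process actually runs with.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// rl.Cur is the soft limit that open()/socket() calls are checked against.
	fmt.Printf("open file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```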
Ok, I see, thanks for the details. I am not sure that keeping a hard limit on the pool of open files would be so good: keeping the files open lets filebeat collect logs as soon as they are written, and prevents it from continuously closing and reopening files to check for new content. But I agree that this would be better than reaching a point of continuous failures. There are some options to control when files are kept open; for example, if your applications are not logging continuously, one thing you can try is to reduce …
I don't suggest a hard limit. I suggest handling this very specific error – once DNS resolution, a socket open to the upstream, or a similar operation returns errno=24 (or whatever the Go wrapper for this condition is), we know that we have exceeded the limit, so we must close something to get the most important descriptors working.
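For reference, a minimal Go sketch of recognizing that condition: errno 24 is syscall.EMFILE, and errors returned by the net and os packages wrap the underlying errno, so errors.Is can reach it. This is a generic illustration, not filebeat code, and the endpoint below is a placeholder:

```go
// Illustration only: recognizing EMFILE ("too many open files", errno 24)
// behind the error wrappers used by the net and os packages.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// isTooManyOpenFiles reports whether err, possibly wrapped in *net.OpError
// or *os.SyscallError, is ultimately EMFILE.
func isTooManyOpenFiles(err error) bool {
	return errors.Is(err, syscall.EMFILE)
}

func main() {
	// Placeholder endpoint; in the scenario above this would be the
	// upstream connection attempt (or a DNS lookup preceding it).
	_, err := net.Dial("tcp", "upstream.example:5044")
	if err != nil && isTooManyOpenFiles(err) {
		fmt.Println("descriptor limit reached, close something before retrying")
	}
}
```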
Another interesting failure case: …
this descriptor seems crucial too…
Yes, lots of things can fail if the open files limit is reached; this is why I wrote about the possibility of setting a limit on the number of open harvesters. I find it easier to limit that than to control all the cases where this can be a problem. @Mekk did you try to reduce …?
@Mekk there is also a …
I know about harvester_limit and use it (see my comment above, #12068 (comment)). I can't close inactive files too soon, as I have some "do something twice a day" components. In general I know that I can configure this better, and I do. This bug is about improving the behaviour when somebody misconfigures. And as I already said: it is fairly difficult to keep the harvester_limit-to-descriptor-limit relationship under control when there are a lot of inputs. As filebeat knows that it reached the limit (this is a specific, distinguishable error), it can try to manage the problem when it happens (by prioritizing "push" sockets and descriptors over those "pull"-ing data, in practice by closing something like 10-20 watching descriptors and keeping that global cap afterwards). After all, it doesn't make sense to keep watching if every attempt to forward data upstream fails.
Just faced the problem again, on filebeat 7.5.1.
If I understood your request correctly, you want Filebeat to close files if it cannot open any more files, e.g. if it encounters backpressure. But by adding this functionality we would risk users losing data without letting them opt out. We cannot add heuristics to Filebeat to select which file to close. What if the selected file is deleted before Filebeat is able to process more events? Unfortunately, we cannot retrieve that lost file for the user. Right now we are trying to get all data from all configured paths and let users explicitly configure …
For your case, I suggest you set …
Could you please read my description with some care? In the state I describe you are losing all the data. Absolutely everything. Filebeat is unable to push data upstream and is unable to save its own state.
Is it possible to reopen the issue somehow? It is clear that zube, who closed it, misunderstood it. It's not about heuristically choosing what to push and what not.
I am sorry if I did not express myself clearly. That's not what I meant. I meant choosing which files to close. If you are setting the option … If you want to be more aggressive you can set … You need to adjust these options to fit your use case and infrastructure and to minimize potential data loss. Let me know if these options work for you.
I think that the decision „what files to close” is exactly the same as the decision „which files not to open yet” when … Proper solution is of course to always set … Once more: here is the situation filebeat can enter as it is now. Imagine we have the default ulimit of 1024 descriptors (the systemd default, which filebeat packages preserve) and it happens that we have 1030 files in the log directory: …
So filebeat keeps watching all those files, reporting errors with the upstream connection and reporting errors with the registry. And it stays so forever, not doing anything except error reporting and pumping some data from logs into internal memory. Moreover, even a restart is troublesome, as it can't save the registry to preserve the state. What I suggest is that if we detect this condition (an attempt to save the registry or to connect upstream fails with the error "Too many open descriptors"), filebeat should close a few logs to be able to start working again. Keeping them open won't help with anything, after all. And those files are not to be abandoned; they will be handled later, when something else is closed (due to inactivity, file removal, etc.) – again, just like it is with logs left unmonitored due to … Another, less efficient but fairly possible option would be to close all logs in such a case, internally reduce all …
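A purely hypothetical sketch of that reaction, to make the proposal concrete: the harvesterPool type and its fields below are invented for illustration and are not filebeat's actual internals. On an EMFILE failure of a registry write or upstream connect, a handful of harvesters are closed and the reduced cap is remembered.

```go
// Hypothetical sketch of the proposed behaviour; harvesterPool and its
// fields are invented for illustration, not filebeat internals.
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// harvesterPool stands in for whatever tracks the currently open harvesters.
type harvesterPool struct {
	openFiles []string // files currently held open (simplified representation)
	softCap   int      // reduced cap remembered after an EMFILE incident
}

// onDescriptorError reacts to a failed registry write or upstream connection:
// if the cause is EMFILE (errno 24), close `release` harvesters so the vital
// descriptors (registry file, DNS, output socket) can be obtained again.
func (p *harvesterPool) onDescriptorError(err error, release int) {
	if !errors.Is(err, syscall.EMFILE) {
		return // some other failure, not a descriptor shortage
	}
	if release > len(p.openFiles) {
		release = len(p.openFiles)
	}
	closed := p.openFiles[len(p.openFiles)-release:]
	p.openFiles = p.openFiles[:len(p.openFiles)-release]
	p.softCap = len(p.openFiles) // keep the reduced global cap afterwards
	fmt.Printf("closed %d harvesters %v; their files would be picked up later, "+
		"much like files deferred by harvester_limit\n", release, closed)
}

func main() {
	pool := &harvesterPool{openFiles: []string{"a.log", "b.log", "c.log"}}
	// Simulate a "too many open files" failure reported by an upstream dial.
	pool.onDescriptorError(syscall.EMFILE, 2)
	fmt.Println("still open:", pool.openFiles, "cap:", pool.softCap)
}
```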
@Mekk thanks for your thoughtful explanations, but the more I think about this, the more I think that implementing a solution like this one would add quite a bit of complexity and provide little benefit. Let me try to explain. Complexity added: …
Little benefit: …
Please notice that the current situation is not so bad. In the worst case a user would see that Filebeat is not working, check the logs, and quickly see many "too many open files" errors. Then they could decide what to do, as they would with any other software that hits some limit. Maybe increasing the limits is enough (I actually agree that increasing the default limits can be a good thing to do, #12069). If increasing the limits is not possible, they can try which of the many options Filebeat offers works better for their use case, such as the options @kvch and I have proposed. I may be completely wrong, or missing something 🙂 but I really think that this is a problem that, when it happens, can be easily solved by configuration, and is really hard (if even possible) to solve well by changing default Filebeat behaviour.
In a setup with 74 filebeats working on various machines, it may not be immediately obvious that 3 of them stopped working. Especially when ELK is used mostly to search for problems and alert about them (the query didn't find errors, the dashboard is green…). And logs need not be that accessible if they are usually looked at via Kibana and filebeat stopped providing its own logs too. With normal setups the situation is usually not really that bad, but think also about filebeats running as sidecar containers in kubernetes pods.
I'd say simply anything except reading. From the outside-user point of view filebeat seems to have three main areas of operation - reading (handling incoming data, whatever the source), writing (pushing data out, whatever the destination) and self-management (reading configuration, writing state, writing logs, etc.). Prioritizing writing and self-management over reading seems a rather natural back-pressure technique; after all, it doesn't make much sense to grab data if it can't be handled. Well, I understand that it is too hard to implement directly, but maybe think about this as some general problem for the future.
Another approach to the whole story could be to do something very big: if we have a persisting problem with state management (whatever the problem is), perform a big reset (close everything and start from scratch, and while starting, open the more important things first). From my registry-related discussions I understand that having up-to-date state is considered very important by some users (there is a reason that by default the state is flushed every time something happens, even if that brings a huge CPU cost). So… filebeat finds itself unable to save state. And again. And again. And again. And again. And nothing happens? The state problem is realistic; in my case the failed attempt to write …

As a sidenote: #12069 is trappy. You upgrade from an init.d-based system to a systemd-based one. You have evidence of tuning done earlier, so you remember to move limits.conf, sysctl-s and such. You keep using the packaged filebeat. Out of the sudden you get hit by the 1024-descriptor limit. That's not filebeat's fault, but…
I'm also facing this issue and, like @Mekk, have had trouble noticing that one log source has stopped transmitting when the rest of our cluster is working well. While the concerns raised by @jsoriano are quite valid as well, and more intelligent configuration may certainly be required in some circumstances, holding a small pool of file handles (5-10?) to allow filebeat to be sure of recording the registry and having ports for upstream transmission would seem to be an intelligent default. As it stands, for me right now, after restarting filebeat I'm immediately back in the same state because of the large number of unread files. I'm going to need to adjust my configuration and restart to apply some of these recommendations, which will then, because I can't save the registry, result in a round of retransmission of the same data. A small pool of reserved file handles would have solved this problem, much like the …
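A sketch of that reserved-pool idea, again purely illustrative: the fdReserve type is invented, and opening os.DevNull is just a cheap way to occupy descriptor slots. The point is to pre-open a few descriptors at startup and release one when EMFILE strikes, so a vital open or connect can still succeed.

```go
// Hypothetical illustration of a reserved descriptor pool; not filebeat code.
package main

import (
	"fmt"
	"os"
)

// fdReserve holds a few descriptors that do nothing except occupy slots
// below the RLIMIT_NOFILE ceiling.
type fdReserve struct{ files []*os.File }

// newFDReserve pre-opens n descriptors on os.DevNull.
func newFDReserve(n int) (*fdReserve, error) {
	r := &fdReserve{}
	for i := 0; i < n; i++ {
		f, err := os.Open(os.DevNull)
		if err != nil {
			return nil, err
		}
		r.files = append(r.files, f)
	}
	return r, nil
}

// release closes one reserved descriptor, freeing a slot for a registry
// write or an upstream connection; returns false if the pool is exhausted.
func (r *fdReserve) release() bool {
	if len(r.files) == 0 {
		return false
	}
	f := r.files[len(r.files)-1]
	r.files = r.files[:len(r.files)-1]
	return f.Close() == nil
}

func main() {
	reserve, err := newFDReserve(8) // the 5-10 handles suggested above
	if err != nil {
		panic(err)
	}
	fmt.Println("reserved descriptors:", len(reserve.files))
	// On a "too many open files" error a caller could release one and retry:
	// if errors.Is(err, syscall.EMFILE) && reserve.release() { retry() }
}
```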
I got the same issue, …
What happened
My filebeat faced "too many open files" after some log rotations. Happens. But once it faced this condition, it also started reporting errors like … and stopped pushing any data upstream. So old logs stayed unpushed, new logs appeared, so there were more and more files, etc.
What could be done better
In my opinion filebeat could handle this situation better - after facing this specific error (too many open files) on an "upstream-related" operation (DNS query, upstream connection) it should close some logfiles to make room for those operations. This way the load would be reduced over time instead of piling up.