Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat holds open deleted file descriptions with close_removed #2608

Closed
blinken opened this issue Sep 21, 2016 · 3 comments
Closed

Filebeat holds open deleted file descriptions with close_removed #2608

blinken opened this issue Sep 21, 2016 · 3 comments
Labels
Filebeat Filebeat

Comments

@blinken
Copy link

blinken commented Sep 21, 2016

Hi,

The latest filebeat alpha seems to hold on to deleted file descriptors even with the config option close_removed: true. This eventually exhausts all available disk space. We typically see this when filebeat can't insert events into its output redis queue.

I think the issue here is partly of our own creation:

  • filebeat is sending events to a remote redis server, which occasionally fills up due to a lack of logstash capacity (ie. we see ERR Failed to RPUSH to redis list (OOM command not allowed when used memory > 'maxmemory'.) with %!v(MISSING) in the filebeat logs). In this scenario, I'd expect filebeat to start dropping events in files that get rotated away, but it seems that it just holds on to file descriptors - which eventually causes the source machine to run out of disk space.
  • We use runit for service/log management, and it can rotate log files very quickly - as fast as every few seconds if a service is logging at a high rate. I have tried lowering close_inactive to try and work around this, but this doesn't seem to have an effect if filebeat hasn't yet sent the events upstream.
  • We have services (particularly in dev) that sometimes log faster than 180k lines/sec.

Basically, we're doing bad things. But I'd expect filebeat to behave nicely in this case and close deleted file descriptors (drop events) if it can't upload anything - rather than causing the source machine to run out of disk space.

$ lsof -n -p 99804 | wc -l
788
$ ls -l /var/log/*/current | wc -l
128
$ lsof -n -p 99804 | tail -n20
filebeat 99804 root  768r   REG      202,0    306114 1989501 /var/log/service_1/current
filebeat 99804 root  770r   REG      202,0     51812 1989506 /var/log/service_2/current
filebeat 99804 root  771r   REG      202,0    999106 1971943 /var/log/service_3/@4000000057e24233011a8dfc.s (deleted)
filebeat 99804 root  775r   REG      202,0    841974 1980422 /var/log/service_4/current
filebeat 99804 root  779r   REG      202,0    895880 1989507 /var/log/service_5/current
filebeat 99804 root  794r   REG      202,0    999106 1971943 /var/log/service_3/@4000000057e24233011a8dfc.s (deleted)
filebeat 99804 root  800r   REG      202,0    999078 1989089 /var/log/service_4/@4000000057e243552075edcc.s (deleted)
filebeat 99804 root  811r   REG      202,0    999132 1971833 /var/log/service_4/@4000000057e2424a0fa1a7d4.s (deleted)
filebeat 99804 root  814r   REG      202,0   4999057 1971971 /var/log/service_6/@4000000057e24220345d828c.s
filebeat 99804 root  819r   REG      202,0    999132 1971833 /var/log/service_4/@4000000057e2424a0fa1a7d4.s (deleted)
filebeat 99804 root  839r   REG      202,0   4999019 1972009 /var/log/service_6/@4000000057e242c10c1bc11c.s
filebeat 99804 root  842r   REG      202,0    999106 1971943 /var/log/service_3/@4000000057e24233011a8dfc.s (deleted)
filebeat 99804 root  860r   REG      202,0    999048 1971882 /var/log/service_5/@4000000057e2425708aa44f4.u (deleted)
filebeat 99804 root  864r   REG      202,0    999078 1971791 /var/log/service_7/@4000000057e2433e3aec388c.s
filebeat 99804 root  869r   REG      202,0    999015 1971935 /var/log/service_8/@4000000057e243b927dcc7d4.s
filebeat 99804 root  873r   REG      202,0    999057 1972280 /var/log/service_9/@4000000057e2451f086eb6dc.s
filebeat 99804 root  891r   REG      202,0    999048 1971882 /var/log/service_5/@4000000057e2425708aa44f4.u (deleted)
filebeat 99804 root  894r   REG      202,0    999057 1972280 /var/log/service_9/@4000000057e2451f086eb6dc.s
filebeat 99804 root  895r   REG      202,0    999015 1971935 /var/log/service_8/@4000000057e243b927dcc7d4.s
filebeat 99804 root  898r   REG      202,0    999078 1971791 /var/log/service_7/@4000000057e2433e3aec388c.s

Rerunning lsof -n -p 99804 | tail -n20| grep '/var/log/service_5/@4000000057e2425708aa44f4.u' more than 5 seconds later shows the same file - ie. it's not being closed by close_inactive.

The relevant stanza from my filebeat.yml -

# runit
-
  # Paths that should be crawled and fetched. Glob based paths.
  # For each file found under this path, a harvester is started.
  paths:
    - "/var/log/*/current"

  # Type of the files. Based on this the way the file is read is decided.
  # The different types cannot be mixed in one prospector
  #
  # Possible options are:
  # * log: Reads every line of the log file (default)
  # * stdin: Reads the standard in
  input_type: log

  # Truncate messages to 32,000 bytes; ES will get sad if they're much larger
  # than this
  max_bytes: 32000

  # These three options attempt to protect disk space in the event runit goes
  # crazy. 
  close_inactive: 5s
  close_removed: true
  scan_frequency: 2s

  # Each log entry starts with a timestamp like 2016-06-06_12:30:31.77187
  multiline:
    pattern: '^[0-9]{4}-..-.._..:..:..\.[0-9]{5}[\t ]'
    negate: true
    match: after

  # The value to store in the type: field. Used to pick the right logstash
  # filter chain
  document_type: runit

  fields_under_root: true
  fields:
    runit_source: disk-archives

Key info:

  • Version: 5.0.0-alpha5
  • Operating System: Debian Wheezy / runit 2.1.1-6.2
  • Steps to Reproduce: Monitor high volume logs that are regularly rotated by runit.
@ruflin
Copy link
Member

ruflin commented Sep 21, 2016

@blinken Thanks for the detailed report. Could it be that you hit this issue here? #2541 Could you check if you still see it with the most recent snapshot? https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

@blinken
Copy link
Author

blinken commented Sep 22, 2016

Looking much better after a few minutes here on the snapshot.

Before -

$ sudo lsof -n -p 4705 | tail -n20
filebeat 4705 root 1003r   REG      202,0    999045    1985822 /var/log/service_a/@4000000057e353d438814b3c.s
filebeat 4705 root 1004r   REG      202,0    999071    1992921 /var/log/service_b/@4000000057e354232f256f64.u (deleted)
filebeat 4705 root 1005r   REG      202,0    999044    1989083 /var/log/service_c/@4000000057e352f9037abd4c.s
filebeat 4705 root 1006r   REG      202,0   4999129    1971693 /var/log/service_d/@4000000057e322691edb4cb4.s (deleted)
filebeat 4705 root 1007r   REG      202,0   4999148    1992944 /var/log/service_d/@4000000057e359561df9d5f4.s
filebeat 4705 root 1008r   REG      202,0    999064    1992950 /var/log/service_e/@4000000057e3573c0d4f5b34.s (deleted)
filebeat 4705 root 1009r   REG      202,0    999072    1972653 /var/log/service_b/@4000000057e31fdb29010454.u (deleted)
filebeat 4705 root 1010r   REG      202,0    999116    1992988 /var/log/service_e/@4000000057e35abf08b41c7c.s (deleted)
filebeat 4705 root 1011r   REG      202,0    999091    1972006 /var/log/service_e/@4000000057e354a80945e8b4.s (deleted)
filebeat 4705 root 1012r   REG      202,0    999092    1972740 /var/log/service_f/@4000000057e35f9e004ee92c.u (deleted)
filebeat 4705 root 1013r   REG      202,0    818158    1838084 /var/log/service_g/current
filebeat 4705 root 1014r   REG      202,0    999027    1966854 /var/log/service_a/@4000000057e35885291864b4.s
filebeat 4705 root 1015r   REG      202,0    999150    1992997 /var/log/service_b/@4000000057e35b902d0b97e4.u (deleted)
filebeat 4705 root 1016r   REG      202,0   4999148    1992944 /var/log/service_d/@4000000057e359561df9d5f4.s
filebeat 4705 root 1017r   REG      202,0    999082    1987536 /var/log/service_e/@4000000057e351230845e1bc.s (deleted)
filebeat 4705 root 1018r   REG      202,0    999128    1988926 /var/log/service_h/@4000000057e337ad133a9d74.s
filebeat 4705 root 1019r   REG      202,0    999066    1971628 /var/log/service_e/@4000000057e357780f0661d4.s (deleted)
filebeat 4705 root 1021r   REG      202,0    871840    1993015 /var/log/service_i/current
filebeat 4705 root 1022r   REG      202,0    999135    1989493 /var/log/service_j/@4000000057e3226110122c84.s (deleted)
filebeat 4705 root 1023r   REG      202,0    999068    1993025 /var/log/service_k/@4000000057e35ea0183d96dc.s
$ sudo lsof -n -p 4705 | wc -l
1007

Filebeat 6.0.0-alpha1-SNAPSHOT -

$ sudo lsof -n -p 34062 | tail -n20
filebeat 34062 root   45r   REG      202,0    829080 1967831 /var/log/service_l/current
filebeat 34062 root   46r   REG      202,0    338460 1971953 /var/log/service_m/current
filebeat 34062 root   47r   REG      202,0    943557 1972109 /var/log/service_n/current
filebeat 34062 root   48r   REG      202,0    468807 1972340 /var/log/service_o/current
filebeat 34062 root   49r   REG      202,0    228738 1989596 /var/log/service_p/current
filebeat 34062 root   50r   REG      202,0    662943 1989136 /var/log/service_q/current
filebeat 34062 root   51r   REG      202,0    164037 1966839 /var/log/service_a/current
filebeat 34062 root   53r   REG      202,0    855983 1971693 /var/log/service_r/current
filebeat 34062 root   55r   REG      202,0    114460 1968028 /var/log/service_i/current
filebeat 34062 root   59r   REG      202,0    976019 1971972 /var/log/service_s/current
filebeat 34062 root   63r   REG      202,0    902405 1985509 /var/log/service_t/current
filebeat 34062 root   66r   REG      202,0    197269 1967654 /var/log/service_u/current
filebeat 34062 root   73r   REG      202,0     38627 1989686 /var/log/service_v/current
filebeat 34062 root   79r   REG      202,0 139392617 1966289 /var/log/service_w/current
filebeat 34062 root   80r   REG      202,0    563141 1836822 /var/log/service_x/current
filebeat 34062 root   83r   REG      202,0   5438293 1985849 /var/log/service_y/current
filebeat 34062 root   87r   REG      202,0    813682 1992948 /var/log/service_z/current
filebeat 34062 root   92r   REG      202,0 170915030 1972335 /var/log/service_aa/current
filebeat 34062 root   94r   REG      202,0    738032 1993049 /var/log/service_ab/current
filebeat 34062 root   98r   REG      202,0   4999024 1966800 /var/log/service_d/@4000000057e3601a1749ff7c.s
$ sudo lsof -n -p 34062 | wc -l
88

@ruflin ruflin added the Filebeat Filebeat label Sep 22, 2016
@ruflin
Copy link
Member

ruflin commented Sep 22, 2016

@blinken Thanks a lot for testing. I think you hit the exact same bug that was fixed in #2541 This fix will also be part of the next beta1 release.

I'm closing this issue as I think it is fixed. In case the problem starts to appear again with the snapshot, please reopen this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Filebeat Filebeat
Projects
None yet
Development

No branches or pull requests

2 participants