
v1.1 has a serious problem: too many buffer files and 100% CPU usage when the forward node errors in a high-concurrency log system #1904

Closed
breath-co2 opened this issue Mar 19, 2018 · 4 comments

breath-co2 commented Mar 19, 2018

Our fluentd (v1.1) log system handles billions of logs per day. When the out_forward destination node fails for more than a few hours, fluentd saves millions of buffer files and runs at 100% CPU usage. Fluentd 0.12.* does not behave this way.

I think it is caused by a defect. In my fluentd config, chunk_limit_size is 4m and flush_interval is 1s, because I want the forward plugin to send data as quickly as possible. In fluentd v0.12, if the forward node errors, each buffer file grows to 4MB before a new one is created; but in fluentd v1.1, a new buffer file is written every second regardless of size, so it saves millions of files. The problem appears whenever flush_mode is interval (with a small flush_interval) or immediate.
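
To put rough numbers on it (my assumption: one queued chunk per second per tag during an outage, since the buffer below is keyed by tag):

1 chunk/s × 86,400 s/day × 2 files/chunk (.log + .log.meta) = 172,800 files/day per tag

Multiply that by the number of tags on a busy server and the file count explodes.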

Here is my test config:

<source>
  @type dummy
  tag log.test
  dummy {"hello":"world"}
</source>

<match log.*>
  @type forward
  heartbeat_interval 10s
  require_ack_response true
  ack_response_timeout 180

  <buffer tag>
    @type file
    path /data/fluentd-buffer/test/
    flush_mode interval
    flush_interval 1s               # flush every second, each chunk max size 4MB
    chunk_limit_size 4m
    retry_forever true
    retry_max_interval 3
    queue_limit_length 5000
  </buffer>

  <server>
    host 127.0.0.1    # error node
    port 9982
  </server>
</match>

Then run td-agent -c test.conf and ls -al /data/fluentd-buffer/test/:

-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cc51e092dea4fe2ce53920b93b.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cc51e092dea4fe2ce53920b93b.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cd87de6afc959a0c2699f29e2e.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cd87de6afc959a0c2699f29e2e.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cec19469d897bf3af0260362ab.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cec19469d897bf3af0260362ab.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cfc5fd5748ad87d1c2fbcef7ae.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cfc5fd5748ad87d1c2fbcef7ae.log.meta

You can see it saved 4 buffer files and 4 meta files within seconds, whereas fluentd 0.12 would keep only 1 buffer file. On a heavily loaded server this gets very scary.
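
For anyone reproducing this, a quick way to watch the file count grow, using the buffer path from the config above:

watch -n 5 'ls /data/fluentd-buffer/test/ | wc -l'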

@repeatedly (Member) commented:

> You can see it saved 4 buffer files and 4 meta files, whereas fluentd 0.12 would keep only 1 buffer file

The problem is a flushing mechanism change since v0.14. I will check it.

> 100% CPU usage

#1901: this patch may fix the problem.

@repeatedly repeatedly self-assigned this Mar 19, 2018
@repeatedly repeatedly added the v1 label Mar 19, 2018
@breath-co2 (Author) commented:

I'm not sure that's it. I have updated fluentd to v1.1.2 and tested; I think it still has the problem. With flush_interval 1s it still saves a buffer file every second, and chunk_limit_size 4m has no effect.

@repeatedly (Member) commented:

Patch for this: #1916

@repeatedly (Member) commented:

Released v1.1.3 with the queued_chunks_limit_size parameter in <buffer>.
This problem should be resolved.
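
For reference, a minimal sketch of the reporter's buffer section with the new parameter added (the value 10 is an illustrative assumption, not a recommendation from this thread):

<buffer tag>
  @type file
  path /data/fluentd-buffer/test/
  flush_mode interval
  flush_interval 1s
  chunk_limit_size 4m
  # Cap the number of queued (unsent) chunks so that frequent flushes
  # no longer create an unbounded number of small files while the
  # destination is down. The value 10 is illustrative only.
  queued_chunks_limit_size 10
  retry_forever true
  retry_max_interval 3
</buffer>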
