
v1.1 has a serious problem: too many buffer files and 100% CPU usage when the forward node errors in a high-concurrency log system #1904

Closed
breath-co2 opened this issue Mar 19, 2018 · 4 comments

breath-co2 commented Mar 19, 2018

Our fluentd (v1.1) log system handles billions of logs per day. When the out_forward destination node fails for more than a few hours, fluentd saves millions of buffer files and runs at 100% CPU usage. Fluentd 0.12.* does not behave this way.

I think it is caused by a defect. In my fluentd config, chunk_limit_size is 4m and flush_interval is 1s, because I want the forward plugin to send data as quickly as possible. In fluentd v0.12, if the forward node errors, each buffer file grows to 4MB before a new one is created; but in fluentd v1.1, a new buffer file is written every second regardless of size, so it saves millions of files. The problem appears whenever flush_mode is interval (with a small flush_interval) or immediate.
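
To put rough numbers on it (my assumption: one queued chunk per second per tag during an outage, since the buffer below is keyed by tag):

1 chunk/s × 86,400 s/day × 2 files/chunk (.log + .log.meta) = 172,800 files/day per tag

Multiply that by the number of tags on a busy server and the file count explodes.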

Here is my test config:

<source>
  @type dummy
  tag log.test
  dummy {"hello":"world"}
</source>

<match log.*>
  @type forward
  heartbeat_interval 10s
  require_ack_response true
  ack_response_timeout 180

  <buffer tag>
    @type file
    path /data/fluentd-buffer/test/
    flush_mode interval
    flush_interval 1s               # flush every second, each chunk max size 4MB
    chunk_limit_size 4m
    retry_forever true
    retry_max_interval 3
    queue_limit_length 5000
  </buffer>

  <server>
    host 127.0.0.1    # error node
    port 9982
  </server>
</match>

Then run td-agent -c test.conf and ls -al /data/fluentd-buffer/test/:

-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cc51e092dea4fe2ce53920b93b.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cc51e092dea4fe2ce53920b93b.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cd87de6afc959a0c2699f29e2e.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cd87de6afc959a0c2699f29e2e.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cec19469d897bf3af0260362ab.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cec19469d897bf3af0260362ab.log.meta
-rw-r--r-- 1 root root 15 Mar 19 14:45 buffer.q567be4cfc5fd5748ad87d1c2fbcef7ae.log
-rw-r--r-- 1 root root 74 Mar 19 14:45 buffer.q567be4cfc5fd5748ad87d1c2fbcef7ae.log.meta

You can see it saved 4 buffer files and 4 meta files within seconds, whereas fluentd 0.12 would keep only 1 buffer file. On a heavily loaded server this gets very scary.
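
For anyone reproducing this, a quick way to watch the file count grow, using the buffer path from the config above:

watch -n 5 'ls /data/fluentd-buffer/test/ | wc -l'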

@repeatedly (Member) commented:

> You can see it saved 4 buffer files and 4 meta files, whereas fluentd 0.12 would keep only 1 buffer file

The problem is a flushing mechanism change since v0.14. I will check it.

> 100% CPU usage

#1901: this patch may fix the problem.

@repeatedly repeatedly self-assigned this Mar 19, 2018
@repeatedly repeatedly added the v1 label Mar 19, 2018
@breath-co2 (Author) commented:

I'm not sure that's it. I have updated fluentd to v1.1.2 and tested; I think it still has the problem. With flush_interval 1s it still saves a buffer file every second, and chunk_limit_size 4m has no effect.

@repeatedly (Member) commented:

Patch for this: #1916

@repeatedly (Member) commented:

Released v1.1.3 with the queued_chunks_limit_size parameter in <buffer>.
This problem should be resolved.
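
For reference, a minimal sketch of the reporter's buffer section with the new parameter added (the value 10 is an illustrative assumption, not a recommendation from this thread):

<buffer tag>
  @type file
  path /data/fluentd-buffer/test/
  flush_mode interval
  flush_interval 1s
  chunk_limit_size 4m
  # Cap the number of queued (unsent) chunks so that frequent flushes
  # no longer create an unbounded number of small files while the
  # destination is down. The value 10 is illustrative only.
  queued_chunks_limit_size 10
  retry_forever true
  retry_max_interval 3
</buffer>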
