
Excessive memory usage although filesystem storage is enabled #6082

Closed
drbugfinder-work opened this issue Sep 22, 2022 · 4 comments · Fixed by #6210

@drbugfinder-work
Contributor

drbugfinder-work commented Sep 22, 2022

Bug Report

Describe the bug
We are trying to configure filesystem buffering to limit Fluent Bit's memory usage so that it does not get OOM-killed in a Kubernetes environment.
However, we observe huge memory consumption over time, even with the (slightly modified) examples from the documentation (https://docs.fluentbit.io/manual/administration/buffering-and-storage#output-section-configuration):

To Reproduce

fluent-bit.conf:

[SERVICE]
    flush                     1
    log_level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M
    storage.max_chunks_up     128
[INPUT]
    Name dummy
    Rate 5000
    storage.type  filesystem
[OUTPUT]
    Name forward
    Match *
    Retry_Limit  no_retries
    Tag fluentbit
    storage.total_limit_size 10G
## The output destination does not exist, so Fluent Bit should buffer (the latest 10 GB) on the filesystem
    Host not-existing-target.local
    Port 1234
    time_as_integer true

Expected behavior
We expect the output chunks to be written to the filesystem (up to 10 GB in this case) and Fluent Bit's memory usage to stay roughly constant (anywhere below ~200 MB).

Actual behavior
Memory consumption grows far beyond every expected limit (in this case to 20.3 GB RSS) after running the above configuration overnight. We cannot see any filesystem usage except for the dummy.0 input plugin, which always uses 500 KB-2 MB.

Screenshots
(three screenshots attached in the original issue)

Your Environment

  • Version used: 1.9.8
  • Environment name and version (e.g. Kubernetes? What version?): Docker, Kubernetes (containerd)
  • Operating System and version: Ubuntu 22.04

Additional context

@patrick-stephens
Contributor

Do you need to configure the memory limits as well as the filesystem ones? From: https://docs.fluentbit.io/manual/administration/buffering-and-storage#buffering-and-memory

> In a high load environment with backpressure, the risk of high memory usage is the chance of getting killed by the kernel (OOM killer). A workaround for this backpressure scenario is to limit the amount of memory in records that an input plugin can register, via the configuration property mem_buf_limit: if a plugin has enqueued more than mem_buf_limit, it won't be able to ingest more until that data can be delivered or flushed properly. In this scenario the input plugin in question is paused.

And then later on, I think this is what you're after:

> If the input plugin has enabled mem_buf_limit and storage.type as filesystem, then when reaching the mem_buf_limit threshold, instead of the plugin being paused, all new data will go to chunks that are down in the filesystem. This allows control of the memory usage by the service while also providing a guarantee that the service won't lose any data.
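
For illustration, an input section combining both settings might look like this (the 5M limit is a placeholder value, not taken from the config above):

[INPUT]
    Name          dummy
    Rate          5000
    # hypothetical limit: once enqueued records exceed it, new data goes
    # to filesystem chunks instead of the plugin being paused
    Mem_Buf_Limit 5M
    storage.type  filesystem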

The docs can always be improved, I think, so it's worth a docs PR with any suggestions here: https://github.com/fluent/fluent-bit-docs/blob/master/administration/buffering-and-storage.md

@mcauto

mcauto commented Sep 26, 2022

> Do you need to configure the memory limits as well as the filesystem ones? […]

@patrick-stephens Did you read the config file posted above, and does your answer address it?

@drbugfinder-work
Contributor Author

drbugfinder-work commented Oct 14, 2022

We found out that this issue is caused by the last line, time_as_integer true.
To us it looks like it has the same root cause as #6208.
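
Until a release with the fix is available, simply removing that option should presumably avoid the leak, since time_as_integer defaults to false; the same output section without it:

[OUTPUT]
    Name forward
    Match *
    Retry_Limit  no_retries
    Tag fluentbit
    storage.total_limit_size 10G
    Host not-existing-target.local
    Port 1234
    # time_as_integer omitted; it defaults to false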

@nokute78
Collaborator

@drbugfinder-work Thank you for the information.
I can reproduce it.

$ valgrind --leak-check=full bin/fluent-bit -i dummy -o forward -p time_as_integer=on
==40739== Memcheck, a memory error detector
==40739== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==40739== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==40739== Command: bin/fluent-bit -i dummy -o forward -p time_as_integer=on
==40739== 
[error] setting up section 'output' plugin property 'time_as_integer'
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/10/15 07:48:45] [ info] [fluent bit] version=2.0.0, commit=095d090aeb, pid=40739
[2022/10/15 07:48:45] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2022/10/15 07:48:45] [ info] [cmetrics] version=0.5.2
[2022/10/15 07:48:45] [ info] [input:dummy:dummy.0] initializing
[2022/10/15 07:48:45] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2022/10/15 07:48:46] [ info] [output:forward:forward.0] worker #0 started
[2022/10/15 07:48:46] [ info] [output:forward:forward.0] worker #1 started
[2022/10/15 07:48:46] [ info] [sp] stream processor started
==40739== Warning: client switching stacks?  SP change: 0x6ffb8b8 --> 0x549e9b0
==40739==          to suppress, use: --max-stackframe=28692232 or greater
==40739== Warning: client switching stacks?  SP change: 0x549e908 --> 0x6ffb8b8
==40739==          to suppress, use: --max-stackframe=28692400 or greater
==40739== Warning: client switching stacks?  SP change: 0x6ffb8b8 --> 0x549e908
==40739==          to suppress, use: --max-stackframe=28692400 or greater
==40739==          further instances of this message will not be shown.
[2022/10/15 07:48:47] [error] [output:forward:forward.0] no upstream connections available
[2022/10/15 07:48:47] [ warn] [engine] failed to flush chunk '40739-1665787726.199139059.flb', retry in 10 seconds: task_id=0, input=dummy.0 > output=forward.0 (out_id=0)
^C[2022/10/15 07:48:48] [engine] caught signal (SIGINT)
[2022/10/15 07:48:48] [error] [output:forward:forward.0] no upstream connections available
[2022/10/15 07:48:48] [ warn] [engine] service will shutdown in max 5 seconds
[2022/10/15 07:48:48] [ info] [input] pausing dummy.0
[2022/10/15 07:48:48] [error] [output:forward:forward.0] no upstream connections available
[2022/10/15 07:48:48] [ warn] [engine] failed to flush chunk '40739-1665787727.195758556.flb', retry in 1 seconds: task_id=1, input=dummy.0 > output=forward.0 (out_id=0)
[2022/10/15 07:48:48] [ warn] [engine] chunk '40739-1665787726.199139059.flb' cannot be retried: task_id=0, input=dummy.0 > output=forward.0
[2022/10/15 07:48:48] [ info] [task] dummy/dummy.0 has 1 pending task(s):
[2022/10/15 07:48:48] [ info] [task]   task_id=1 still running on route(s): forward/forward.0 
[2022/10/15 07:48:48] [error] [output:forward:forward.0] no upstream connections available
[2022/10/15 07:48:48] [ info] [input] pausing dummy.0
[2022/10/15 07:48:48] [ warn] [engine] chunk '40739-1665787727.195758556.flb' cannot be retried: task_id=1, input=dummy.0 > output=forward.0
[2022/10/15 07:48:49] [ info] [engine] service has stopped (0 pending tasks)
[2022/10/15 07:48:49] [ info] [input] pausing dummy.0
[2022/10/15 07:48:49] [ info] [output:forward:forward.0] thread worker #0 stopping...
[2022/10/15 07:48:49] [ info] [output:forward:forward.0] thread worker #0 stopped
[2022/10/15 07:48:49] [ info] [output:forward:forward.0] thread worker #1 stopping...
[2022/10/15 07:48:49] [ info] [output:forward:forward.0] thread worker #1 stopped
==40739== 
==40739== HEAP SUMMARY:
==40739==     in use at exit: 32,768 bytes in 4 blocks
==40739==   total heap usage: 1,629 allocs, 1,625 frees, 1,413,206 bytes allocated
==40739== 
==40739== 32,768 bytes in 4 blocks are definitely lost in loss record 1 of 1
==40739==    at 0x48487A9: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==40739==    by 0x4DE98A: msgpack_sbuffer_write (sbuffer.h:81)
==40739==    by 0x4DE5C0: msgpack_pack_array (pack_template.h:724)
==40739==    by 0x4DF61E: flb_forward_format_forward_compat_mode (forward_format.c:355)
==40739==    by 0x4DFA71: flb_forward_format (forward_format.c:491)
==40739==    by 0x4DA3D7: cb_forward_flush (forward.c:1293)
==40739==    by 0x1D7CC8: output_pre_cb_flush (flb_output.h:519)
==40739==    by 0xA7588A: co_init (amd64.c:117)
==40739== 
==40739== LEAK SUMMARY:
==40739==    definitely lost: 32,768 bytes in 4 blocks
==40739==    indirectly lost: 0 bytes in 0 blocks
==40739==      possibly lost: 0 bytes in 0 blocks
==40739==    still reachable: 0 bytes in 0 blocks
==40739==         suppressed: 0 bytes in 0 blocks
==40739== 
==40739== For lists of detected and suppressed errors, rerun with: -s
==40739== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
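
The trace points at an msgpack_sbuffer allocated in the forward compat-mode formatting path (the code path taken when time_as_integer is on) that is never released after the flush; the four lost 8,192-byte blocks match the sbuffer's initial allocation size. A rough sketch of that leak pattern with msgpack-c (illustrative only, not the actual Fluent Bit code; the real fix landed via #6210):

#include <msgpack.h>

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer  pck;

    /* pack into a growable in-memory buffer */
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);

    /* the first write mallocs the 8 KiB initial block that valgrind
       reports as definitely lost (4 flushes x 8,192 bytes = 32,768) */
    msgpack_pack_array(&pck, 2);
    msgpack_pack_str(&pck, 3);
    msgpack_pack_str_body(&pck, "tag", 3);
    msgpack_pack_uint64(&pck, 1665787726);

    /* ... hand sbuf.data / sbuf.size to the wire protocol ... */

    /* without this call, the buffer leaks once per flush */
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}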

nokute78 added a commit to nokute78/fluent-bit that referenced this issue Oct 14, 2022
edsiper pushed a commit that referenced this issue Oct 15, 2022
nokute78 added a commit to nokute78/fluent-bit that referenced this issue Oct 15, 2022
edsiper pushed a commit that referenced this issue Oct 15, 2022
mgeriesa pushed a commit to mgeriesa/fluent-bit that referenced this issue Oct 25, 2022
…eger is true (fluent#6082)

Signed-off-by: Takahiro Yamashita <[email protected]>
Signed-off-by: Manal Geries <[email protected]>
sumitd2 pushed a commit to sumitd2/fluent-bit that referenced this issue Feb 8, 2023
…eger is true (fluent#6082)

Signed-off-by: Takahiro Yamashita <[email protected]>
Signed-off-by: root <[email protected]>