out_s3 memory usage since adding data ordering preservation feature #4038 #232
See upstream: fluent/fluent-bit#4038
Can confirm; I also discovered yesterday that the S3 output plugin is leaking memory. My workloads run on ECS Fargate 1.4.0 with the same Fluent Bit distribution (2.19.0).
@andreas-schroeder thanks for your report. Could you please share more details, like your config file? Then we could try to reproduce the problem. Also, we just cut a release of aws-for-fluent-bit, 2.19.1. Could you please try upgrading to the latest version to see if the issue still exists? Thanks
Hi @psoaresgit, @andreas-schroeder, I tried to reproduce the problem but still couldn't find the leak on my side. Here is the output from Valgrind:
Could you please share the full config file and provide more details about how you run it? That would help us fix the issue. By the way, I used Fluent Bit 1.8.6, which is the latest version. Could you please try upgrading to the latest version first to see if the leak has already been fixed? Thanks.
Hi @zhonghui12 Sure, below are my details (quite similar to the ones I gave here: fluent/fluent-bit#4013 ). As for the details, there are around 95 MiB/minute of logs distributed over all ECS tasks (so around 500 KiB/sec per ECS task), with an average record size of 3000 bytes. I've already spent quite some energy trying to get the logs to S3, both directly (which leaked with compression enabled and disabled) and via Firehose (which segfaulted, possibly due to fluent/fluent-bit#3917 ). I currently don't have the time and can't muster the energy for another round of experiments. Do you know which older version would be viable?
Parsers File
Streams File
@andreas-schroeder We have seen (as that issue says) that the 1.7.x series seems to be more stable for the kinesis_firehose plugin. We did find another issue that needed to be patched, and I backported the fix to the 1.7.x series in a few images that any AWS user can pull:
Pull example:
The firehose plugin seems to be more stable in these images.
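As a hedged sketch of what such a pull could look like, the registry path below is the standard public aws-for-fluent-bit gallery and the tag is a placeholder; the actual patched images referenced above may be published under a different path or tag:

```sh
# Illustrative only: substitute the actual patched 1.7.x image reference
docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:<patched-1.7.x-tag>
```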
Hi @PettitWesley , thanks for your input. I would prefer to go directly to S3 instead of going through Firehose if possible. Which version of aws-for-fluent-bit would be suitable for that?
Version 2.19.1 is still leaking memory in my setup; checking 2.16.1 now. So far, that one looks fine to me.
Hi @psoaresgit and @andreas-schroeder, I tried to reproduce the issue on my side but still couldn't find the leak. To be specific, I used the latest Fluent Bit binary and ran it under Valgrind on an EC2 instance. Here are some details of my setup. My config file:
Part of my output file (after gunzip):
Also, here is the output of Valgrind; I couldn't see any leak:
@zhonghui12 I see. Have you tried with log statements of 3000 bytes in size at a rate of 300/sec? That's roughly what is being processed in my setup.
@andreas-schroeder, I tried your settings: log statements of 3000 bytes in size at a rate of 300/sec. My input config is below:
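The actual config was attached to the comment; as a rough sketch, a comparable load can be generated with the dummy input plugin. The tag and payload below are placeholders, and the "log" value would need to be padded to roughly 3000 bytes:

```
[INPUT]
    Name   dummy
    Tag    loadtest
    # Placeholder record; pad the "log" value to ~3000 bytes to match the reported record size
    Dummy  {"log": "<~3000 bytes of filler>"}
    # Emit 300 records per second
    Rate   300
```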
Then I got my output:
And my Valgrind output:
So it seems that Valgrind shows no memory leak in our code. In my opinion, it might not be a bug; something in the setup is causing the high memory usage, and high memory usage does not by itself mean there is a memory leak. I am sorry, but we can only help with an actual memory leak; smaller log records might help reduce your memory usage. Also @psoaresgit, I used the
@zhonghui12 a memory consumption increase from around 96 MiB up to 256 MiB (and beyond, since it gets OOMKilled by ECS) over the course of 1-2 hours doesn't look like high-but-stable memory usage to me. I understand that you couldn't reproduce the issue; maybe I'll find time to run my setup under Valgrind to give you more details.
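For reference, a minimal sketch of running Fluent Bit under Valgrind to capture leak traces; the binary and config paths are assumptions and depend on the installation:

```sh
# Full leak check with call stacks; the paths below are placeholders
valgrind --leak-check=full --show-leak-kinds=definite,indirect \
    /opt/fluent-bit/bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
```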
Thanks @andreas-schroeder. Please let me know if you find something that helps us locate the leak. I would be happy to help with the fix.
@zhonghui12 here you go, I hope this helps. I see lots of issues from /usr/lib64/libcrypto.so.1.1.1g, but also some involving flb_malloc.
The output suggests that we have leaks in the Go plugin system here:
At least, that holds for the leaks that have a clear call trace into the code. There's one
This fixes the issue from the version where it was introduced (v1.8.3) through v1.8.6 (current):
@psoaresgit Thank you so much; I will take a look at your patch.
Thanks @andreas-schroeder for confirming you saw the same.
Well maybe I'll leave this open until a new release of fluent-bit 1.8 is included in aws-for-fluent-bit?
@psoaresgit Sounds good. Added a pending release label.
The fix should be included in Fluent Bit 1.8.8. We will let you all know when we release an image based on Fluent Bit 1.8.8.
The fix is included in our latest release: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.0. I will close this issue; feel free to reopen it if the problem still exists. Thanks.
Bug Report
The commit (8cd50f8) via #3842 seems to cause a runaway increase in memory usage.
Using `preserve_data_ordering false` does not prevent the memory usage increase. I've reverted 8cd50f8 manually, and, with HEAD at 2696f7c via #3805, I don't see the increase in memory usage.
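For context, a minimal sketch of an out_s3 output stanza with the option in question; the bucket, region, and size values are placeholders, not the ones used in this report:

```
[OUTPUT]
    Name                    s3
    Match                   *
    region                  us-east-1
    bucket                  example-log-bucket
    use_put_object          On
    compression             gzip
    total_file_size         50M
    upload_timeout          10m
    preserve_data_ordering  false
```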
Screenshots
The peaks/drops are the ECS task becoming unhealthy and a new one being run in the ECS service.
The later flat-line is with the commit in question reverted.
Your Environment
ECS
EC2
AL2