out_s3 memory usage since adding data ordering preservation feature #4038 #232

Closed
psoaresgit opened this issue Sep 1, 2021 · 22 comments

@psoaresgit

Bug Report

Commit 8cd50f8, introduced via #3842, appears to cause a runaway increase in memory usage.

Setting preserve_data_ordering to false does not prevent the increase.

I've reverted 8cd50f8 manually, and with HEAD at 2696f7c (via #3805) I don't see the increase in memory usage.
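For reference, a minimal sketch of producing such a revert build from source, assuming the revert applies cleanly on the v1.8.3 tag and the standard cmake build documented by Fluent Bit:

git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git checkout v1.8.3
git revert --no-edit 8cd50f8   # assumes the revert applies without conflicts
cd build && cmake .. && make   # standard source build; binary ends up in build/bin/fluent-bit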

Screenshots
ECS Memory Utilization metrics screenshot

The peaks and drops correspond to the ECS task becoming unhealthy and a new one being started by the ECS service.

The later flat line is with the commit in question reverted.

Your Environment

  • Version used: 1.8.3 via aws-for-fluent-bit 2.19.0
  • Configuration:
[OUTPUT]
    Name            s3
    Match           *
    bucket          ${BUCKET}
    region          ${AWS_REGION}
    s3_key_format   /${PREFIX}%Y/%m/%d/%H/$UUID.gz
    total_file_size 1M
    upload_timeout  1m
    use_put_object  true
    compression     gzip
    log_key         log
  • Environment name and version:
    ECS
  • Server type and version:
    EC2
  • Operating System and version:
    AL2
@psoaresgit
Author

See upstream: fluent/fluent-bit#4038

@andreas-schroeder

andreas-schroeder commented Sep 1, 2021

Can confirm; I also discovered yesterday that the S3 output plugin is leaking memory. My workloads run on ECS Fargate 1.4.0 with the same Fluent Bit distribution (2.19.0).

@zhonghui12
Contributor

@andreas-schroeder thanks for reporting. Could you please share more details, such as your config file, so we can try to reproduce the problem? Also, we just cut a release, aws-for-fluent-bit 2.19.1. Could you please upgrade to the latest version and check whether the issue still exists? Thanks.

@zhonghui12
Contributor

Hi @psoaresgit, @andreas-schroeder

I tried to reproduce the problem but still couldn't find a leak on my side. Here is the output from Valgrind:

==20719== LEAK SUMMARY:
==20719==    definitely lost: 0 bytes in 0 blocks
==20719==    indirectly lost: 0 bytes in 0 blocks
==20719==      possibly lost: 0 bytes in 0 blocks
==20719==    still reachable: 102,240 bytes in 3,428 blocks
==20719==         suppressed: 0 bytes in 0 blocks

Could you please share with me the full config file and provide more detailed information about how you run it? It would be helpful for us to fix the issue.

BTW, I ran this with Fluent Bit 1.8.6, which is the latest version. Could you please upgrade to the latest version first to see whether the leak has already been fixed? Thanks.
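For reference, a minimal sketch of running a locally built Fluent Bit under Valgrind to check a setup for leaks (the binary and config paths are placeholders):

valgrind --leak-check=full --show-leak-kinds=all \
    ./bin/fluent-bit -c /path/to/fluent-bit.conf
# Let it ingest and upload for a while, then stop it with Ctrl+C;
# Valgrind prints the LEAK SUMMARY on exit.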

@andreas-schroeder

Hi @zhonghui12

Sure, below are my details (quite similar to the ones I gave in fluent/fluent-bit#4013). For context, there is around 95 MiB/minute of logs distributed across all ECS tasks (roughly 500 KiB/sec per ECS task), with an average record size of 3000 bytes.

I've already spent quite some energy trying to get the logs to S3, both directly (which leaked with compression enabled and disabled) and via Firehose (which segfaulted, possibly due to fluent/fluent-bit#3917). I currently don't have the time or energy for another round of experiments. Do you know which older version would be viable?

  • Version used: 1.8.3 (aws-for-fluentbit distro 2.19.0)
  • Configuration:
[SERVICE]
    Parsers_File /fluent-bit-conf/parser.conf
    Streams_File /fluent-bit-conf/stream-processing.conf

[FILTER]
    Name          parser
    Match         *
    Key_Name      log
    Parser        json
    Reserve_Data  True

[OUTPUT]
    Name              cloudwatch
    Match             logs.default
    region            eu-central-1
    log_group_name    <default-log-group>
    log_stream_name   <default-log-stream>

[OUTPUT]
    Name              cloudwatch
    Match             logs.security
    region            eu-central-1
    log_group_name    <security-log-group>
    log_stream_name   <security-log-stream>

[OUTPUT]
    Name              s3
    Match             logs.trace
    region            eu-central-1
    bucket            <s3-bucket-name>
    s3_key_format     /created_at=%Y-%m-%d-%H/trace-%Y-%m-%d-%H-%M-%S-$UUID.log
    total_file_size   50M
    upload_timeout    1m
    use_put_object    On

Parsers File

[PARSER]
    Name   json
    Format json

Streams File

[STREAM_TASK]
    Name   default_logs
    Exec   CREATE STREAM default WITH (tag='logs.default') AS SELECT * from TAG:'app*' WHERE trace != true;

[STREAM_TASK]
    Name   security_logs
    Exec   CREATE STREAM security WITH (tag='logs.security') AS SELECT * from TAG:'app*' WHERE security = true;

[STREAM_TASK]
    Name   trace_logs
    Exec   CREATE STREAM trace WITH (tag='logs.trace') AS SELECT * from TAG:'app*' WHERE trace = true;
  • Environment name and version: ECS Fargate 1.4.0

Here is a graph of the memory consumption:
fluentbit-memory-leak

@PettitWesley
Contributor

PettitWesley commented Sep 2, 2021

(which segfaulted, possibly due to fluent/fluent-bit#3917 ), I currently don't have the time and can't muster up the energy to go for another round of experiments. Do you know which older version would be viable?

@andreas-schroeder We have seen (as that issue also notes) that the 1.7.x series seems to be more stable for the kinesis_firehose plugin.

We did find another issue that needed to be patched; I backported the fix to the 1.7.x series in a few images that any AWS user can pull:

144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.5
144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.9

Pull example:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.9

The firehose plugin seems to be more stable in these images.

@andreas-schroeder

Hi @PettitWesley, thanks for your input. I would prefer to go directly to S3 rather than via Firehose if possible. Which version of aws-for-fluentbit would be viable for that?

@andreas-schroeder

Version 2.19.1 is still leaking memory in my setup; I'm now checking 2.16.1. So far, that one looks fine to me.

@zhonghui12
Contributor

Hi @psoaresgit and @andreas-schroeder, I tried to reproduce the issue on my side but still couldn't find a leak. To be specific, I used the latest Fluent Bit binary and ran it under Valgrind on an EC2 instance. Here are the details of my setup.

My config file:


[SERVICE]
    Log_Level debug
[INPUT]
    Name        dummy
    Tag         dummy.data
    Dummy       {"log": {"test1":"value1", "test2": "value2"}}
[OUTPUT]
    Name s3
    Match *
    bucket <my-s3-bucket>
    region us-east-1
    s3_key_format /fluent-bit-logs/%Y/%m/%d/%H/$UUID.gz
    total_file_size 1M
    upload_timeout 1m
    use_put_object On
    compression gzip
    log_key log
    preserve_data_ordering true

Part of my output file (after gunzip):

{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}

Also, here is the output of Valgrind. Couldn't see any leak:

==26900== LEAK SUMMARY:
==26900==    definitely lost: 0 bytes in 0 blocks
==26900==    indirectly lost: 0 bytes in 0 blocks
==26900==      possibly lost: 0 bytes in 0 blocks
==26900==    still reachable: 102,240 bytes in 3,428 blocks
==26900==         suppressed: 0 bytes in 0 blocks

@andreas-schroeder

@zhonghui12 I see. Have you tried with log statements of 3000 bytes at a rate of 300/sec? That's roughly what my setup processes.

@zhonghui12
Contributor

@andreas-schroeder, I tried with your settings: log statements of 3000 bytes at a rate of 300/sec. My input config is below:

[INPUT]
    Name        dummy
    Tag         dummy.data
    Dummy       <a 3kb json string>
    Rate        300
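The <a 3kb json string> value is a placeholder for whatever record matches the real workload. Purely as an illustration, a one-liner like the following can generate a roughly 3 KB JSON record to paste into the Dummy field (the "log" key and the padding are made up for this example):

python3 -c 'import json; print(json.dumps({"log": "x" * 3000}))'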

Then I got my output:

{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}

And my Valgrind output:

==28136== LEAK SUMMARY:
==28136==    definitely lost: 0 bytes in 0 blocks
==28136==    indirectly lost: 0 bytes in 0 blocks
==28136==      possibly lost: 0 bytes in 0 blocks
==28136==    still reachable: 102,240 bytes in 3,428 blocks
==28136==         suppressed: 0 bytes in 0 blocks
==28136== Rerun with --leak-check=full to see details of leaked memory

So it seems that Valgrind shows no memory leak in our code. In my opinion, this may not be a bug but rather something in the setup that drives memory usage up, and high memory usage does not necessarily mean there is a leak. I'm sorry, but we can only help with an actual memory leak; smaller log records might help reduce your memory usage.

Also @psoaresgit, I tested with the preserve_data_ordering setting and couldn't find a memory leak either. It might be something in your settings that causes the higher memory usage. Please let me know if you have more questions about this issue.

@andreas-schroeder

@zhonghui12 a memory consumption increase from around 96 MiB up to 256 MiB (and beyond, since the task gets OOM-killed by ECS) over the course of 1-2 hours doesn't look like high-but-stable memory usage to me. I understand that you couldn't reproduce the issue; maybe I'll find time to run my setup under Valgrind so I can give you more details.

@zhonghui12
Contributor

Thanks @andreas-schroeder. Please let me know if you find something that helps us locate the leak. I would be happy to help with the fix.

@andreas-schroeder

andreas-schroeder commented Sep 3, 2021

@zhonghui12 here you go; I hope this helps. I see lots of issues from /usr/lib64/libcrypto.so.1.1.1g, but also some involving flb_malloc.

dump.log

LEAK SUMMARY:
   definitely lost: 3,732,708 bytes in 13 blocks
   indirectly lost: 569 bytes in 17 blocks
     possibly lost: 9,558,092 bytes in 25 blocks
   still reachable: 113,788 bytes in 3,774 blocks
        suppressed: 0 bytes in 0 blocks

@PettitWesley
Contributor

The output suggests that we have leaks in the Go plugin system, at least for the leaks that have a clear call trace into the code. There's one "definitely lost" warning that doesn't carry enough information to tell what caused it.

@psoaresgit
Author

psoaresgit commented Sep 14, 2021

This fixes the issue, which was introduced in v1.8.3 and is still present in v1.8.6 (current):
fluent/fluent-bit#4091
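For anyone who wants to try the fix ahead of a release, a sketch of applying the PR on top of a 1.8 checkout, assuming it cherry-picks cleanly:

git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git checkout v1.8.6
git fetch origin pull/4091/head   # fetch the PR head via GitHub's pull ref
git cherry-pick FETCH_HEAD        # assumes the change applies without conflicts
cd build && cmake .. && make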

@PettitWesley
Contributor

@psoaresgit Thank you so much; I will take a look at your patch.

@psoaresgit
Author

Thanks @andreas-schroeder for confirming you saw the same.
Thanks @zhonghui12 for putting eyes on this.
Thanks @PettitWesley for reviewing and merging the fix upstream.
Closing this.

@psoaresgit
Author

Actually, maybe I'll leave this open until a new Fluent Bit 1.8 release is included in aws-for-fluent-bit?

@PettitWesley
Contributor

@psoaresgit Sounds good. Added a pending release label.

@zhonghui12
Contributor

zhonghui12 commented Oct 1, 2021

The fix should be included in Fluent Bit 1.8.8. We'll let you all know when we release an image based on Fluent Bit 1.8.8.

@zhonghui12
Contributor

The fix is included in our latest release: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.0. I'll close this issue; feel free to reopen it if the problem still exists. Thanks.
