out_s3 memory usage since adding data ordering preservation feature #4038 #232

Closed
psoaresgit opened this issue Sep 1, 2021 · 22 comments

@psoaresgit

Bug Report

Commit 8cd50f8, introduced via #3842, appears to cause a runaway increase in memory usage.

Setting preserve_data_ordering to false does not prevent the increase.

I've reverted 8cd50f8 manually, and with HEAD at 2696f7c (via #3805) I don't see the increase in memory usage.
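For reference, a minimal sketch of producing such a revert build from source, assuming the revert applies cleanly on the v1.8.3 tag and the standard cmake build documented by Fluent Bit:

git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git checkout v1.8.3
git revert --no-edit 8cd50f8   # assumes the revert applies without conflicts
cd build && cmake .. && make   # standard source build; binary ends up in build/bin/fluent-bit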

Screenshots
ECS Memory Utilization metrics screenshot

The peaks and drops correspond to the ECS task becoming unhealthy and a new one being started by the ECS service.

The later flat line is with the commit in question reverted.

Your Environment

  • Version used: 1.8.3 via aws-for-fluent-bit 2.19.0
  • Configuration:
[OUTPUT]
    Name            s3
    Match           *
    bucket          ${BUCKET}
    region          ${AWS_REGION}
    s3_key_format   /${PREFIX}%Y/%m/%d/%H/$UUID.gz
    total_file_size 1M
    upload_timeout  1m
    use_put_object  true
    compression     gzip
    log_key         log
  • Environment name and version:
    ECS
  • Server type and version:
    EC2
  • Operating System and version:
    AL2
@psoaresgit
Author

See upstream: fluent/fluent-bit#4038

@andreas-schroeder

andreas-schroeder commented Sep 1, 2021

Can confirm; I also discovered yesterday that the S3 output plugin is leaking memory. My workloads run on ECS Fargate 1.4.0 with the same Fluent Bit distribution (2.19.0).

@zhonghui12
Contributor

@andreas-schroeder thanks for reporting. Could you please share more details, such as your config file, so we can try to reproduce the problem? Also, we just cut a release, aws-for-fluent-bit 2.19.1. Could you please upgrade to the latest version and check whether the issue still exists? Thanks.

@zhonghui12
Contributor

Hi @psoaresgit, @andreas-schroeder

I tried to reproduce the problem but still couldn't find a leak on my side. Here is the output from Valgrind:

==20719== LEAK SUMMARY:
==20719==    definitely lost: 0 bytes in 0 blocks
==20719==    indirectly lost: 0 bytes in 0 blocks
==20719==      possibly lost: 0 bytes in 0 blocks
==20719==    still reachable: 102,240 bytes in 3,428 blocks
==20719==         suppressed: 0 bytes in 0 blocks

Could you please share with me the full config file and provide more detailed information about how you run it? It would be helpful for us to fix the issue.

BTW, I ran this with Fluent Bit 1.8.6, which is the latest version. Could you please upgrade to the latest version first to see whether the leak has already been fixed? Thanks.
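For reference, a minimal sketch of running a locally built Fluent Bit under Valgrind to check a setup for leaks (the binary and config paths are placeholders):

valgrind --leak-check=full --show-leak-kinds=all \
    ./bin/fluent-bit -c /path/to/fluent-bit.conf
# Let it ingest and upload for a while, then stop it with Ctrl+C;
# Valgrind prints the LEAK SUMMARY on exit.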

@andreas-schroeder

Hi @zhonghui12

Sure, below are my details (quite similar to the ones I gave in fluent/fluent-bit#4013). For context, there is around 95 MiB/minute of logs distributed across all ECS tasks (roughly 500 KiB/sec per ECS task), with an average record size of 3000 bytes.

I've already spent quite some energy trying to get the logs to S3, both directly (which leaked with compression enabled and disabled) and via Firehose (which segfaulted, possibly due to fluent/fluent-bit#3917). I currently don't have the time or energy for another round of experiments. Do you know which older version would be viable?

  • Version used: 1.8.3 (aws-for-fluentbit distro 2.19.0)
  • Configuration:
[SERVICE]
    Parsers_File /fluent-bit-conf/parser.conf
    Streams_File /fluent-bit-conf/stream-processing.conf

[FILTER]
    Name          parser
    Match         *
    Key_Name      log
    Parser        json
    Reserve_Data  True

[OUTPUT]
    Name              cloudwatch
    Match             logs.default
    region            eu-central-1
    log_group_name    <default-log-group>
    log_stream_name   <default-log-stream>

[OUTPUT]
    Name              cloudwatch
    Match             logs.security
    region            eu-central-1
    log_group_name    <security-log-group>
    log_stream_name   <security-log-stream>

[OUTPUT]
    Name              s3
    Match             logs.trace
    region            eu-central-1
    bucket            <s3-bucket-name>
    s3_key_format     /created_at=%Y-%m-%d-%H/trace-%Y-%m-%d-%H-%M-%S-$UUID.log
    total_file_size   50M
    upload_timeout    1m
    use_put_object    On

Parsers File

[PARSER]
    Name   json
    Format json

Streams File

[STREAM_TASK]
    Name   default_logs
    Exec   CREATE STREAM default WITH (tag='logs.default') AS SELECT * from TAG:'app*' WHERE trace != true;

[STREAM_TASK]
    Name   security_logs
    Exec   CREATE STREAM security WITH (tag='logs.security') AS SELECT * from TAG:'app*' WHERE security = true;

[STREAM_TASK]
    Name   trace_logs
    Exec   CREATE STREAM trace WITH (tag='logs.trace') AS SELECT * from TAG:'app*' WHERE trace = true;
  • Environment name and version: ECS Fargate 1.4.0

Here is a graph of the memory consumption:
fluentbit-memory-leak

@PettitWesley
Contributor

PettitWesley commented Sep 2, 2021

(which segfaulted, possibly due to fluent/fluent-bit#3917 ), I currently don't have the time and can't muster up the energy to go for another round of experiments. Do you know which older version would be viable?

@andreas-schroeder We have seen (as that issue also notes) that the 1.7.x series seems to be more stable for the kinesis_firehose plugin.

We did find another issue that needed to be patched; I backported the fix to the 1.7.x series in a few images that any AWS user can pull:

144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.5
144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.9

Pull example:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/http-buffer-patch:1.7.9

The firehose plugin seems to be more stable in these images.

@andreas-schroeder

Hi @PettitWesley, thanks for your input. I would prefer to go directly to S3 rather than via Firehose if possible. Which version of aws-for-fluentbit would be viable for that?

@andreas-schroeder

Version 2.19.1 is still leaking memory in my setup; I'm now checking 2.16.1. So far, that one looks fine to me.

@zhonghui12
Contributor

Hi @psoaresgit and @andreas-schroeder, I tried to reproduce the issue on my side but still couldn't find a leak. To be specific, I used the latest Fluent Bit binary and ran it under Valgrind on an EC2 instance. Here are the details of my setup.

My config file:


[SERVICE]
    Log_Level debug
[INPUT]
    Name        dummy
    Tag         dummy.data
    Dummy       {"log": {"test1":"value1", "test2": "value2"}}
[OUTPUT]
    Name s3
    Match *
    bucket <my-s3-bucket>
    region us-east-1
    s3_key_format /fluent-bit-logs/%Y/%m/%d/%H/$UUID.gz
    total_file_size 1M
    upload_timeout 1m
    use_put_object On
    compression gzip
    log_key log
    preserve_data_ordering true

Part of my output file (after gunzip):

{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}
{"test1":"value1","test2":"value2"}

Also, here is the output of Valgrind. Couldn't see any leak:

==26900== LEAK SUMMARY:
==26900==    definitely lost: 0 bytes in 0 blocks
==26900==    indirectly lost: 0 bytes in 0 blocks
==26900==      possibly lost: 0 bytes in 0 blocks
==26900==    still reachable: 102,240 bytes in 3,428 blocks
==26900==         suppressed: 0 bytes in 0 blocks

@andreas-schroeder

@zhonghui12 I see. Have you tried with log statements of 3000 bytes at a rate of 300/sec? That's roughly what my setup processes.

@zhonghui12
Contributor

@andreas-schroeder, I tried with your settings: log statements of 3000 bytes at a rate of 300/sec. My input config is below:

[INPUT]
    Name        dummy
    Tag         dummy.data
    Dummy       <a 3kb json string>
    Rate        300
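The <a 3kb json string> value is a placeholder for whatever record matches the real workload. Purely as an illustration, a one-liner like the following can generate a roughly 3 KB JSON record to paste into the Dummy field (the "log" key and the padding are made up for this example):

python3 -c 'import json; print(json.dumps({"log": "x" * 3000}))'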

Then I got my output:

{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}
{"test1":"value1","test2":"value2","abc":"abc"}

And my Valgrind output:

==28136== LEAK SUMMARY:
==28136==    definitely lost: 0 bytes in 0 blocks
==28136==    indirectly lost: 0 bytes in 0 blocks
==28136==      possibly lost: 0 bytes in 0 blocks
==28136==    still reachable: 102,240 bytes in 3,428 blocks
==28136==         suppressed: 0 bytes in 0 blocks
==28136== Rerun with --leak-check=full to see details of leaked memory

So it seems that Valgrind shows no memory leak in our code. In my opinion, this may not be a bug but rather something in the setup that drives memory usage up, and high memory usage does not necessarily mean there is a leak. I'm sorry, but we can only help with an actual memory leak; smaller log records might help reduce your memory usage.

Also @psoaresgit, I tested with the preserve_data_ordering setting and couldn't find a memory leak either. It might be something in your settings that causes the higher memory usage. Please let me know if you have more questions about this issue.

@andreas-schroeder

@zhonghui12 a memory consumption increase from around 96 MiB up to 256 MiB (and beyond, since the task gets OOM-killed by ECS) over the course of 1-2 hours doesn't look like high-but-stable memory usage to me. I understand that you couldn't reproduce the issue; maybe I'll find time to run my setup under Valgrind so I can give you more details.

@zhonghui12
Contributor

Thanks @andreas-schroeder. Please let me know if you find something that helps us locate the leak. I would be happy to help with the fix.

@andreas-schroeder

andreas-schroeder commented Sep 3, 2021

@zhonghui12 here you go; I hope this helps. I see lots of issues from /usr/lib64/libcrypto.so.1.1.1g, but also some involving flb_malloc.

dump.log

LEAK SUMMARY:
   definitely lost: 3,732,708 bytes in 13 blocks
   indirectly lost: 569 bytes in 17 blocks
     possibly lost: 9,558,092 bytes in 25 blocks
   still reachable: 113,788 bytes in 3,774 blocks
        suppressed: 0 bytes in 0 blocks

@PettitWesley
Contributor

The output suggests that we have leaks in the Go plugin system, at least for the leaks that have a clear call trace into the code. There's one "definitely lost" warning that doesn't carry enough information to tell what caused it.

@psoaresgit
Author

psoaresgit commented Sep 14, 2021

This fixes the issue, which was introduced in v1.8.3 and is still present in v1.8.6 (current):
fluent/fluent-bit#4091
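For anyone who wants to try the fix ahead of a release, a sketch of applying the PR on top of a 1.8 checkout, assuming it cherry-picks cleanly:

git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git checkout v1.8.6
git fetch origin pull/4091/head   # fetch the PR head via GitHub's pull ref
git cherry-pick FETCH_HEAD        # assumes the change applies without conflicts
cd build && cmake .. && make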

@PettitWesley
Contributor

@psoaresgit Thank you so much; I will take a look at your patch.

@psoaresgit
Author

Thanks @andreas-schroeder for confirming you saw the same.
Thanks @zhonghui12 for putting eyes on this.
Thanks @PettitWesley for reviewing and merging the fix upstream.
Closing this.

@psoaresgit
Author

Actually, maybe I'll leave this open until a new Fluent Bit 1.8 release is included in aws-for-fluent-bit?

@PettitWesley
Contributor

@psoaresgit Sounds good. Added a pending release label.

@zhonghui12
Contributor

zhonghui12 commented Oct 1, 2021

The fix should be included in Fluent Bit 1.8.8. We'll let you all know when we release an image based on Fluent Bit 1.8.8.

@zhonghui12
Contributor

The fix is included in our latest release: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.0. I'll close this issue; feel free to reopen it if the problem still exists. Thanks.
