Starting with v1.8.7, the agent consumes all resources on the node when the kinesis_streams output throttles #4192

Closed
schoi-godaddy opened this issue Oct 14, 2021 · 23 comments

@schoi-godaddy

schoi-godaddy commented Oct 14, 2021

Bug Report

Describe the bug

td-agent-bit v1.8.8 with the kinesis_streams output starts consuming near 100% CPU and memory when the Kinesis stream starts throttling due to the shard count it is configured with. The agent is eventually killed by the oom_reaper and repeats this cycle over and over, consuming the node's CPU and memory. The main problem is that this never happened prior to v1.8.7.

To Reproduce

  • Example log messages:
# /var/log/message

Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: #033[1mFluent Bit v1.8.8#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2019-2021 The Fluent Bit Authors#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2015-2018 Treasure Data#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * https://fluentbit.io
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [engine] started (pid=7690)
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] version=1.1.4, initializing...
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] in-memory
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [cmetrics] version=0.2.2
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [sp] stream processor started
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] 94 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [ warn] [engine] failed to flush chunk '7690-1634245512.788940221.flb', retry in 6 seconds: task_id=0, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] 97 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [ warn] [engine] failed to flush chunk '7690-1634245534.977337015.flb', retry in 10 seconds: task_id=1, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] 33 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [ warn] [engine] chunk '7690-1634245512.788940221.flb' cannot be retried: task_id=0, input=dummy.0 > output=kinesis_streams.0
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] 96 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [ warn] [engine] failed to flush chunk '7690-1634245514.968038432.flb', retry in 8 seconds: task_id=0, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] 96 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [ warn] [engine] chunk '7690-1634245534.977337015.flb' cannot be retried: task_id=1, input=dummy.0 > output=kinesis_streams.0
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] 400 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [ warn] [engine] failed to flush chunk '7690-1634245512.968040099.flb', retry in 8 seconds: task_id=2, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:44 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:44] [error] [output:kinesis_streams:kinesis_streams.0] 338 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
...
Oct 14 21:10:12 ip-10-10-10-10 kernel: Out of memory: Kill process 7690 (td-agent-bit) score 820 or sacrifice child
Oct 14 21:10:12 ip-10-10-10-10 kernel: Killed process 7690 (td-agent-bit) total-vm:1761932kB, anon-rss:850232kB, file-rss:0kB, shmem-rss:0kB
Oct 14 21:10:12 ip-10-10-10-10 kernel: oom_reaper: reaped process 7690 (td-agent-bit), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Oct 14 21:10:12 ip-10-10-10-10 systemd: td-agent-bit.service: main process exited, code=killed, status=9/KILL
Oct 14 21:10:12 ip-10-10-10-10 systemd: Unit td-agent-bit.service entered failed state.
Oct 14 21:10:12 ip-10-10-10-10 systemd: td-agent-bit.service failed.
Oct 14 21:10:13 ip-10-10-10-10 systemd: td-agent-bit.service holdoff time over, scheduling restart.
Oct 14 21:10:13 ip-10-10-10-10 systemd: Stopped TD Agent Bit.
Oct 14 21:10:13 ip-10-10-10-10 systemd: Started TD Agent Bit.
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: #033[1mFluent Bit v1.8.8#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2019-2021 The Fluent Bit Authors#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2015-2018 Treasure Data#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * https://fluentbit.io
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [engine] started (pid=7710)
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] version=1.1.4, initializing...
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] in-memory
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [cmetrics] version=0.2.2
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [sp] stream processor started
sh-4.2$ top

top - 21:10:10 up 39 min,  0 users,  load average: 0.28, 0.07, 0.02
Tasks:  85 total,   1 running,  48 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1006888 total,    67624 free,   724416 used,   214848 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   143248 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                
 7690 root      20   0  838284 629676   8856 S 99.0 62.5   0:29.70 td-agent-bit                                                                                                           
  ... 
  • Steps to reproduce the problem:
  1. Start EC2 instance with regular amazonlinux v2 AMI.
  2. Create an AWS Kinesis stream and note its region and name. Make sure to set the shard count to 1 to simulate the 1,000 records/sec per-shard limit (see the AWS CLI sketch after these steps).
  3. Add the td-agent-bit yum repo to /etc/yum.repos.d/td.repo:
[td-agent-bit]
name=TD Agent Bit
baseurl=https://packages.fluentbit.io/amazonlinux/2/$basearch/
gpgcheck=1
gpgkey=https://packages.fluentbit.io/fluentbit.key
enabled=1
  4. Install td-agent-bit:
$ yum install td-agent-bit -y
  5. Replace the td-agent-bit config (/etc/td-agent-bit/td-agent-bit.conf) with the following, substituting the region and stream fields with your Kinesis stream's region and name:
[INPUT]
    Name   dummy
    Dummy  {"message":"dummy"}
    Rate   1000
    Tag    loadtest

[OUTPUT]
    Name  kinesis_streams
    Match loadtest
    region us-west-2
    stream fluent-bit-1-8-8-test
    auto_retry_requests false
  6. Replace the systemd td-agent-bit service config (/usr/lib/systemd/system/td-agent-bit.service):
[Unit]
Description=TD Agent Bit
Requires=network.target
After=network.target

[Service]
EnvironmentFile=/etc/environment
Environment=HOME=/root/
Type=simple
ExecStart=/opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf
Restart=always

[Install]
WantedBy=multi-user.target
  7. Enable td-agent-bit:
$ systemctl enable td-agent-bit
  8. Start td-agent-bit:
$ systemctl start td-agent-bit
  9. Observe the td-agent-bit log and confirm the problem described above.
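(Referenced in step 2.) A minimal AWS CLI sketch for creating the single-shard stream, assuming the AWS CLI is installed and credentials are configured; the stream name and region match the config above:

# Create the 1-shard stream used in this reproduction and confirm its shard count
aws kinesis create-stream --stream-name fluent-bit-1-8-8-test --shard-count 1 --region us-west-2
aws kinesis describe-stream-summary --stream-name fluent-bit-1-8-8-test --region us-west-2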

Expected behavior

With the same config above, up through td-agent-bit v1.8.6, the agent would not consume near 100% CPU/memory even when the Kinesis stream throttled due to insufficient shard count.

Screenshots

Your Environment

  • Version used: v1.8.8 (but noticed near identical behavior in v1.8.7 as well)

  • Configuration:

    sh-4.2$ sudo cat /etc/td-agent-bit/td-agent-bit.conf 
    [INPUT]
        Name   dummy
        Dummy  {"message":"dummy"}
        Rate   1000
        Tag    loadtest
    
    [OUTPUT]
        Name  kinesis_streams
        Match loadtest
        region us-west-2
        stream fluent-bit-1-8-8-test
        auto_retry_requests false
  • Environment name and version (e.g. Kubernetes? What version?):

  • Server type and version: Amazon Linux

  • Operating System and version: amzn2-ami-hvm-2.0.20211001.1-x86_64-gp2

  • Filters and plugins: just the kinesis_streams output

Additional context

I am wondering what might have changed between v1.8.6 and v1.8.7/8. I had the same td-agent-bit config (without auto_retry_requests false) that had been working for a while, and it started breaking (td-agent-bit consuming all the resources on the node) since v1.8.7+ whenever the Kinesis stream starts throttling (which is inevitable). Would appreciate some insight.

@nokute78
Collaborator

nokute78 commented Nov 7, 2021

I am wondering what would've potentially happened in between v1.8.6 and v1.8.7/8.

Diffs between v1.8.6 and v1.8.7 are v1.8.6...v1.8.7

#4047 is one of them; it enables an unlimited response buffer for the AWS HTTP client.

@hossain-rayhan @PettitWesley What do you think about it? Can an AWS response cause an OOM?

Note: other AWS-related PRs:
#4046: Auto retry config. I think it has no effect if auto_retry_requests is false (default).
#4075: Change log level.
#4074: Change the name.

@hossain-rayhan
Contributor

Hi @nokute78, thanks for commenting. The unlimited response buffer for the AWS HTTP client might be the reason here. We will check and report back.

@PettitWesley
Contributor

@nokute78 @hossain-rayhan I don't see how the unlimited buffer change could cause this... there is a limit on the max return size of the AWS Kinesis API. There is no published max size, but there is a limit imposed by the response schema. The only thing that can make the response long is returning an error for each record, and a single call can only have 500 records, so you can do the math: the max response size can't be very large (see https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html).

This issue does not surprise me at all and I do not see any evidence it's a bug. When you hit throttling and Fluent Bit starts doing retries, that means you're buffering more logs and using more CPU on the failures and retries... high resource usage is expected. The main solution is to scale your kinesis stream.
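For example, a rough AWS CLI sketch for adding shards to the stream used in this report (the target shard count is illustrative; pick it based on your actual ingest rate):

# Double the capacity of the single-shard stream from this issue
aws kinesis update-shard-count --stream-name fluent-bit-1-8-8-test \
    --target-shard-count 2 --scaling-type UNIFORM_SCALING --region us-west-2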

Some of this is discussed in the new debugging guide I created for AWS issues: aws/aws-for-fluent-bit#266

@hossain-rayhan
Contributor

hossain-rayhan commented Nov 9, 2021

@PettitWesley is also right here. We can get an error response back for at most 500 records, each with an error code and a message, which should not cause this OOM kill. However, I need to know why it's only happening for v1.8.7+.

@nokute78 and @schoi-godaddy, is this behavior consistent between v1.8.7+ (OOM kill) and v1.8.6? Can you confirm that you see all the logs in the destination when using v1.8.6? Also, can you please share full logs for v1.8.6?

@gabegorelick

gabegorelick commented Nov 11, 2021

I'm also seeing consistent OOM kills in 1.8.7 that don't occur in 1.8.6. But I'm using cloudwatch_logs output, not kinesis.

I run FB with a 200Mi limit, filesystem storage enabled, and fairly low mem_buf_limits (5-50MB). No issues in any version < 1.8.7, except, as expected, I do run into #4044 in versions before the unlimited AWS response buffer fix.
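Roughly the shape of the buffering limits involved, for context (an illustrative sketch rather than my exact config; paths, names, and sizes are placeholders):

[SERVICE]
    storage.path             /var/log/flb-storage/
    storage.sync             normal
    storage.max_chunks_up    128

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Tag              kube.*
    storage.type     filesystem
    Mem_Buf_Limit    5MB

[OUTPUT]
    Name                      cloudwatch_logs
    Match                     kube.*
    region                    us-east-1
    log_group_name            fluent-bit-example
    log_stream_prefix         from-fluent-bit-
    storage.total_limit_size  50M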

The OOM kills are very consistent and only affect fluent-bit pods, so I assume it's hitting the 200Mi limit I set (EDIT: I checked the kernel logs and this is indeed what's happening). I start seeing them within a few minutes of fluent-bit 1.8.7 pods starting.
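For reference, a quick way to confirm the OOM kills (a sketch; the pod name and namespace are placeholders, and dmesg assumes node access or a privileged debug pod):

# On the node: look for kernel OOM-killer activity against fluent-bit
dmesg -T | grep -iE 'out of memory|oom_reaper'

# From Kubernetes: the container's last termination reason should show OOMKilled
kubectl describe pod fluent-bit-xxxxx -n logging | grep -A 3 'Last State'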

I'll try to work through the steps outlined in aws/aws-for-fluent-bit#266 and report back what I find.

@gabegorelick

Here's the end of some debug logs from 2 fluent-bit pods before they were OOM killed. Both times the logs stop at the same place with no errors or warnings.

[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] PutLogEvents http status=200
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent events to k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent 9 events to CloudWatch
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sending 14 events
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sending log events to log stream k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:44:24] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] PutLogEvents http status=200
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent events to k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent 14 events to CloudWatch
[2021/11/11 22:44:24] [ info] [output:cloudwatch_logs:cloudwatch_logs_kube] Creating log stream XXX.log in log group XXX
[2021/11/11 22:44:24] [debug] [upstream] KA connection #271 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:44:24] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:46:32] [debug] [upstream] KA connection #273 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:46:32] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:46:32] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:46:32] [debug] [upstream] KA connection #273 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] PutLogEvents http status=200
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] Sent events to journal-kubelet.XXXX
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] Sent 12 events to CloudWatch
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sending 1 events
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sending log events to log stream system-container-kube.var.log.containersXXXX.log
[2021/11/11 22:46:32] [debug] [upstream] KA connection #268 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:46:32] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:46:32] [debug] [aws_credentials] Requesting credentials from the EKS provider..

@gabegorelick

At least for my setup, I can confirm that reverting 8734df8 fixes the issue.
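For anyone who wants to reproduce that, roughly what the revert looks like (a sketch; the revert may need minor conflict resolution, and the build follows Fluent Bit's standard cmake flow):

# Build v1.8.7 with the response-buffer commit reverted
git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit && git checkout v1.8.7
git revert --no-edit 8734df8
cd build && cmake .. && make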

@PettitWesley
Contributor

@gabegorelick So I am personally still doubtful that the response buffer change could cause a significant enough increase in memory to be noticeable. And even if it does, that change is needed; we need to be able to read the full response, so this can't be considered a bug. I also don't see evidence of a memory leak, in the sense of memory being allocated and not freed.

However, maybe I am wrong. You can help us get more data by deploying the following commit I added: https://github.com/PettitWesley/fluent-bit/tree/issue-4192

It will print the final buffer size and endpoint for each request, so we can see how large the buffer is actually getting:

[2021/11/17 23:04:36] [ info] final resp buf size=4096, endpoint=firehose.us-west-2.amazonaws.com
[2021/11/17 23:04:41] [ info] final resp buf size=4096, endpoint=firehose.us-west-2.amazonaws.com

I have built an image, pullable from any AWS account, that has that commit on top of the latest commits in the 1.8 branch:

144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest

You can pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest

You will still need valid credentials with ECR read permissions to pull it; I've made the repo readable from any account.
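If you don't have ecs-cli handy, a plain Docker pull should also work (a sketch, assuming the AWS CLI v2 and credentials with ECR read permissions):

aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com
docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest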

@gabegorelick

All resp bufs appear to be 4096.

[2021/11/18 03:28:14] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:14] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com

Still seeing OOM kills though.


Anything else I should look for in the logs?

@dylanlingelbach

@PettitWesley looking through this, if the response size was over the memory limit for the container, could we see the behavior @gabegorelick describes above? I'm not sure why we'd see a 200MB+ response, but I think a response like that could generate this behavior.

I am very new to the Fluent Bit code base and not a C/C++ expert, so I could be off; just throwing out an idea based on what we are seeing.

@PettitWesley
Contributor

@dylanlingelbach

So essentially, high memory usage is expected when you hit throttling, and with memory limits configured it's thus easy to get an OOMKill. At a high level, this is expected behavior.

Now, maybe something changed in Fluent Bit 1.8.7 that caused it to use more memory than before, in which case there are two options:

  1. There's an actual memory leak: we can detect these with Valgrind and other approaches. We don't have reports or evidence of an actual memory leak, i.e. Fluent Bit allocating memory and not freeing it when it's done using it. If you want to try this yourself, testing with Valgrind is easy (see the sketch after this list): https://github.com/aws/aws-for-fluent-bit/pull/266/files#diff-a0282cadf5f1f6cae40b5c8a42fd2e1863ea53e8b4fa308bfd5a40d4e2a41b07R55
  2. If there's no actual memory leak, then Fluent Bit is using a lot of memory for legitimate purposes; the high memory usage comes from memory that is actually being used. It is possible that we changed some code that makes the memory usage sub-optimal. Maybe it could be improved. Profiling could be done, but I don't have time for that right now.
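A minimal Valgrind invocation sketch for option 1 (expect a significant slowdown, so run it against a test config rather than in production):

valgrind --leak-check=full --show-leak-kinds=all \
    /opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf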

Basically, I have limited time, and to put effort into investigating this claim I need better evidence that there was a real, non-trivial change in the performance of Fluent Bit here.

Something like two test cases that I can easily reproduce, run for a long time or multiple times, with Prometheus graphs showing me that the CPU/memory usage is truly non-trivially higher in 1.8.7. If you can show me something like that, and I can then repro it myself, I can try to schedule time to dive into this deeper.
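As a sketch of the kind of data that would help, assuming kubelet/cAdvisor metrics are already being scraped (the Prometheus hostname below is a placeholder):

# Peak working-set memory per fluent-bit pod, via the Prometheus HTTP API
curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=max by (pod) (container_memory_working_set_bytes{pod=~"fluent-bit.*"})'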

Or, you can try going through AWS Support and trying to have them escalate it to me that way, but I prefer the above.

@dylanlingelbach

@PettitWesley got it. Our time is limited as well. I'll raise with AWS support and if we have time, we'll dig in further.

For context, the version of fluent-bit (1.7.9) we are using is on an older Debian base image that has some flagged OpenSSL security vulnerabilities. While it's highly unlikely those are exploitable in the context in which we use fluent-bit, our customers and operating environment put a lot of pressure on us to update to a fixed version. While we could roll our own fluent-bit image with an updated base image, doing so rather than using an off-the-shelf AWS image invites a higher level of scrutiny that we are trying to avoid.

In addition to the above, we do try to stay up to date with our dependencies.

Appreciate the response and hoping our time frees up soon so we can dig in.

@PettitWesley
Contributor

@dylanlingelbach

For context, the version of fluent-bit (1.7.9) we are using is on an older Debian base image

So you're not currently using the AWS for Fluent Bit distro?

put a lot of pressure on us to update to a fixed version.

What AWS for Fluent Bit versions have you tried and replicated the high memory usage in? Our latest is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2

Do you have any metrics graphs to compare the resource usage? Or anything to quantify this further?

@dylanlingelbach

@PettitWesley we are not using aws-for-fluent-bit - we are using fluent-bit directly. I'll let @gabegorelick chime in with any other specifics on how we are using fluent-bit today.

1.8.6 does not have the OOM issues, whereas 1.8.7 does (#4192 (comment)). @gabegorelick was able to pinpoint the specific commit: #4192 (comment).

We currently have dropped logs due to the issue that was fixed in 1.8.7.

I will see if we can pull metrics graphs. We initially noticed this due to the OOM errors in the pods. I will see if we can pull any more specific info on what was being logged at that point in time but it might be next week as we have a tight deadline on a project this week.

@PettitWesley
Contributor

@dylanlingelbach Yeah, if you can give me actual data on a non-trivial difference in memory usage, that'd be interesting. Otherwise, the OOMKill is just based on a limit you set; it's a single bit of information from my POV: was the mem usage above or below the limit? I want to know by how much.

Also I am still very doubtful the buffer fix could explain this... you guys are only using CW? If you look at the CW API, it always returns a small response without many fields: https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html

@PettitWesley
Contributor

At least for my setup, I can confirm that reverting 8734df8 fixes the issue.
@gabegorelick was able to pinpoint the specific commit: #4192 (comment).

Can you provide details on what your setup was and how you determined this? And again, can you provide the actual memory usage value with the different versions you tried?

@dylanlingelbach

@PettitWesley yes, we can provide these details; it might be next week before we can get you specifics, as we have a deadline this week for an unrelated project.

@PettitWesley
Contributor

@dylanlingelbach OK, that's fine. Once you have everything ready, please create a new issue in the AWS for Fluent Bit repo with all the data: https://github.com/aws/aws-for-fluent-bit

@dylanlingelbach

@PettitWesley sounds good, will let you know when we've reproduced and have the data ready

@dylanlingelbach

@PettitWesley finally found time to log aws/aws-for-fluent-bit#278

@namevic

namevic commented Feb 24, 2022

After updating to 1.8.12, I don't see the memory leak.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 26, 2022
@github-actions
Contributor

github-actions bot commented Jun 1, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Jun 1, 2022