Starting with v1.8.7, the agent consumes all resources on the node when the kinesis_streams output throttles #4192

Closed
schoi-godaddy opened this issue Oct 14, 2021 · 23 comments

@schoi-godaddy

schoi-godaddy commented Oct 14, 2021

Bug Report

Describe the bug

td-agent-bit v1.8.8 with the kinesis_streams output starts consuming near 100% CPU and memory when the Kinesis stream starts throttling due to the shard count it is configured with. The agent is eventually killed by the oom_reaper and repeats this cycle over and over, consuming the node's CPU and memory. The main problem is that this never happened prior to v1.8.7.

To Reproduce

  • Example log messages:
# /var/log/message

Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: #033[1mFluent Bit v1.8.8#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2019-2021 The Fluent Bit Authors#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2015-2018 Treasure Data#033[0m
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: * https://fluentbit.io
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [engine] started (pid=7690)
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] version=1.1.4, initializing...
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] in-memory
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [cmetrics] version=0.2.2
Oct 14 21:06:21 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:21] [ info] [sp] stream processor started
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] 94 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:26 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:26] [ warn] [engine] failed to flush chunk '7690-1634245512.788940221.flb', retry in 6 seconds: task_id=0, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] 97 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:31 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:31] [ warn] [engine] failed to flush chunk '7690-1634245534.977337015.flb', retry in 10 seconds: task_id=1, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] 33 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:32 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:32] [ warn] [engine] chunk '7690-1634245512.788940221.flb' cannot be retried: task_id=0, input=dummy.0 > output=kinesis_streams.0
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] 96 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:36 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:36] [ warn] [engine] failed to flush chunk '7690-1634245514.968038432.flb', retry in 8 seconds: task_id=0, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] 96 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [ warn] [engine] chunk '7690-1634245534.977337015.flb' cannot be retried: task_id=1, input=dummy.0 > output=kinesis_streams.0
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] 400 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send log records
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [error] [output:kinesis_streams:kinesis_streams.0] Failed to send records to kinesis
Oct 14 21:06:41 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:41] [ warn] [engine] failed to flush chunk '7690-1634245512.968040099.flb', retry in 8 seconds: task_id=2, input=dummy.0 > output=kinesis_streams.0 (out_id=0)
Oct 14 21:06:44 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:06:44] [error] [output:kinesis_streams:kinesis_streams.0] 338 out of 500 records failed to be delivered, will retry this batch, fluent-bit-1-8-8-test
...
Oct 14 21:10:12 ip-10-10-10-10 kernel: Out of memory: Kill process 7690 (td-agent-bit) score 820 or sacrifice child
Oct 14 21:10:12 ip-10-10-10-10 kernel: Killed process 7690 (td-agent-bit) total-vm:1761932kB, anon-rss:850232kB, file-rss:0kB, shmem-rss:0kB
Oct 14 21:10:12 ip-10-10-10-10 kernel: oom_reaper: reaped process 7690 (td-agent-bit), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Oct 14 21:10:12 ip-10-10-10-10 systemd: td-agent-bit.service: main process exited, code=killed, status=9/KILL
Oct 14 21:10:12 ip-10-10-10-10 systemd: Unit td-agent-bit.service entered failed state.
Oct 14 21:10:12 ip-10-10-10-10 systemd: td-agent-bit.service failed.
Oct 14 21:10:13 ip-10-10-10-10 systemd: td-agent-bit.service holdoff time over, scheduling restart.
Oct 14 21:10:13 ip-10-10-10-10 systemd: Stopped TD Agent Bit.
Oct 14 21:10:13 ip-10-10-10-10 systemd: Started TD Agent Bit.
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: #033[1mFluent Bit v1.8.8#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2019-2021 The Fluent Bit Authors#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * #033[1m#033[93mCopyright (C) 2015-2018 Treasure Data#033[0m
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: * https://fluentbit.io
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [engine] started (pid=7710)
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] version=1.1.4, initializing...
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] in-memory
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [cmetrics] version=0.2.2
Oct 14 21:10:13 ip-10-10-10-10 td-agent-bit: [2021/10/14 21:10:13] [ info] [sp] stream processor started
sh-4.2$ top

top - 21:10:10 up 39 min,  0 users,  load average: 0.28, 0.07, 0.02
Tasks:  85 total,   1 running,  48 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1006888 total,    67624 free,   724416 used,   214848 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   143248 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                
 7690 root      20   0  838284 629676   8856 S 99.0 62.5   0:29.70 td-agent-bit                                                                                                           
  ... 
  • Steps to reproduce the problem:
  1. Start EC2 instance with regular amazonlinux v2 AMI.
  2. Create an AWS Kinesis stream and note its region and name. Make sure to set the shard count to 1 to simulate the 1,000 records/sec per-shard limit (see the AWS CLI sketch after these steps).
  3. Add the td-agent-bit yum repo to /etc/yum.repos.d/td.repo:
[td-agent-bit]
name=TD Agent Bit
baseurl=https://packages.fluentbit.io/amazonlinux/2/$basearch/
gpgcheck=1
gpgkey=https://packages.fluentbit.io/fluentbit.key
enabled=1
  4. Install td-agent-bit:
$ yum install td-agent-bit -y
  5. Replace the td-agent-bit config (/etc/td-agent-bit/td-agent-bit.conf) with the following, substituting the region and stream fields with your Kinesis stream's region and name:
[INPUT]
    Name   dummy
    Dummy  {"message":"dummy"}
    Rate   1000
    Tag    loadtest

[OUTPUT]
    Name  kinesis_streams
    Match loadtest
    region us-west-2
    stream fluent-bit-1-8-8-test
    auto_retry_requests false
  6. Replace the systemd td-agent-bit service config (/usr/lib/systemd/system/td-agent-bit.service):
[Unit]
Description=TD Agent Bit
Requires=network.target
After=network.target

[Service]
EnvironmentFile=/etc/environment
Environment=HOME=/root/
Type=simple
ExecStart=/opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf
Restart=always

[Install]
WantedBy=multi-user.target
  7. Enable td-agent-bit:
$ systemctl enable td-agent-bit
  8. Start td-agent-bit:
$ systemctl start td-agent-bit
  9. Observe the td-agent-bit log and confirm the problem described above.
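(Referenced in step 2.) A minimal AWS CLI sketch for creating the single-shard stream, assuming the AWS CLI is installed and credentials are configured; the stream name and region match the config above:

# Create the 1-shard stream used in this reproduction and confirm its shard count
aws kinesis create-stream --stream-name fluent-bit-1-8-8-test --shard-count 1 --region us-west-2
aws kinesis describe-stream-summary --stream-name fluent-bit-1-8-8-test --region us-west-2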

Expected behavior

With the same config above, up through td-agent-bit v1.8.6, the agent would not consume near 100% CPU/memory even when the Kinesis stream throttled due to insufficient shard count.

Screenshots

Your Environment

  • Version used: v1.8.8 (but noticed near identical behavior in v1.8.7 as well)

  • Configuration:

    sh-4.2$ sudo cat /etc/td-agent-bit/td-agent-bit.conf 
    [INPUT]
        Name   dummy
        Dummy  {"message":"dummy"}
        Rate   1000
        Tag    loadtest
    
    [OUTPUT]
        Name  kinesis_streams
        Match loadtest
        region us-west-2
        stream fluent-bit-1-8-8-test
        auto_retry_requests false
  • Environment name and version (e.g. Kubernetes? What version?):

  • Server type and version: Amazon Linux

  • Operating System and version: amzn2-ami-hvm-2.0.20211001.1-x86_64-gp2

  • Filters and plugins: just the kinesis_streams output

Additional context

I am wondering what might have changed between v1.8.6 and v1.8.7/8. I had the same td-agent-bit config (without auto_retry_requests false) that had been working for a while, and it started breaking (td-agent-bit consuming all the resources on the node) since v1.8.7+ whenever the Kinesis stream starts throttling (which is inevitable). Would appreciate some insight.

@nokute78
Collaborator

nokute78 commented Nov 7, 2021

I am wondering what would've potentially happened in between v1.8.6 and v1.8.7/8.

Diffs between v1.8.6 and v1.8.7 are v1.8.6...v1.8.7

#4047 is one of them; it enables an unlimited response buffer for the AWS HTTP client.

@hossain-rayhan @PettitWesley What do you think about it? Can an AWS response cause an OOM?

Note: other AWS-related PRs:
#4046: Auto retry config. I think it has no effect if auto_retry_requests is false (default).
#4075: Change log level.
#4074: Change the name.

@hossain-rayhan
Contributor

Hi @nokute78, thanks for commenting. The unlimited response buffer for the AWS HTTP client might be the reason here. We will check and report back.

@PettitWesley
Contributor

@nokute78 @hossain-rayhan I don't see how the unlimited buffer change could cause this... there is a limit on the max return size of the AWS Kinesis API. There is no published max size, but there is a limit imposed by the response schema. The only thing that can make the response long is returning an error for each record, and a single call can only have 500 records, so you can do the math: the max response size can't be very large (see https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html).

This issue does not surprise me at all and I do not see any evidence it's a bug. When you hit throttling and Fluent Bit starts doing retries, that means you're buffering more logs and using more CPU on the failures and retries... high resource usage is expected. The main solution is to scale your kinesis stream.
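For example, a rough AWS CLI sketch for adding shards to the stream used in this report (the target shard count is illustrative; pick it based on your actual ingest rate):

# Double the capacity of the single-shard stream from this issue
aws kinesis update-shard-count --stream-name fluent-bit-1-8-8-test \
    --target-shard-count 2 --scaling-type UNIFORM_SCALING --region us-west-2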

Some of this is discussed in the new debugging guide I created for AWS issues: aws/aws-for-fluent-bit#266

@hossain-rayhan
Contributor

hossain-rayhan commented Nov 9, 2021

@PettitWesley is also right here. We can get an error response back for at most 500 records, each with an error code and a message, which should not cause this OOM kill. However, I need to know why it's only happening for v1.8.7+.

@nokute78 and @schoi-godaddy, is this behavior consistent between v1.8.7+ (OOM kill) and v1.8.6? Can you confirm that you see all the logs in the destination when using v1.8.6? Also, can you please share full logs for v1.8.6?

@gabegorelick

gabegorelick commented Nov 11, 2021

I'm also seeing consistent OOM kills in 1.8.7 that don't occur in 1.8.6. But I'm using cloudwatch_logs output, not kinesis.

I run FB with a 200Mi limit, filesystem storage enabled, and fairly low mem_buf_limits (5-50MB). No issues in any version < 1.8.7, except, as expected, I do run into #4044 in versions before the unlimited AWS response buffer fix.
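Roughly the shape of the buffering limits involved, for context (an illustrative sketch rather than my exact config; paths, names, and sizes are placeholders):

[SERVICE]
    storage.path             /var/log/flb-storage/
    storage.sync             normal
    storage.max_chunks_up    128

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Tag              kube.*
    storage.type     filesystem
    Mem_Buf_Limit    5MB

[OUTPUT]
    Name                      cloudwatch_logs
    Match                     kube.*
    region                    us-east-1
    log_group_name            fluent-bit-example
    log_stream_prefix         from-fluent-bit-
    storage.total_limit_size  50M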

The OOM kills are very consistent and only affect fluent-bit pods, so I assume it's hitting the 200Mi limit I set (EDIT: I checked the kernel logs and this is indeed what's happening). I start seeing them within a few minutes of fluent-bit 1.8.7 pods starting.
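For reference, a quick way to confirm the OOM kills (a sketch; the pod name and namespace are placeholders, and dmesg assumes node access or a privileged debug pod):

# On the node: look for kernel OOM-killer activity against fluent-bit
dmesg -T | grep -iE 'out of memory|oom_reaper'

# From Kubernetes: the container's last termination reason should show OOMKilled
kubectl describe pod fluent-bit-xxxxx -n logging | grep -A 3 'Last State'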

I'll try to work through the steps outlined in aws/aws-for-fluent-bit#266 and report back what I find.

@gabegorelick

Here's the end of some debug logs from 2 fluent-bit pods before they were OOM killed. Both times the logs stop at the same place with no errors or warnings.

[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] PutLogEvents http status=200
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent events to k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent 9 events to CloudWatch
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sending 14 events
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sending log events to log stream k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:44:24] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:44:24] [debug] [upstream] KA connection #277 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] PutLogEvents http status=200
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent events to k8s-prometheus.prometheus-7798658975-9mr4m
[2021/11/11 22:44:24] [debug] [output:cloudwatch_logs:cloudwatch_logs_prometheus] Sent 14 events to CloudWatch
[2021/11/11 22:44:24] [ info] [output:cloudwatch_logs:cloudwatch_logs_kube] Creating log stream XXX.log in log group XXX
[2021/11/11 22:44:24] [debug] [upstream] KA connection #271 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:44:24] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:44:24] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:46:32] [debug] [upstream] KA connection #273 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:46:32] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:46:32] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2021/11/11 22:46:32] [debug] [upstream] KA connection #273 to logs.us-east-1.amazonaws.com:443 is now available
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] PutLogEvents http status=200
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] Sent events to journal-kubelet.XXXX
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] Sent 12 events to CloudWatch
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sending 1 events
[2021/11/11 22:46:32] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sending log events to log stream system-container-kube.var.log.containersXXXX.log
[2021/11/11 22:46:32] [debug] [upstream] KA connection #268 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/11/11 22:46:32] [debug] [http_client] not using http_proxy for header
[2021/11/11 22:46:32] [debug] [aws_credentials] Requesting credentials from the EKS provider..

@gabegorelick

At least for my setup, I can confirm that reverting 8734df8 fixes the issue.
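For anyone who wants to reproduce that, roughly what the revert looks like (a sketch; the revert may need minor conflict resolution, and the build follows Fluent Bit's standard cmake flow):

# Build v1.8.7 with the response-buffer commit reverted
git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit && git checkout v1.8.7
git revert --no-edit 8734df8
cd build && cmake .. && make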

@PettitWesley
Contributor

@gabegorelick So I am personally still doubtful that the response buffer change could cause a significant enough increase in memory to be noticeable. And even if it does, that change is needed; we need to be able to read the full response, so this can't be considered a bug. I also don't see evidence of a memory leak, in the sense of memory being allocated and not freed.

However, maybe I am wrong. You can help us get more data by deploying the following commit I added: https://github.com/PettitWesley/fluent-bit/tree/issue-4192

It will print the final buffer size and endpoint for each request, so we can see how large the buffer is actually getting:

[2021/11/17 23:04:36] [ info] final resp buf size=4096, endpoint=firehose.us-west-2.amazonaws.com
[2021/11/17 23:04:41] [ info] final resp buf size=4096, endpoint=firehose.us-west-2.amazonaws.com

I have built an image, pullable from any AWS account, that has that commit on top of the latest commits in the 1.8 branch:

144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest

You can pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest

You will still need valid credentials with ECR read permissions to pull it; I've made the repo readable from any account.
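If you don't have ecs-cli handy, a plain Docker pull should also work (a sketch, assuming the AWS CLI v2 and credentials with ECR read permissions):

aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com
docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/issue-4192:latest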

@gabegorelick

All resp bufs appear to be 4096.

[2021/11/18 03:28:14] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:14] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com
[2021/11/18 03:28:19] [ info] final resp buf size=4096, endpoint=logs.us-east-1.amazonaws.com

Still seeing OOM kills though.


Anything else I should look for in the logs?

@dylanlingelbach

@PettitWesley looking through this, if the response size was over the memory limit for the container, could we see the behavior @gabegorelick describes above? I'm not sure why we'd see a 200MB+ response, but I think a response like that could generate this behavior.

I am very new to the Fluent Bit code base and not a C/C++ expert, so I could be off; just throwing out an idea based on what we are seeing.

@PettitWesley
Contributor

@dylanlingelbach

So essentially, high memory usage is expected when you hit throttling, and with memory limits configured it's thus easy to get an OOMKill. At a high level, this is expected behavior.

Now, maybe something changed in Fluent Bit 1.8.7 that caused it to use more memory than before, in which case there are two options:

  1. There's an actual memory leak: we can detect these with Valgrind and other approaches. We don't have reports or evidence of an actual memory leak, i.e. Fluent Bit allocating memory and not freeing it when it's done using it. If you want to try this yourself, testing with Valgrind is easy (see the sketch after this list): https://github.com/aws/aws-for-fluent-bit/pull/266/files#diff-a0282cadf5f1f6cae40b5c8a42fd2e1863ea53e8b4fa308bfd5a40d4e2a41b07R55
  2. If there's no actual memory leak, then Fluent Bit is using a lot of memory for legitimate purposes; the high memory usage comes from memory that is actually being used. It is possible that we changed some code that makes the memory usage sub-optimal. Maybe it could be improved. Profiling could be done, but I don't have time for that right now.
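A minimal Valgrind invocation sketch for option 1 (expect a significant slowdown, so run it against a test config rather than in production):

valgrind --leak-check=full --show-leak-kinds=all \
    /opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf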

Basically, I have limited time, and to put effort into investigating this claim I need better evidence that there was a real, non-trivial change in the performance of Fluent Bit here.

Something like two test cases that I can easily reproduce, run for a long time or multiple times, with Prometheus graphs showing me that the CPU/memory usage is truly non-trivially higher in 1.8.7. If you can show me something like that, and I can then repro it myself, I can try to schedule time to dive into this deeper.
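As a sketch of the kind of data that would help, assuming kubelet/cAdvisor metrics are already being scraped (the Prometheus hostname below is a placeholder):

# Peak working-set memory per fluent-bit pod, via the Prometheus HTTP API
curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=max by (pod) (container_memory_working_set_bytes{pod=~"fluent-bit.*"})'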

Or, you can try going through AWS Support and trying to have them escalate it to me that way, but I prefer the above.

@dylanlingelbach

@PettitWesley got it. Our time is limited as well. I'll raise with AWS support and if we have time, we'll dig in further.

For context, the version of fluent-bit (1.7.9) we are using is on an older Debian base image that has some flagged OpenSSL security vulnerabilities. While it's highly unlikely those are exploitable in the context in which we use fluent-bit, our customers and operating environment put a lot of pressure on us to update to a fixed version. While we could roll our own fluent-bit image with an updated base image, doing so rather than using an off-the-shelf AWS image invites a higher level of scrutiny that we are trying to avoid.

In addition to the above, we do try to stay up to date with our dependencies.

Appreciate the response and hoping our time frees up soon so we can dig in.

@PettitWesley
Contributor

@dylanlingelbach

For context, the version of fluent-bit (1.7.9) we are using is on an older Debian base image

So you're not currently using the AWS for Fluent Bit distro?

put a lot of pressure on us to update to a fixed version.

What AWS for Fluent Bit versions have you tried and replicated the high memory usage in? Our latest is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2

Do you have any metrics graphs to compare the resource usage? Or anything to quantify this further?

@dylanlingelbach

@PettitWesley we are not using aws-for-fluent-bit - we are using fluent-bit directly. I'll let @gabegorelick chime in with any other specifics on how we are using fluent-bit today.

1.8.6 does not have the OOM issues, whereas 1.8.7 does (#4192 (comment)). @gabegorelick was able to pinpoint the specific commit: #4192 (comment).

We currently have dropped logs due to the issue that was fixed in 1.8.7.

I will see if we can pull metrics graphs. We initially noticed this due to the OOM errors in the pods. I will see if we can pull any more specific info on what was being logged at that point in time but it might be next week as we have a tight deadline on a project this week.

@PettitWesley
Contributor

@dylanlingelbach Yeah, if you can give me actual data on a non-trivial difference in memory usage, that'd be interesting. Otherwise, the OOMKill is just based on a limit you set; it's a single bit of information from my POV: was the mem usage above or below the limit? I want to know by how much.

Also I am still very doubtful the buffer fix could explain this... you guys are only using CW? If you look at the CW API, it always returns a small response without many fields: https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html

@PettitWesley
Contributor

At least for my setup, I can confirm that reverting 8734df8 fixes the issue.
@gabegorelick was able to pinpoint the specific commit: #4192 (comment).

Can you provide details on what your setup was and how you determined this? And again, can you provide the actual memory usage value with the different versions you tried?

@dylanlingelbach

@PettitWesley yes, we can provide these details; it might be next week before we can get you specifics, as we have a deadline this week for an unrelated project.

@PettitWesley
Contributor

@dylanlingelbach OK, that's fine. Once you have everything ready, please create a new issue in the AWS for Fluent Bit repo with all the data: https://github.com/aws/aws-for-fluent-bit

@dylanlingelbach

@PettitWesley sounds good, will let you know when we've reproduced and have the data ready

@dylanlingelbach

@PettitWesley finally found time to log aws/aws-for-fluent-bit#278

@namevic

namevic commented Feb 24, 2022

After updating to 1.8.12, I don't see the memory leak.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 26, 2022
@github-actions
Contributor

github-actions bot commented Jun 1, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Jun 1, 2022