Starting v1.8.7, agent consumes all resources on the node when kinesis_streams output throttles #4192
Comments
The diffs between v1.8.6 and v1.8.7 are v1.8.6...v1.8.7. #4047 is one of them, and it enables an unlimited response buffer for the AWS HTTP client. @hossain-rayhan @PettitWesley What do you think about it? Note: there are other PRs for AWS as well.
Hi @nokute78, thanks for commenting. The unlimited response buffer for the AWS HTTP client might be the reason here. We will check and report back.
@nokute78 @hossain-rayhan I don't see how the unlimited buffer change could cause this... there is a limit on the max return size of the AWS Kinesis API. There is no published max size, but there is a limit imposed by the response schema. The only thing that can make the response long is returning an error for each record, and a single call can only have 500 records. So you can do the math that the max response size can't be too huge: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html

This issue does not surprise me at all, and I do not see any evidence it's a bug. When you hit throttling and Fluent Bit starts doing retries, that means you're buffering more logs and using more CPU on the failures and retries... high resource usage is expected. The main solution is to scale your Kinesis stream. Some of this is discussed in the new debugging guide I created for AWS issues: aws/aws-for-fluent-bit#266
@PettitWesley is also right here. We can get an error response back for at most 500 records, each with an error code and a message, which should not cause this OOM kill. However, I need to know why it's only happening for v1.8.7+. @nokute78 and @schoi-godaddy, is this behavior consistent between v1.8.7+ (OOM kill) and v1.8.6? Can you confirm that you see all the logs in the destination when using v1.8.6? Also, can you please share full logs for v1.8.6?
I'm also seeing consistent OOM kills in 1.8.7 that don't occur in 1.8.6. I run FB with a 200Mi limit, filesystem storage enabled, and fairly low mem_buf_limit values (5-50MB). No issues in any version < 1.8.7, except, as expected, I do run into #4044 in versions before the unlimited AWS response buffer fix. The OOM kills are very consistent and only affect fluent-bit pods, so I assume it's hitting the 200Mi limit I set (EDIT: I checked the kernel logs and this is indeed what's happening). I start seeing them within a few minutes of fluent-bit 1.8.7 pods starting. I'll try to work through the steps outlined in aws/aws-for-fluent-bit#266 and report back what I find.
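For reference, a minimal sketch of a setup along these lines (filesystem buffering plus a low per-input mem_buf_limit). The input plugin, paths, tag, and exact limits here are illustrative assumptions, not the commenter's actual configuration; the 200Mi cap itself is a Kubernetes container limit set outside Fluent Bit.

    [SERVICE]
        # keep chunks on disk so the in-memory footprint stays small
        storage.path            /var/log/flb-storage/
        storage.max_chunks_up   32

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        # spill this input to the filesystem and cap its memory buffer
        storage.type      filesystem
        Mem_Buf_Limit     10MB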
Here's the end of some debug logs from 2 fluent-bit pods before they were OOM killed. Both times the logs stop at the same place with no errors or warnings.
At least for my setup, I can confirm that reverting 8734df8 fixes the issue.
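A sketch of how one might verify that locally, assuming a standard out-of-tree CMake build of the 1.8 branch; the revert is applied on top of whatever the branch currently contains, so it may need manual conflict resolution:

    # fetch the source and revert the suspect commit
    git clone https://github.com/fluent/fluent-bit.git
    cd fluent-bit
    git checkout 1.8
    git revert 8734df8          # the response-buffer change discussed above

    # default CMake build; run with the existing config to compare memory usage
    cd build
    cmake ..
    make
    ./bin/fluent-bit -c /etc/td-agent-bit/td-agent-bit.conf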
@gabegorelick So I am personally still doubtful that the response buffer change could cause a significant enough increase in memory to be noticeable. And even if it does, that change is needed: we need to be able to read the full response, so this can't be a bug. I also don't detect evidence of a memory leak, in the sense of memory being allocated and not freed. However, maybe I am wrong. You can help us get more data by deploying the following commit I added: https://github.com/PettitWesley/fluent-bit/tree/issue-4192 It will print the final buffer size and endpoint for each request, so we can see how large the buffer is actually getting:
I have built an image with that commit on top of the latest commits in the 1.8 branch, and it can be pulled from any AWS account.

You can pull it with:

You will still need valid credentials with ECR read permissions to pull it; I've made the repo readable from any account.
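The exact image URI is in the original comment; a generic pull from a private ECR repository that has been made cross-account readable looks roughly like this (account ID, region, repository, and tag below are placeholders):

    # authenticate Docker against the ECR registry (requires ECR read permissions)
    aws ecr get-login-password --region us-west-2 \
      | docker login --username AWS --password-stdin 111111111111.dkr.ecr.us-west-2.amazonaws.com

    # pull the debug image (placeholder repository and tag)
    docker pull 111111111111.dkr.ecr.us-west-2.amazonaws.com/fluent-bit:issue-4192-debug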
All resp bufs appear to be 4096.
Still seeing OOM kills though. Anything else I should look for in the logs?
@PettitWesley, looking through this: if the response size was over the memory limit for the container, could we see the behavior @gabegorelick describes above? I'm not sure why we'd see a 200MB+ response, but I think a response like that could generate this behavior. I am very new to the Fluent Bit code base and not a C/C++ expert, so I could be off; just throwing out an idea based on what we are seeing.
So essentially, high memory usage is expected when you hit throttling, and with memory limits configured, it's thus easy to get an OOMKill. At a high level, this is expected behavior. Now, maybe something changed in Fluent Bit 1.8.7 that caused it to use more memory than before, and then we have two options:
Basically, I have limited time, and to put effort into investigating this claim I need better evidence that there was a real, non-trivial change in the performance of Fluent Bit here. Something like two test cases which I can easily reproduce, that you have run for a long time or multiple times, with Prometheus graphs showing me that the CPU/memory usage is truly non-trivially higher in 1.8.7. If you can show me something like that, and I can then repro it myself, I can try to schedule time to dive into this deeper. Or you can try going through AWS Support and have them escalate it to me that way, but I prefer the above.
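A quick way to collect the kind of numbers asked for here, assuming the pods run on Kubernetes with metrics-server available; the namespace and label selector below are assumptions, and the same loop would be run against a 1.8.6 and a 1.8.7 deployment for comparison:

    # sample per-container CPU/memory for the fluent-bit pods every 30 seconds
    while true; do
      date
      kubectl top pod -n logging -l app=fluent-bit --containers
      sleep 30
    done | tee fluent-bit-usage.log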
@PettitWesley got it. Our time is limited as well. I'll raise this with AWS Support and, if we have time, we'll dig in further. For context, the version of fluent-bit (1.7.9) we are using is on an older Debian base image which has some OpenSSL security vulnerabilities that were flagged. While it's highly unlikely those are exploitable in the context in which we are using fluent-bit, our customers and operating environment put a lot of pressure on us to update to a fixed version. While we could roll our own fluent-bit image with an updated base image, doing so rather than using an off-the-shelf AWS image invites a higher level of scrutiny that we are trying to avoid. In addition to the above, we do try to stay up to date with our dependencies. Appreciate the response, and hoping our time frees up soon so we can dig in.
So you're not currently using the AWS for Fluent Bit distro?
What AWS for Fluent Bit versions have you tried and replicated the high memory usage in? Our latest is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2 Do you have any metrics graphs to compare the resource usage? Or anything else to quantify this further?
@PettitWesley we are not using the AWS for Fluent Bit distro. 1.8.6 does not have the OOM issues, whereas 1.8.7 does (#4192 (comment)). @gabegorelick was able to pinpoint the specific commit: #4192 (comment). We currently have dropped logs due to what was fixed in 1.8.7. I will see if we can pull metrics graphs. We initially noticed this due to the OOM errors in the pods. I will see if we can pull any more specific info on what was being logged at that point in time, but it might be next week, as we have a tight deadline on a project this week.
@dylanlingelbach Yeah, if you can give me actual data on a non-trivial diff in memory usage, that'd be interesting. Otherwise, the OOMKill is just based on a limit you set; it's a single bit of information from my POV: was the memory usage above or below the limit? I want to know how much. Also, I am still very doubtful the buffer fix could explain this... you guys are only using CW? If you look at the CW API, it always returns a small response without many fields: https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html
Can you provide details on what your setup was and how you determined this? And again, can you provide the actual memory usage values with the different versions you tried?
@PettitWesley yes, we can provide these details. It might be next week before we can get you specifics, as we have a deadline this week for an unrelated project.
@dylanlingelbach OK, that's fine. Once you have everything ready, please create a new issue in the AWS for Fluent Bit repo with all the data: https://github.com/aws/aws-for-fluent-bit
@PettitWesley sounds good, will let you know when we've reproduced and have the data ready.
@PettitWesley finally found time to log aws/aws-for-fluent-bit#278
After updating to 1.8.12, I don't see the memory leak.
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the
This issue was closed because it has been stalled for 5 days with no activity. |
Bug Report

Describe the bug

td-agent-bit v1.8.8 with the kinesis_streams output starts consuming near 100% CPU and memory when the Kinesis stream starts throttling due to the shard count it is configured with. It then eventually gets killed by oom_reaper and repeats this cycle over and over while consuming the node's CPU/memory. The main problem is that this never happened prior to v1.8.7.

To Reproduce
    sh-4.2$ top
    top - 21:10:10 up 39 min,  0 users,  load average: 0.28, 0.07, 0.02
    Tasks:  85 total,   1 running,  48 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  1006888 total,    67624 free,   724416 used,   214848 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.   143248 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     7690 root      20   0  838284 629676   8856 S  99.0 62.5   0:29.70 td-agent-bit
    ...
Yum repo file (/etc/yum.repos.d/td.repo):

    [td-agent-bit]
    name=TD Agent Bit
    baseurl=https://packages.fluentbit.io/amazonlinux/2/$basearch/
    gpgcheck=1
    gpgkey=https://packages.fluentbit.io/fluentbit.key
    enabled=1

Replace the config (/etc/td-agent-bit/td-agent-bit.conf) with the following: replace the region and stream fields with your Kinesis stream region/name (a minimal example is sketched below).

Enable the service (/usr/lib/systemd/system/td-agent-bit.service):

    $ systemctl enable td-agent-bit
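A minimal sketch of what such a config might look like, assuming the tail input and the kinesis_streams output; the path, tag, stream name, and region below are placeholders rather than the reporter's actual values (auto_retry_requests is left at its default, as noted in the additional context):

    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name    tail
        Path    /var/log/messages
        Tag     kinesis.*

    [OUTPUT]
        Name      kinesis_streams
        Match     *
        region    us-east-1
        stream    my-kinesis-stream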
Expected behavior

With the same config above, up until td-agent-bit version 1.8.6, even when the Kinesis stream throttles due to insufficient shard count, it wouldn't start consuming CPU/memory to near 100%.

Screenshots
Your Environment

Version used: v1.8.8 (but noticed near-identical behavior in v1.8.7 as well)
Configuration:
Environment name and version (e.g. Kubernetes? What version?):
Server type and version: Amazon Linux
Operating System and version: amzn2-ami-hvm-2.0.20211001.1-x86_64-gp2
Filters and plugins: Just the kinesis_streams output
Additional context

I am wondering what could have changed between v1.8.6 and v1.8.7/8. I had the same td-agent-bit config (without auto_retry_requests false) that had been working for a while, and it started breaking (td-agent-bit starts consuming all the resources on the node) since v1.8.7+ anytime the Kinesis stream starts throttling (which is inevitable). Would appreciate some insight.