Fluent Bit Kinesis Firehose Throttled and Failing to send logs, broken connections #334
Comments
So this likely means the same thing as the throughput exceeded exception: the Firehose API can return partial success, or it can return a success response that tells you that none of the records in the batch were successfully put to the stream. This generally only occurs when you are being throttled.
The first action when you get throttling errors should always be to submit a limit increase. I understand that the metrics from Firehose do not suggest throttling, but if Fluent Bit says it's being throttled, no setting change in Fluent Bit can help. You need to figure out how not to be throttled, and a limit increase is the best solution. Fluent Bit is designed to flush your data as quickly as possible, and you don't want to slow it down; slowing down your logging pipeline at the collection side may lead to log loss. The log destination needs to be able to scale to meet your log output rate. Please submit an AWS Support ticket to the Kinesis Firehose team if you need more help understanding the CloudWatch metrics they produce and why they don't match up with the throttling errors that Fluent Bit is receiving.
The other thing I want to ask is: what is your final destination from Firehose? Is it S3? Can you share your full config so I can see how many outputs you have? If you cannot get a limit increase, there may be other options to consider, such as switching to direct upload to S3, or using KPL aggregation to compress the log data.
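For reference, if the direct-upload-to-S3 option were explored, Fluent Bit's s3 output could replace the Firehose output. The sketch below is illustrative only; the bucket name, match tag, region, and sizing values are placeholders rather than anything taken from this issue.

```
[OUTPUT]
    Name             s3
    Match            kube.*
    region           us-west-2
    bucket           my-log-archive-bucket
    # Buffer locally and upload larger objects less frequently
    total_file_size  50M
    upload_timeout   5m
    use_put_object   On
    s3_key_format    /fluent-bit-logs/$TAG/%Y/%m/%d/%H_%M_%S
```

Larger total_file_size and upload_timeout values trade delivery latency for fewer API calls, which is the main lever if API-level throttling is the bottleneck.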
Hi @PettitWesley, thank you for your reply.
I've submitted a request to increase the limit.
Yes, S3 is the one and only destination we have from our Firehose stream. The delivery configuration is included in the screenshots I shared in my original post.
I think the limit increase request should go through, so hopefully that takes care of it. I was looking into direct S3 uploads at first, but this article suggested that using Firehose streams is more reliable than sending directly to S3 (under the "Fluent Bit support for Amazon Kinesis Data Firehose" section). If the limit increase doesn't give good results, I will try out Kinesis Streams or the older Firehose plugin with aggregation. Regarding the Docker image, is this #288 (comment) the correct one I should be using until a new official version of aws-for-fluent-bit is released?
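If the Kinesis Streams route is tried, the Go-based kinesis plugin is the one that exposes KPL-style aggregation. A rough sketch, with the stream name, region, and match tag as placeholders:

```
[OUTPUT]
    Name         kinesis
    Match        kube.*
    region       us-west-2
    stream       my-log-stream
    # KPL aggregation packs many log events into each Kinesis record,
    # reducing the record count that per-stream quotas are applied to
    aggregation  true
```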
Hi @rajaie-sg, the patch will be included in the official aws-for-fluent-bit image soon, when we upgrade from the 1.8 to the 1.9 Fluent Bit series.
@rajaie-sg, logs should be retried automatically, so these errors are most likely not a problem in terms of log loss. There are two things you could try to make things more stable. Second, set retry_limit to 2 or 5 to guarantee that you aren't losing data. The default is 1. You can set logConfiguration.options
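For a classic-config deployment (rather than a FireLens logConfiguration), the retry_limit suggestion maps to the per-output Retry_Limit key. A minimal sketch, with the region and delivery stream name as placeholders:

```
[OUTPUT]
    Name             kinesis_firehose
    Match            kube.*
    region           us-west-2
    delivery_stream  my-delivery-stream
    # Retry a failed chunk up to 2 times before giving up (the default is 1)
    Retry_Limit      2
```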
Hi @matthewfala,
I have more information in the original issue description about the log throughput: Does that answer your question or are you looking for another metric?
I am using the default value for that option, which is
I will try that out. Btw, I have Fluent Bit deployed on EKS, not Fargate. Thanks for the help!
Ah yes. We did change the default setting for auto_retry_requests to true. It appears the network request is not immediately retried by auto_retry_requests because connection initialization (where DNS is called) currently does not get retried on failure, so you have to wait a little while for Fluent Bit to queue the data to send again. 54 KB doesn't seem large enough to cause saturation issues. DNS issues do come up from time to time, and are probably fine to live with as long as they aren't so frequent that data begins building up.
@rajaie-sg @matthewfala It defaults to true in the 1.9 series, which AWS for Fluent Bit hasn't released yet. In 1.8, the default was false.
Thank you @PettitWesley. Then @rajaie-sg, setting auto_retry_requests to true may eliminate some of the broken-pipe-triggered full retries.
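On a 1.8-series build where the default is still false, enabling it explicitly is one more key on the Firehose output. The sketch below assumes the same placeholder output block as earlier, not the poster's actual config:

```
[OUTPUT]
    Name                 kinesis_firehose
    Match                kube.*
    region               us-west-2
    delivery_stream      my-delivery-stream
    # Immediately retry a request that failed due to a network error,
    # instead of waiting for the normal scheduler-driven chunk retry
    auto_retry_requests  true
```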
Ahh I see, OK, I will manually set it to true and report back.
Describe the question/issue
I am running Fluent Bit as a DaemonSet in EKS and have it configured to send logs to a Kinesis Firehose stream using the `kinesis_firehose` plugin. The Fluent Bit logs are showing me a lot of error and warning messages like the ones below. In some cases, the chunks that fail to be flushed are retried successfully; in other cases I see this error.
I have followed along with the many GitHub issues that discuss the same problem, and I have tried the recommended fixes in them, but I am still seeing the error.
Configuration
Fluent Bit Configuration File
Fluent Bit Log Output
Logs from Fluent Bit container startup:
Fluent Bit Version Info
Fluent Bit v1.8.11. I am using the Fluent Bit Docker image from this comment #288 (comment)
I also tried using `906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.23.3` but still ran into the issue. There were some reports saying that the older version of Fluent Bit didn't have this issue, so I tried this version of the `aws-for-fluent-bit` Docker image, https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.16.0, but still ran into the issue. Other reports said that the `firehose` Fluent Bit plugin (which has been replaced with `kinesis_firehose`) doesn't have this issue, so I tried that but still ran into the same problem.
Cluster Details
EKS Setup
Fluent Bit is deployed on my EKS cluster as a DaemonSet. The EKS cluster has about 30 nodes, but Fluent Bit is currently only monitoring logs from 3 pods across the cluster. This is due to the `Path` configuration I've set under the `INPUT` section (we only have 3 replicas of "mycontainer" running in the "production" namespace). I thought it may have to do with load, so I deployed Fluent Bit as a `Deployment` with 1 replica, but I still saw the issues above.
IAM: The Fluent Bit DaemonSet pods use a Service Account that has an IAM role configured using the `eks.amazonaws.com/role-arn:` annotation.
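To make that scoping concrete, the kind of tail input and kubernetes filter described above might look roughly like the following. The path glob, tag, and parser are assumptions based on a typical EKS container-log layout, not the actual configuration from this issue.

```
[INPUT]
    Name           tail
    Tag            kube.*
    # Only tail the "mycontainer" pods in the "production" namespace
    Path           /var/log/containers/*_production_mycontainer-*.log
    Parser         docker
    Mem_Buf_Limit  50MB

[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On
    Keep_Log   Off
```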
Kinesis Firehose Metrics
One thing I don't fully understand is why there are "Throttled Records" despite the "Incoming XXXX" metrics being far away from the throughput limits for the stream.
Kinesis Firehose Configuration
Server-side encryption is enabled
Application Details
A single replica of our application logs approximately 266 lines per second. The total size of those 266 lines is approximately 54 KB. We have 3 replicas of our application, each running on a different node.
We are also using the `kubernetes` Filter in Fluent Bit, so that extra metadata adds to the size of each log record that ultimately gets sent to Firehose. The log line from the application is ~203.007519 bytes, but with the Kubernetes metadata the final log record size is ~868 bytes.
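Taking those figures at face value, a rough back-of-the-envelope total: 266 records/s at ~868 bytes each is about 230 KB/s per replica, so across the 3 replicas roughly 800 records/s and ~0.7 MB/s reach Firehose, which matches the observation above that the "Incoming" metrics sit well below the stream's throughput limits.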
Steps to reproduce issue
I just deploy the Fluent Bit DaemonSet, and after a few minutes I start seeing the errors described above.
Related Issues
(The first two issues are most relevant.)