
Fluentbit Kinesis Firehose Throttled and Failing to send logs, broken connections #334

Closed
rajaie-sg opened this issue Apr 21, 2022 · 11 comments


@rajaie-sg

rajaie-sg commented Apr 21, 2022

Describe the question/issue

I am running Fluent Bit as a DaemonSet in EKS and have it configured to send logs to a Kinesis Firehose delivery stream using the kinesis_firehose plugin. The Fluent Bit logs are showing a lot of error and warning messages like the ones below:

[error] [output:kinesis_firehose:kinesis_firehose.0] Thoughput limits may have been exceeded, PUT-S3-sdVlo
[error] [output:kinesis_firehose:kinesis_firehose.0] PutRecordBatch request returned with no records successfully recieved, PUT-S3-sdVlo
[error] [http_client] broken connection to firehose.us-west-2.amazonaws.com:443 ?
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to PUT-S3-sdVlo
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[ warn] [engine] failed to flush chunk '1-1650551707.85849208.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=kinesis_firehose.0 (out_id=0)

In some cases the chunks that fail to flush are retried successfully; in other cases I see this error:

[ warn] [engine] chunk '1-1650551689.51293014.flb' cannot be retried: task_id=7, input=tail.0 > output=kinesis_firehose.0

I have followed the many GitHub issues that discuss the same problem and tried the fixes recommended in them, but I am still seeing the errors.

Configuration

Fluent Bit Configuration File

  fluent-bit.conf: |
    [SERVICE]
        Parsers_File parser.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*_production_mycontainer-*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     256MB
        DB.locking        true
        Rotate_Wait       30
        Docker_Mode       On
        Docker_Mode_Flush 10
        Skip_Long_Lines   On
        Refresh_Interval  10
    [FILTER]
        Name kubernetes
        Match kube.*
        Labels Off
        Annotations Off
    [OUTPUT]
        Name                         kinesis_firehose
        Match                        *
        delivery_stream              PUT-S3-sdVlo
        region                       us-west-2
  parser.conf: |
    [PARSER]
        Name   docker
        Format json

Fluent Bit Log Output

Logs from Fluent Bit container startup:

Fluent Bit v1.8.11
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/04/21 01:04:52] [ info] [engine] started (pid=1)
[2022/04/21 01:04:52] [ info] [storage] version=1.1.5, initializing...
[2022/04/21 01:04:52] [ info] [storage] in-memory
[2022/04/21 01:04:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/04/21 01:04:52] [ info] [cmetrics] version=0.2.2
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2022/04/21 01:04:52] [ info] [sp] stream processor started

Fluent Bit Version Info

Fluent Bit v1.8.11. I am using the Fluent Bit Docker image from this comment #288 (comment)

I also tried using 906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.23.3 but still ran into the issue.

There were some reports saying that the older version of Fluent Bit didn't have this issue, so I tried this version of the aws-for-fluent-bit Docker image https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.16.0 but still ran into the issue.

Other reports said that the older firehose Fluent Bit plugin (which has since been replaced by kinesis_firehose) doesn't have this issue, so I tried that but still ran into the problem.

Cluster Details

EKS Setup

Fluent Bit is deployed on my EKS cluster as a DaemonSet. The EKS cluster has about 30 nodes, but Fluent Bit is currently only monitoring logs from 3 pods across the cluster. This is due to the Path configuration I've set under the INPUT section (we only have 3 replicas of "mycontainer" running in the "production" namespace):

        Path              /var/log/containers/*_production_mycontainer-*.log

I thought it may have to do with load, so I deployed Fluent Bit as a Deployment with 1 replica, but I still saw the issues above.

IAM: The Fluent Bit DaemonSet pods use a Service Account that has an IAM role configured via the eks.amazonaws.com/role-arn annotation.

Kinesis Firehose Metrics
[screenshots: Kinesis Firehose delivery stream metrics]

One thing I don't fully understand is why there are "Throttled Records" despite the "Incoming XXXX" metrics being far below the throughput limits for the stream.

Kinesis Firehose Configuration
[screenshots: Kinesis Firehose delivery stream configuration]

Server-side encryption is enabled

Application Details

A single replica of our application logs approximately 266 lines per second. The total size of those 266 lines is approximately 54 KB. We have 3 replicas of our application, each running on a different node.

We are also using the kubernetes filter in Fluent Bit, so the extra metadata adds to the size of each log record that ultimately gets sent to Firehose. The log line from the application averages ~203 bytes, but with the Kubernetes metadata the final log record size is ~868 bytes, so across the 3 replicas that works out to roughly 3 × 266 × 868 bytes ≈ 0.7 MB/s sent to Firehose.

Steps to reproduce issue

I just deploy the Fluent Bit Daemonset and after a few minutes I start seeing the errors described above.

Related Issues

(The first two issues are most relevant.)

@PettitWesley
Contributor

[error] [output:kinesis_firehose:kinesis_firehose.0] PutRecordBatch request returned with no records successfully recieved, PUT-S3-sdVlo

This likely means the same thing as the throughput-exceeded error: the Firehose API can return partial success, or it can return a success response indicating that none of the records in the batch were put to the stream. That generally only happens when you are being throttled.

The first action when you get throttling errors should always be to submit a limit increase. I understand that the metrics from Firehose do not suggest throttling, but if Fluent Bit says it's being throttled, no setting change in Fluent Bit can help; you need to figure out how not to be throttled, and a limit increase is the best solution. Fluent Bit is designed to flush your data as quickly as possible, and you don't want to slow it down: slowing down your logging pipeline at the collection side may lead to log loss. The log destination needs to be able to scale to meet your log output rate. Please submit an AWS Support ticket to the Kinesis Firehose team if you need more help understanding the CloudWatch metrics they produce and why they don't match up with the throttling errors Fluent Bit is receiving.

The other thing I want to ask is: what is your final destination from Firehose? Is it S3? Can you share your full config so I can see how many outputs you have?

If you cannot get a limit increase, there may be other options you can consider, such as switching to direct upload to S3, or using KPL aggregation to compress the log data.
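For the direct-to-S3 option, a minimal sketch of what that output could look like with Fluent Bit's s3 plugin (the bucket name is a placeholder and the sizing values are only examples):

    [OUTPUT]
        Name             s3
        Match            *
        # placeholder bucket name
        bucket           my-log-bucket
        region           us-west-2
        # buffer up to 50 MB or 10 minutes of data before each upload
        total_file_size  50M
        upload_timeout   10m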

@PettitWesley PettitWesley changed the title Bug - Fluentbit Kinesis Firehose Failing to send logs, broken connections Fluentbit Kinesis Firehose Throttled and Failing to send logs, broken connections Apr 25, 2022
@rajaie-sg
Author

rajaie-sg commented Apr 25, 2022

Hi @PettitWesley, thank you for your reply.

The first action when you get throttling errors should always be to submit a limit increase.

I've submitted a request to increase the Delivery Stream Throughput (MiB/Sec) limit on the stream. As I understood from here, that should increase the other limits as well ("The three quota scale proportionally.").

what is your final destination from Firehose?

Yes, S3 is the one and only destination we have from our Firehose stream. The delivery configuration is included in the screenshots I shared in my original post.

If you can not get a limit increase, there may be other options that you can consider, such as switching to direct upload to S3

I think the limit increase request should go through, so hopefully that takes care of it. I was looking into direct S3 uploads at first, but this article suggested that using Firehose streams is more reliable than sending directly to S3 (under the "Fluent Bit support for Amazon Kinesis Data Firehose" section).

If the limit increase doesn't give any good results, I will try out the Kinesis Streams or the older Firehose plugin with aggregation.

Regarding the Docker image, is this #288 (comment) the correct one I should be using until a new official version of aws-for-fluent-bit is released with the patches?

@matthewfala
Contributor

Hi @rajaie-sg,
The priority event loop patch I wrote was for very high throughput use cases, around 50-100 Mb/s to Firehose via Fluent Bit, and as you mentioned it is not yet included in the official aws-for-fluent-bit versions. If the errors are due to Firehose delivery limits, then you probably do not need the patch.

The patch will be included in the official aws-for-fluent-bit soon when we upgrade from 1.8 to the 1.9 Fluent Bit series.

@rajaie-sg
Author

rajaie-sg commented Apr 26, 2022

Thanks for the clarification!

After increasing the limits for the Kinesis Firehose stream, the number of warnings/errors has gone down significantly. I'm also seeing 0 throttled records in the Firehose metrics.

[screenshots: Firehose metrics after the limit increase]

You can see the drop-off happen at 8pm EDT / 12am UTC, when the limit increase came into effect.

I am, however, still seeing occasional errors like the ones below:

Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit warn [2022/04/26 14:38:22] [ warn] [net] getaddrinfo(host='firehose.us-west-2.amazonaws.com', err=12): Timeout while contacting DNS servers
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [aws_client] connection initialization error
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to PUT-S3-sdVlo
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit warn [2022/04/26 14:38:22] [ warn] [engine] failed to flush chunk '1-1650983890.224982081.flb', retry in 6 seconds: task_id=7, input=tail.0 > output=kinesis_firehose.0

From the logs, it seems that every "Failed to send log records" error is preceded by either a getaddrinfo(host='firehose.us-west-2.amazonaws.com', err=12): Timeout while contacting DNS servers message or a broken connection to firehose.us-west-2.amazonaws.com:443 message. The DNS timeout is much more common, though.

DNS server issues graphed alongside "send log failure":
[screenshot]

Broken connection to firehose graphed alongside "send log failure":
[screenshot]

Any ideas on what may be going on or what steps I can take to resolve those errors?

@matthewfala
Contributor

@rajaie-sg,
Roughly what log throughput are you sending at, as in how many logs per second?

Logs should be retried automatically, so these errors most likely do not mean log loss.

There are two things you could try to make things more stable:
First, enable auto_retry_requests by adding "auto_retry_requests": "true" to logConfiguration.options in your task definition, which will have Fluent Bit quickly retry network requests that fail.

Second, set retry_limit to 2 or 5 to make sure you aren't losing data; the default is 1. You can set "retry_limit": "2" in logConfiguration.options in your task definition.
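Since this Fluent Bit runs on EKS with a classic configuration file rather than through a FireLens logConfiguration, those same two settings would go directly in the [OUTPUT] section. A minimal sketch based on the output block from the issue description (auto_retry_requests is the documented kinesis_firehose option, Retry_Limit the engine-level retry option):

    [OUTPUT]
        Name                kinesis_firehose
        Match               *
        delivery_stream     PUT-S3-sdVlo
        region              us-west-2
        # immediately retry failed network requests once
        auto_retry_requests true
        # allow up to 2 scheduled retries per chunk
        Retry_Limit         2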

@rajaie-sg
Author

Hi @matthewfala,

What log throughput roughly are you sending logs at, as in how many logs per second roughly?

I have more information in the original issue description about the log throughput:

[screenshot: the Application Details section from the issue description]

Does that answer your question or are you looking for another metric?

First, enable auto_retry_requests by adding to your logConfiguration.options

I am using the default value for that option, which is True according to the documentation here https://docs.fluentbit.io/manual/pipeline/outputs/firehose

Second, set retry_limit to 2 or 5 to guarantee that you aren't loosing data.

I will try that out. Btw I have Fluent Bit deployed on EKS, not Fargate.

Thanks for the help!

@matthewfala
Contributor

Ah yes. We did change the default setting for auto_retry_requests to true. It appears the network request is not immediately retried with auto_retry_requests because connection initialization (where DNS is resolved) currently does not get retried on failure, so you have to wait a little while for Fluent Bit to queue the data to send again.

54 KB/s doesn't seem large enough to cause saturation issues. DNS issues do come up from time to time and are probably fine to live with as long as they aren't so frequent that data begins building up.

@PettitWesley
Contributor

I am using the default value for that option, which is True according to the documentation here https://docs.fluentbit.io/manual/pipeline/outputs/firehose

@rajaie-sg @matthewfala It defaults to true in the 1.9 series, which AWS for Fluent Bit hasn't released yet. In 1.8, the default was false.

@matthewfala
Contributor

Thank you @PettitWesley. Then @rajaie-sg, setting auto_retry_requests to true may eliminate some of the broken pipe triggered full retries.

@rajaie-sg
Author

Ahh I see. OK, I will manually set it to true and report back.

@rajaie-sg
Author

Just an update: after enabling the auto_retry_requests option and setting retry_limit to 5, things are looking better.
[screenshot]
Each "failed to flush" now seems to be accompanied by a successful retry.

So in conclusion, increasing the Firehose stream quotas + configuring the retry limit options seems to have really improved things.

I am going to close this issue now, thank you for all the help!
