
Fluentbit Kinesis Firehose Throttled and Failing to send logs, broken connections #334

Closed
rajaie-sg opened this issue Apr 21, 2022 · 11 comments


@rajaie-sg

rajaie-sg commented Apr 21, 2022

Describe the question/issue

I am running Fluent Bit as a DaemonSet in EKS and have it configured to send logs to a Kinesis Firehose delivery stream using the kinesis_firehose plugin. The Fluent Bit logs are showing a lot of error and warning messages like the ones below:

[error] [output:kinesis_firehose:kinesis_firehose.0] Thoughput limits may have been exceeded, PUT-S3-sdVlo
[error] [output:kinesis_firehose:kinesis_firehose.0] PutRecordBatch request returned with no records successfully recieved, PUT-S3-sdVlo
[error] [http_client] broken connection to firehose.us-west-2.amazonaws.com:443 ?
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to PUT-S3-sdVlo
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[ warn] [engine] failed to flush chunk '1-1650551707.85849208.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=kinesis_firehose.0 (out_id=0)

In some cases the chunks that fail to flush are retried successfully; in other cases I see this error:

[ warn] [engine] chunk '1-1650551689.51293014.flb' cannot be retried: task_id=7, input=tail.0 > output=kinesis_firehose.0

I have followed the many GitHub issues that discuss the same problem and tried the fixes recommended in them, but I am still seeing the errors.

Configuration

Fluent Bit Configuration File

  fluent-bit.conf: |
    [SERVICE]
        Parsers_File parser.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*_production_mycontainer-*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     256MB
        DB.locking        true
        Rotate_Wait       30
        Docker_Mode       On
        Docker_Mode_Flush 10
        Skip_Long_Lines   On
        Refresh_Interval  10
    [FILTER]
        Name kubernetes
        Match kube.*
        Labels Off
        Annotations Off
    [OUTPUT]
        Name                         kinesis_firehose
        Match                        *
        delivery_stream              PUT-S3-sdVlo
        region                       us-west-2
  parser.conf: |
    [PARSER]
        Name   docker
        Format json

Fluent Bit Log Output

Logs from Fluent Bit container startup:

Fluent Bit v1.8.11
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/04/21 01:04:52] [ info] [engine] started (pid=1)
[2022/04/21 01:04:52] [ info] [storage] version=1.1.5, initializing...
[2022/04/21 01:04:52] [ info] [storage] in-memory
[2022/04/21 01:04:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/04/21 01:04:52] [ info] [cmetrics] version=0.2.2
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2022/04/21 01:04:52] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2022/04/21 01:04:52] [ info] [sp] stream processor started

Fluent Bit Version Info

Fluent Bit v1.8.11. I am using the Fluent Bit Docker image from this comment #288 (comment)

I also tried using 906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.23.3 but still ran into the issue.

There were some reports saying that the older version of Fluent Bit didn't have this issue, so I tried this version of the aws-for-fluent-bit Docker image https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.16.0 but still ran into the issue.

Other reports said that the older firehose Fluent Bit plugin (which has since been replaced by kinesis_firehose) doesn't have this issue, so I tried that but still ran into the problem.

Cluster Details

EKS Setup

Fluent Bit is deployed on my EKS cluster as a DaemonSet. The EKS cluster has about 30 nodes, but Fluent Bit is currently only monitoring logs from 3 pods across the cluster. This is due to the Path configuration I've set under the INPUT section (we only have 3 replicas of "mycontainer" running in the "production" namespace):

        Path              /var/log/containers/*_production_mycontainer-*.log

I thought it may have to do with load, so I deployed Fluent Bit as a Deployment with 1 replica, but I still saw the issues above.

IAM: The Fluent Bit DaemonSet pods use a Service Account that has an IAM role configured via the eks.amazonaws.com/role-arn annotation.

Kinesis Firehose Metrics
[screenshots: Kinesis Firehose delivery stream metrics]

One thing I don't fully understand is why there are "Throttled Records" despite the "Incoming XXXX" metrics being far below the throughput limits for the stream.

Kinesis Firehose Configuration
[screenshots: Kinesis Firehose delivery stream configuration]

Server-side encryption is enabled

Application Details

A single replica of our application logs approximately 266 lines per second. The total size of those 266 lines is approximately 54 KB. We have 3 replicas of our application, each running on a different node.

We are also using the kubernetes filter in Fluent Bit, so the extra metadata adds to the size of each log record that ultimately gets sent to Firehose. The log line from the application averages ~203 bytes, but with the Kubernetes metadata the final log record size is ~868 bytes, so across the 3 replicas that works out to roughly 3 × 266 × 868 bytes ≈ 0.7 MB/s sent to Firehose.

Steps to reproduce issue

I just deploy the Fluent Bit Daemonset and after a few minutes I start seeing the errors described above.

Related Issues

(The first two issues are most relevant.)

@PettitWesley
Contributor

[error] [output:kinesis_firehose:kinesis_firehose.0] PutRecordBatch request returned with no records successfully recieved, PUT-S3-sdVlo

This likely means the same thing as the throughput-exceeded error: the Firehose API can return partial success, or it can return a success response indicating that none of the records in the batch were put to the stream. That generally only happens when you are being throttled.

The first action when you get throttling errors should always be to submit a limit increase. I understand that the metrics from Firehose do not suggest throttling, but if Fluent Bit says it's being throttled, no setting change in Fluent Bit can help; you need to figure out how not to be throttled, and a limit increase is the best solution. Fluent Bit is designed to flush your data as quickly as possible, and you don't want to slow it down: slowing down your logging pipeline at the collection side may lead to log loss. The log destination needs to be able to scale to meet your log output rate. Please submit an AWS Support ticket to the Kinesis Firehose team if you need more help understanding the CloudWatch metrics they produce and why they don't match up with the throttling errors Fluent Bit is receiving.

The other thing I want to ask is: what is your final destination from Firehose? Is it S3? Can you share your full config so I can see how many outputs you have?

If you cannot get a limit increase, there may be other options you can consider, such as switching to direct upload to S3, or using KPL aggregation to compress the log data.
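For the direct-to-S3 option, a minimal sketch of what that output could look like with Fluent Bit's s3 plugin (the bucket name is a placeholder and the sizing values are only examples):

    [OUTPUT]
        Name             s3
        Match            *
        # placeholder bucket name
        bucket           my-log-bucket
        region           us-west-2
        # buffer up to 50 MB or 10 minutes of data before each upload
        total_file_size  50M
        upload_timeout   10m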

@PettitWesley PettitWesley changed the title Bug - Fluentbit Kinesis Firehose Failing to send logs, broken connections Fluentbit Kinesis Firehose Throttled and Failing to send logs, broken connections Apr 25, 2022
@rajaie-sg
Author

rajaie-sg commented Apr 25, 2022

Hi @PettitWesley, thank you for your reply.

The first action when you get throttling errors should always be to submit a limit increase.

I've submitted a request to increase the Delivery Stream Throughput (MiB/Sec) limit on the stream. As I understood from here, that should increase the other limits as well ("The three quota scale proportionally.").

what is your final destination from Firehose?

Yes, S3 is the one and only destination we have from our Firehose stream. The delivery configuration is included in the screenshots I shared in my original post.

If you can not get a limit increase, there may be other options that you can consider, such as switching to direct upload to S3

I think the limit increase request should go through, so hopefully that takes care of it. I was looking into direct S3 uploads at first, but this article suggested that using Firehose streams is more reliable than sending directly to S3 (under the "Fluent Bit support for Amazon Kinesis Data Firehose" section).

If the limit increase doesn't give any good results, I will try out the Kinesis Streams or the older Firehose plugin with aggregation.

Regarding the Docker image, is this #288 (comment) the correct one I should be using until a new official version of aws-for-fluent-bit is released with the patches?

@matthewfala
Contributor

Hi @rajaie-sg,
The priority event loop patch I wrote was for very high throughput use cases, around 50-100 Mb/s to Firehose via Fluent Bit, and as you mentioned it is not yet included in the official aws-for-fluent-bit versions. If the errors are due to Firehose delivery limits, then you probably do not need the patch.

The patch will be included in the official aws-for-fluent-bit soon when we upgrade from 1.8 to the 1.9 Fluent Bit series.

@rajaie-sg
Author

rajaie-sg commented Apr 26, 2022

Thanks for the clarification!

After increasing the limits for the Kinesis Firehose stream, the number of warnings/errors has gone down significantly. I'm also seeing 0 throttled records in the Firehose metrics.

[screenshots: Firehose metrics after the limit increase]

You can see the drop-off happen at 8pm EDT / 12am UTC, when the limit increase came into effect.

I am, however, still seeing occasional errors like the ones below:

Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit warn [2022/04/26 14:38:22] [ warn] [net] getaddrinfo(host='firehose.us-west-2.amazonaws.com', err=12): Timeout while contacting DNS servers
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [aws_client] connection initialization error
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to PUT-S3-sdVlo
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit error [2022/04/26 14:38:22] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
Apr 26 10:38:22 fluentbit-cr5db aws-for-fluent-bit warn [2022/04/26 14:38:22] [ warn] [engine] failed to flush chunk '1-1650983890.224982081.flb', retry in 6 seconds: task_id=7, input=tail.0 > output=kinesis_firehose.0

From the logs, it seems that every "Failed to send log records" error is preceded by either a getaddrinfo(host='firehose.us-west-2.amazonaws.com', err=12): Timeout while contacting DNS servers message or a broken connection to firehose.us-west-2.amazonaws.com:443 message. The DNS timeout is much more common, though.

DNS server issues graphed alongside "send log failure":
[screenshot]

Broken connection to firehose graphed alongside "send log failure":
[screenshot]

Any ideas on what may be going on or what steps I can take to resolve those errors?

@matthewfala
Contributor

@rajaie-sg,
Roughly what log throughput are you sending at, as in how many logs per second?

Logs should be retried automatically, so these errors most likely do not mean log loss.

There are two things you could try to make things more stable:
First, enable auto_retry_requests by adding "auto_retry_requests": "true" to logConfiguration.options in your task definition, which will have Fluent Bit quickly retry network requests that fail.

Second, set retry_limit to 2 or 5 to make sure you aren't losing data; the default is 1. You can set "retry_limit": "2" in logConfiguration.options in your task definition.
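Since this Fluent Bit runs on EKS with a classic configuration file rather than through a FireLens logConfiguration, those same two settings would go directly in the [OUTPUT] section. A minimal sketch based on the output block from the issue description (auto_retry_requests is the documented kinesis_firehose option, Retry_Limit the engine-level retry option):

    [OUTPUT]
        Name                kinesis_firehose
        Match               *
        delivery_stream     PUT-S3-sdVlo
        region              us-west-2
        # immediately retry failed network requests once
        auto_retry_requests true
        # allow up to 2 scheduled retries per chunk
        Retry_Limit         2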

@rajaie-sg
Author

Hi @matthewfala,

What log throughput roughly are you sending logs at, as in how many logs per second roughly?

I have more information in the original issue description about the log throughput:

[screenshot: the Application Details section from the issue description]

Does that answer your question or are you looking for another metric?

First, enable auto_retry_requests by adding to your logConfiguration.options

I am using the default value for that option, which is True according to the documentation here https://docs.fluentbit.io/manual/pipeline/outputs/firehose

Second, set retry_limit to 2 or 5 to guarantee that you aren't loosing data.

I will try that out. Btw I have Fluent Bit deployed on EKS, not Fargate.

Thanks for the help!

@matthewfala
Contributor

Ah yes. We did change the default setting for auto_retry_requests to true. It appears the network request is not immediately retried with auto_retry_requests because connection initialization (where DNS is resolved) currently does not get retried on failure, so you have to wait a little while for Fluent Bit to queue the data to send again.

54 KB/s doesn't seem large enough to cause saturation issues. DNS issues do come up from time to time and are probably fine to live with as long as they aren't so frequent that data begins building up.

@PettitWesley
Contributor

I am using the default value for that option, which is True according to the documentation here https://docs.fluentbit.io/manual/pipeline/outputs/firehose

@rajaie-sg @matthewfala It defaults to true in the 1.9 series, which AWS for Fluent Bit hasn't released yet. In 1.8, the default was false.

@matthewfala
Contributor

Thank you @PettitWesley. Then @rajaie-sg, setting auto_retry_requests to true may eliminate some of the broken pipe triggered full retries.

@rajaie-sg
Author

Ahh I see. OK, I will manually set it to true and report back.

@rajaie-sg
Author

Just an update: after enabling the auto_retry_requests option and setting retry_limit to 5, things are looking better.
[screenshot]
Each "failed to flush" now seems to be accompanied by a successful retry.

So in conclusion, increasing the Firehose stream quotas + configuring the retry limit options seems to have really improved things.

I am going to close this issue now, thank you for all the help!
