
Async Network Connection issues in Fluent Bit 1.8.x in kinesis_firehose, kinesis_streams, s3, and es outputs #288

Closed
PettitWesley opened this issue Jan 25, 2022 · 24 comments
Labels: upstream-1.8.x (Issues associated with the upstream 1.8.x series)

@PettitWesley
Contributor

PettitWesley commented Jan 25, 2022

Upstream issue: fluent/fluent-bit#4332

NOTE: if you are using cloudwatch_logs, please see: #293

We have multiple reports of increased rates of sporadic network connection issues in ALL of the latest AWS releases:

[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/12/06 23:19:19] [error] [aws_client] connection initialization error

Currently, there are no reports that this is a high severity or blocking bug because these errors only happen occasionally and Fluent Bit can retry and usually succeed.

We are currently evaluating a fix but do not have an ETA for release yet.

In the meantime, consider setting auto_retry_requests to true and reviewing the network connection troubleshooting guide: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
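For reference, a minimal sketch of that setting on a kinesis_streams output (the region and stream name are placeholders, not part of this report):

[OUTPUT]
    Name                kinesis_streams
    Match               *
    region              us-west-2
    stream              my-example-stream
    # Retry a failed request immediately once before falling back to the
    # scheduler's delayed retry.
    auto_retry_requests true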

@PettitWesley PettitWesley added the upstream-1.8.x Issues associated with the upstream 1.8.x series label Jan 25, 2022
@arvin4u123

We are seeing this issue a lot (a few times per minute), so I'd appreciate it if you could prioritize this.

2022-01-25T12:45:15.199-08:00 [2022/01/25 20:45:15] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send records to kinesis
2022-01-25T12:45:15.199-08:00 [2022/01/25 20:45:15] [ warn] [engine] failed to flush chunk '1-1643143514.315053223.flb', retry in 11 seconds: task_id=0, input=forward.1 > output=kinesis_streams.1 (out_id=1)
2022-01-25T12:45:35.183-08:00 [2022/01/25 20:45:35] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?

@matthewfala
Contributor

Hi @arvin4u123,
This is a problem mainly for high throughput use cases, which you probably have.

If you would like to test the patch which may resolve your issues, please use the following image: 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

If you try out the patch, please let me know if problems are resolved.

There may be some instabilities with the patch, since it is still in testing. If you find any, please let me know and potentially post the debug logs so it can be resolved.

If you would like to explore the patch or build fluent bit on your own, please see the following branch and commit: matthewfala/fluent-bit@f238eee

@arvin4u123

Thank you @matthewfala. Is this a public repo? I am getting the error below while pulling the image:

arsharma @ ~/> $ docker pull 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

Error response from daemon: Head "https://826489191740.dkr.ecr.us-west-2.amazonaws.com/v2/amazon/aws-for-fluent-bit/manifests/1.8.11-1.8-event-loop-priority-r2": no basic auth credentials

@matthewfala
Contributor

It's a private AWS ECR repository with public pulls enabled, so docker pull doesn't work. You should be able to use the image in your task definition, though. Are you using AWS ECS to run your project?

@arvin4u123

Thank you @matthewfala. I was trying that to make sure we don't run into any permissions issues. Anyway, let me try putting it in our task definition and I'll let you know.

@arvin4u123

We just tried this image @matthewfala, but it's still the same thing. These error logs start coming up as soon as Fluent Bit starts. Also, I don't think our log volume is high enough to be causing this issue.

2022-01-26T13:50:19.867-08:00 Fluent Bit v1.8.11
2022-01-26T13:50:19.867-08:00 * Copyright (C) 2019-2021 The Fluent Bit Authors
2022-01-26T13:50:19.867-08:00 * Copyright (C) 2015-2018 Treasure Data
2022-01-26T13:50:19.867-08:00 * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
2022-01-26T13:50:19.868-08:00 * https://fluentbit.io
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [engine] started (pid=1)
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] version=1.1.5, initializing...
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] in-memory
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [cmetrics] version=0.2.2
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
2022-01-26T13:50:20.049-08:00 [2022/01/26 21:50:20] [ info] [sp] stream processor started
2022-01-26T13:50:42.564-08:00 [2022/01/26 21:50:42] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send log records to dev-01-ma-api-stream
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send log records
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send records to kinesis
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [ warn] [engine] failed to flush chunk '1-1643233842.348676402.flb', retry in 10 seconds: task_id=0, input=forward.1 > output=kinesis_streams.1 (out_id=1)
2022-01-26T13:52:31.563-08:00 [2022/01/26 21:52:31] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?

@matthewfala
Contributor

Thank you for trying.
Hmm, since I don't see any timeout errors, it could be this other issue.

Could you try adding the following to your output plugin's configuration?

[OUTPUT]
    Name              <your plugin>
    # Add the following line
    net.keepalive              off

This may decrease network performance slightly, but I'm curious whether it resolves your issues. It makes a new network connection to AWS each time a log batch is sent, rather than reusing connections.

If that works, then to increase performance you could remove the net.keepalive setting and instead add
net.keepalive_max_recycle <10 or some other number; try to find the largest value that works without errors>
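As an illustration only (plugin, stream, and the recycle count are placeholders you would tune), the keepalive suggestion above could look like:

[OUTPUT]
    Name                      kinesis_streams
    Match                     *
    region                    us-west-2
    stream                    my-example-stream
    # Keep connection reuse enabled, but close and re-open each keepalive
    # connection after it has been recycled this many times.
    net.keepalive             on
    net.keepalive_max_recycle 10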

@arvin4u123

Sure, let me try that, but to confirm: do we need to build a custom image for this?

@PettitWesley
Contributor Author

@arvin4u123 it depends on how you are specifying your output config with FireLens; you may need to follow this: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/config-file-type-file

Just start with Matt's image as the base in your custom Dockerfile.
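As a sketch, a custom Dockerfile following that pattern could be as small as the lines below (extra.conf and its path are placeholders taken from the linked FireLens example, not something prescribed in this thread):

FROM 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2
# Bundle the custom [OUTPUT] config so FireLens can load it with
# config-file-type "file" and config-file-value "/extra.conf".
COPY extra.conf /extra.conf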

@PettitWesley
Contributor Author

@arvin4u123 You can pull Matt's image locally with the ECS CLI:

ecs-cli pull --region us-west-2 --registry-id 826489191740 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

@matthewfala
Contributor

@arvin4u123,
I'm still trying to find out what portion of the broken pipe errors is solved by the patch and what portion by the keepalive settings. You may see timeout errors if you use the official image, though maybe not? I recommend trying both if possible. If you can get by with the official image and adjusted keepalive settings, I would just go with that.

@arvin4u123

Thank you @PettitWesley @matthewfala for your replies. We are using the official AWS image with the latest tag. We don't see the TIMEOUT error in our setup; we see the broken connection to kinesis.us-west-2.amazonaws.com:443 ? error. We just tried the 2.16.1 AWS image, which ships Fluent Bit 1.7.9, and that seems to have resolved the issue. We will monitor it more and report back.

@arvin4u123

Looks like even the older image doesn't hold up for long. We are getting the failure messages with it as well:

[2022/01/30 17:24:50] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send log records
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send records to kinesis

Also, I see that the Fluent Bit container becomes unhealthy very often (we have CMD-SHELL echo '{"health": "check"}' | nc 127.0.0.1 8877 || exit 1 as its health check), which kills the task.

One more thing I want to confirm: are "broken connection to kinesis.us-west-2.amazonaws.com:443" (what I am getting) and "timeout" (what you are talking about) the same error?

@PettitWesley PettitWesley changed the title Network Connection issues in Fluent Bit 1.8.x Async Network Connection issues in Fluent Bit 1.8.x affects {kinesis_firehose, kinesis_streams, s3, and es} Feb 3, 2022
@PettitWesley PettitWesley changed the title Async Network Connection issues in Fluent Bit 1.8.x affects {kinesis_firehose, kinesis_streams, s3, and es} Async Network Connection issues in Fluent Bit 1.8.x in kinesis_firehose, kinesis_streams, s3, and es outputs Feb 3, 2022
@matthewfala
Contributor

We confirmed that reordering plugin processing to minimize the delay between a network call completing and its response being processed helps mitigate broken pipe and connection errors, especially in high-throughput scenarios.

The priority queue solution proposed above was implemented and merged into master. It will be featured in Fluent Bit 1.9.0.

fluent/fluent-bit#4869

Hopefully this helps to resolve some of these broken connection and timeout errors.

The solution is only meant to help kinesis, firehose, s3, and es networking, not cloudwatch networking, because it only resolves issues with async networking and cloudwatch currently relies on sync networking.

We have plans to migrate cloudwatch to async networking which may resolve some unrelated cloudwatch networking issues.

@matthewfala
Contributor

The solution is now featured in Fluent Bit version 1.9.

The solution prioritizes internal Fluent Bit events so that already started tasks are completed before new tasks are started. This keeps delays minimal by reducing the amount of concurrent work.

Closing this issue for now. Please reopen if async network connection issues are coming up on firehose, streams, s3 or es.

@scaleteam

Thank you @matthewfala for the update. Do you have any ETA for a new AWS image with 1.9? Also, does it include the same patch as your internal image, i.e. 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2? If so, then as mentioned earlier, that image did not hold up, but I'll be more than happy to test again with the latest official image containing the fix above.

@matthewfala
Contributor

Hi @scaleteam, we're waiting until we feel comfortable with the stability of this new major version to create an official aws-for-fluent-bit image.

A few minor bug fixes were added to the patch. What do you mean by "did not hold up": segfaults, OOMs, network issues?

@scaleteam

Thank you @matthewfala for your reply.
Per my earlier comment, 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2 was giving the errors below:

[2022/01/30 17:24:50] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send log records
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send records to kinesis

@matthewfala
Contributor

Some errors are still expected due to actual broken pipe issues; however, the image is supposed to reduce the number of errors heavily. In some high-throughput cases we see the broken pipe errors reduced by 96-100%.

@scaleteam

Thank you @matthewfala. What approximate throughput is considered high throughput? Also, even if we get those messages, will the records be retried, given that we have Retry_Limit set?

@matthewfala
Contributor

40 MB/s of 1 KB logs. Yes, messages will be retried, I believe with an exponential backoff mechanism. If you set auto_retry_requests to true, then failed requests are also retried immediately, so the delayed retry can often be avoided.

@matthewfala
Contributor

You can use the metrics endpoint to monitor how many records were retried, failed, and succeeded:

$ curl -s http://127.0.0.1:2020/api/v1/metrics | jq
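For illustration, the per-output counters in that response look roughly like the snippet below (the values are invented and field names may vary slightly between Fluent Bit versions):

{
  "output": {
    "kinesis_streams.1": {
      "proc_records": 120000,
      "proc_bytes": 118000000,
      "errors": 4,
      "retries": 4,
      "retries_failed": 0
    }
  }
}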

@scaleteam

I don't see a reference to auto_retry_requests in https://docs.fluentbit.io/manual/administration/scheduling-and-retries. Also, what do you mean by 1 KB?

@matthewfala
Contributor

Retry_Limit is available on every output plugin. Please see https://docs.fluentbit.io/manual/administration/scheduling-and-retries#configuring-retries
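As a hedged example (placeholders as before), Retry_Limit sits alongside the other settings in the same [OUTPUT] section:

[OUTPUT]
    Name        kinesis_streams
    Match       *
    region      us-west-2
    stream      my-example-stream
    # Re-schedule a failed chunk up to 5 times before it is dropped;
    # "False" means no limit on retries.
    Retry_Limit 5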

Also, I meant that we sent 40,000 logs of 1,000 characters each per second (roughly 40 MB/s).
