
Async Network Connection issues in Fluent Bit 1.8.x in kinesis_firehose, kinesis_streams, s3, and es outputs #288

Closed
PettitWesley opened this issue Jan 25, 2022 · 24 comments
Labels: upstream-1.8.x (Issues associated with the upstream 1.8.x series)

@PettitWesley
Contributor

PettitWesley commented Jan 25, 2022

Upstream issue: fluent/fluent-bit#4332

NOTE: if you are using cloudwatch_logs, please see: #293

We have multiple reports of increased rates of sporadic network connection issues in ALL of the latest AWS releases:

[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/12/06 23:19:19] [error] [aws_client] connection initialization error

Currently, there are no reports that this is a high severity or blocking bug because these errors only happen occasionally and Fluent Bit can retry and usually succeed.

We are currently evaluating a fix but do not have an ETA for release yet.

In the meantime, consider setting auto_retry_requests to true and reviewing the network connection troubleshooting guide: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
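For reference, a minimal sketch of that setting on a kinesis_streams output (the region and stream name are placeholders, not part of this report):

[OUTPUT]
    Name                kinesis_streams
    Match               *
    region              us-west-2
    stream              my-example-stream
    # Retry a failed request immediately once before falling back to the
    # scheduler's delayed retry.
    auto_retry_requests true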

@PettitWesley PettitWesley added the upstream-1.8.x Issues associated with the upstream 1.8.x series label Jan 25, 2022
@arvin4u123

We are seeing this issue a lot (a few times per minute), so I'd appreciate it if you could prioritize this.

2022-01-25T12:45:15.199-08:00 [2022/01/25 20:45:15] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send records to kinesis
2022-01-25T12:45:15.199-08:00 [2022/01/25 20:45:15] [ warn] [engine] failed to flush chunk '1-1643143514.315053223.flb', retry in 11 seconds: task_id=0, input=forward.1 > output=kinesis_streams.1 (out_id=1)
2022-01-25T12:45:35.183-08:00 [2022/01/25 20:45:35] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?

@matthewfala
Contributor

Hi @arvin4u123,
This is a problem mainly for high throughput use cases, which you probably have.

If you would like to test the patch which may resolve your issues, please use the following image: 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

If you try out the patch, please let me know if problems are resolved.

There may be some instabilities with the patch, since it is still in testing. If you find any, please let me know and potentially post the debug logs so it can be resolved.

If you would like to explore the patch or build fluent bit on your own, please see the following branch and commit: matthewfala/fluent-bit@f238eee

@arvin4u123

Thank you @matthewfala. Is this a public repo? I am getting the error below while pulling the image:

arsharma @ ~/> $ docker pull 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

Error response from daemon: Head "https://826489191740.dkr.ecr.us-west-2.amazonaws.com/v2/amazon/aws-for-fluent-bit/manifests/1.8.11-1.8-event-loop-priority-r2": no basic auth credentials

@matthewfala
Contributor

It's a private AWS ECR repository with public pulls enabled, so docker pull doesn't work. You should be able to use the image in your task definition, though. Are you using AWS ECS to run your project?

@arvin4u123

Thank you @matthewfala. I was trying that to make sure we don't run into any permissions issues. Anyway, let me try putting it in our task definition and I'll let you know.

@arvin4u123

We just tried this image @matthewfala, but it's still the same thing. These error logs start coming up as soon as Fluent Bit starts. Also, I don't think our log volume is high enough to be causing this issue.

2022-01-26T13:50:19.867-08:00 Fluent Bit v1.8.11
2022-01-26T13:50:19.867-08:00 * Copyright (C) 2019-2021 The Fluent Bit Authors
2022-01-26T13:50:19.867-08:00 * Copyright (C) 2015-2018 Treasure Data
2022-01-26T13:50:19.867-08:00 * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
2022-01-26T13:50:19.868-08:00 * https://fluentbit.io
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [engine] started (pid=1)
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] version=1.1.5, initializing...
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] in-memory
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [cmetrics] version=0.2.2
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
2022-01-26T13:50:19.876-08:00 [2022/01/26 21:50:19] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
2022-01-26T13:50:20.049-08:00 [2022/01/26 21:50:20] [ info] [sp] stream processor started
2022-01-26T13:50:42.564-08:00 [2022/01/26 21:50:42] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send log records to dev-01-ma-api-stream
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send log records
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [error] [output:kinesis_streams:kinesis_streams.1] Failed to send records to kinesis
2022-01-26T13:50:42.565-08:00 [2022/01/26 21:50:42] [ warn] [engine] failed to flush chunk '1-1643233842.348676402.flb', retry in 10 seconds: task_id=0, input=forward.1 > output=kinesis_streams.1 (out_id=1)
2022-01-26T13:52:31.563-08:00 [2022/01/26 21:52:31] [error] [http_client] broken connection to kinesis.us-west-2.amazonaws.com:443 ?

@matthewfala
Contributor

Thank you for trying.
Hmm, since I don't see any timeout errors, it could be this other issue.

Could you try adding the following to your output plugin's configuration?

[OUTPUT]
    Name              <your plugin>
    # Add the following line
    net.keepalive              off

This may decrease network performance slightly, but I'm curious whether it resolves your issues. It makes a new network connection to AWS each time a log batch is sent, rather than reusing connections.

If that works, then to increase performance you could remove the net.keepalive setting and instead add
net.keepalive_max_recycle <10 or some other number; try to find the largest value that works without errors>
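As an illustration only (plugin, stream, and the recycle count are placeholders you would tune), the keepalive suggestion above could look like:

[OUTPUT]
    Name                      kinesis_streams
    Match                     *
    region                    us-west-2
    stream                    my-example-stream
    # Keep connection reuse enabled, but close and re-open each keepalive
    # connection after it has been recycled this many times.
    net.keepalive             on
    net.keepalive_max_recycle 10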

@arvin4u123

Sure, let me try that, but to confirm: do we need to build a custom image for this?

@PettitWesley
Contributor Author

@arvin4u123 it depends on how you are specifying your output config with FireLens; you may need to follow this: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/config-file-type-file

Just start with Matt's image as the base in your custom Dockerfile.
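As a sketch, a custom Dockerfile following that pattern could be as small as the lines below (extra.conf and its path are placeholders taken from the linked FireLens example, not something prescribed in this thread):

FROM 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2
# Bundle the custom [OUTPUT] config so FireLens can load it with
# config-file-type "file" and config-file-value "/extra.conf".
COPY extra.conf /extra.conf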

@PettitWesley
Contributor Author

@arvin4u123 You can pull Matt's image locally with the ECS CLI:

ecs-cli pull --region us-west-2 --registry-id 826489191740 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

@matthewfala
Contributor

@arvin4u123,
I'm still trying to find out what portion of the broken pipe errors is solved by the patch and what portion by the keepalive settings. You may see timeout errors if you use the official image, though maybe not? I recommend trying both if possible. If you can get by with the official image and adjusted keepalive settings, I would just go with that.

@arvin4u123

Thank you @PettitWesley @matthewfala for your replies. We are using the official AWS image with the latest tag. We don't see the TIMEOUT error in our setup; we see the broken connection to kinesis.us-west-2.amazonaws.com:443 ? error. We just tried the 2.16.1 AWS image, which ships Fluent Bit 1.7.9, and that seems to have resolved the issue. We will monitor it more and report back.

@arvin4u123

Looks like even the older image doesn't hold up for long. We are getting the failure messages with it as well:

[2022/01/30 17:24:50] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send log records
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send records to kinesis

Also, I see that the Fluent Bit container becomes unhealthy very often (we have CMD-SHELL echo '{"health": "check"}' | nc 127.0.0.1 8877 || exit 1 as its health check), which kills the task.

One more thing I want to confirm: are "broken connection to kinesis.us-west-2.amazonaws.com:443" (what I am getting) and "timeout" (what you are talking about) the same error?

@PettitWesley PettitWesley changed the title Network Connection issues in Fluent Bit 1.8.x Async Network Connection issues in Fluent Bit 1.8.x affects {kinesis_firehose, kinesis_streams, s3, and es} Feb 3, 2022
@PettitWesley PettitWesley changed the title Async Network Connection issues in Fluent Bit 1.8.x affects {kinesis_firehose, kinesis_streams, s3, and es} Async Network Connection issues in Fluent Bit 1.8.x in kinesis_firehose, kinesis_streams, s3, and es outputs Feb 3, 2022
@matthewfala
Contributor

We confirmed that reordering plugin processing to minimize the delay between a network call completing and its response being processed helps mitigate broken pipe and connection errors, especially in high-throughput scenarios.

The priority queue solution proposed above was implemented and merged into master. It will be featured in Fluent Bit 1.9.0.

fluent/fluent-bit#4869

Hopefully this helps to resolve some of these broken connection and timeout errors.

The solution is only meant to help kinesis, firehose, s3, and es networking, not cloudwatch networking, because it only resolves issues with async networking and cloudwatch currently relies on sync networking.

We have plans to migrate cloudwatch to async networking which may resolve some unrelated cloudwatch networking issues.

@matthewfala
Contributor

The solution is now featured in Fluent Bit version 1.9.

The solution prioritizes internal Fluent Bit events so that already started tasks are completed before new tasks are started. This keeps delays minimal by reducing the amount of concurrent work.

Closing this issue for now. Please reopen if async network connection issues are coming up on firehose, streams, s3 or es.

@scaleteam

Thank you @matthewfala for the update. Do you have any ETA for a new AWS image with 1.9? Also, does it include the same patch as your internal image, i.e. 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2? If so, then as mentioned earlier, that image did not hold up, but I'll be more than happy to test again with the latest official image containing the fix above.

@matthewfala
Contributor

Hi @scaleteam, we're waiting until we feel comfortable with the stability of this new major version to create an official aws-for-fluent-bit image.

A few minor bug fixes were added to the patch. What do you mean by "did not hold up": segfaults, OOMs, network issues?

@scaleteam

Thank you @matthewfala for your reply.
Per my earlier comment, 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2 was giving the errors below:

[2022/01/30 17:24:50] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send log records
[2022/01/30 17:24:50] [error] [output:kinesis_streams:kinesis_streams.2] Failed to send records to kinesis

@matthewfala
Contributor

Some errors are still expected due to actual broken pipe issues; however, the image is supposed to reduce the number of errors heavily. In some high-throughput cases we see the broken pipe errors reduced by 96-100%.

@scaleteam

Thank you @matthewfala. What approximate throughput is considered high throughput? Also, even if we get those messages, will the records be retried, given that we have Retry_Limit set?

@matthewfala
Contributor

40 MB/s of 1 KB logs. Yes, messages will be retried, I believe with an exponential backoff mechanism. If you set auto_retry_requests to true, then failed requests are also retried immediately, so the delayed retry can often be avoided.

@matthewfala
Contributor

You can use the metrics endpoint to monitor how many records were retried, failed, and succeeded:

$ curl -s http://127.0.0.1:2020/api/v1/metrics | jq
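For illustration, the per-output counters in that response look roughly like the snippet below (the values are invented and field names may vary slightly between Fluent Bit versions):

{
  "output": {
    "kinesis_streams.1": {
      "proc_records": 120000,
      "proc_bytes": 118000000,
      "errors": 4,
      "retries": 4,
      "retries_failed": 0
    }
  }
}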

@scaleteam

I don't see a reference to auto_retry_requests in https://docs.fluentbit.io/manual/administration/scheduling-and-retries. Also, what do you mean by 1 KB?

@matthewfala
Contributor

Retry_Limit is available on every output plugin. Please see https://docs.fluentbit.io/manual/administration/scheduling-and-retries#configuring-retries
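As a hedged example (placeholders as before), Retry_Limit sits alongside the other settings in the same [OUTPUT] section:

[OUTPUT]
    Name        kinesis_streams
    Match       *
    region      us-west-2
    stream      my-example-stream
    # Re-schedule a failed chunk up to 5 times before it is dropped;
    # "False" means no limit on retries.
    Retry_Limit 5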

Also, I meant that we sent 40,000 logs of 1,000 characters each per second (roughly 40 MB/s).
