Async Network Connection issues in Fluent Bit 1.8.x in kinesis_firehose, kinesis_streams, s3, and es outputs #288
Comments
We are seeing this issue a lot (a few times a minute), so I'd appreciate it if you can prioritize this.
Hi @arvin4u123, if you would like to test the patch which may resolve your issues, please use the following image: If you try out the patch, please let me know if problems are resolved. There may be some instabilities with the patch, since it is still in testing. If you find any, please let me know and potentially post the debug logs so it can be resolved. If you would like to explore the patch or build Fluent Bit on your own, please see the following branch and commit: matthewfala/fluent-bit@f238eee
Thank you @matthewfala. Is this a public repo? I am getting the below error while pulling the image: `arsharma @ ~/> $ docker pull 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2 Error response from daemon: Head "https://826489191740.dkr.ecr.us-west-2.amazonaws.com/v2/amazon/aws-for-fluent-bit/manifests/1.8.11-1.8-event-loop-priority-r2": no basic auth credentials`
It's a private AWS ECR repo with public pulls enabled; docker pull doesn't work. You should be able to use the image in your task definition though. Are you using AWS ECS to run your project?
Thank you @matthewfala. I was trying that to make sure we don't run into any permissions issue. Anyway, let me try putting that in our task definition and I'll let you know.
We just tried this image @matthewfala, but it's still the same thing. I see that these error logs start coming up as soon as it starts. Also, I don't think we have a lot of logs that could be causing this issue. 2022-01-26T13:50:19.867-08:00 Fluent Bit v1.8.11
Thank you for trying. Could you try adding the following to your cloudwatch plugin's output?
This may decrease network performance slightly, but I'm curious whether it will resolve your issues. It will make a new network connection to AWS each time a log batch is sent, rather than reusing connections. If that works, then to increase performance, maybe try removing the net.keepalive configuration and adding
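Judging by the later mention of removing the net.keepalive configuration, the suggested setting was most likely `net.keepalive Off`. A minimal sketch of a cloudwatch output with that option, assuming placeholder region, log group, and match values that are not from this thread:

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-west-2
    log_group_name    my-app-logs
    log_stream_prefix fluent-bit-
    auto_create_group true
    # Open a fresh connection to AWS for every flush instead of
    # reusing pooled connections
    net.keepalive     Off
```

Disabling keepalive trades some throughput for fewer reused (and potentially stale) connections, which matches the trade-off described above.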
Sure, let me try that, but just confirming: do we need to build a custom image for this?
@arvin4u123 It depends on how you are specifying your output config with FireLens; you may need to follow this: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/config-file-type-file. Just start with Matt's image as your base in your custom Dockerfile.
@arvin4u123 You can pull Matt's image locally with the ECS CLI:
@arvin4u123, |
Thank you @PettitWesley @matthewfala for your reply. We are using the official AWS image with the latest tag. We don't see the TIMEOUT error in our setup; we see
Looks like even the older image doesn't hold up for long. We are getting the failed messages even with that: [2022/01/30 17:24:50] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe. Also, I see that the Fluent Bit container becomes unhealthy very often (we have `CMD-SHELL echo '{"health": "check"}' | nc 127.0.0.1 8877 || exit 1` as the health check for it), which kills the task. One more thing I want to make sure of: are "broken connection to kinesis.us-west-2.amazonaws.com:443" (this is what I am getting) and "timeout" (the one you guys are talking about) the same?
We confirmed that changing the order in which plugin code is processed, so that the delay between network call completion and processing of the network response is minimized, helps mitigate broken pipe and connection errors, especially in high-throughput scenarios. The priority queue solution proposed above was implemented and merged into master; it will be featured in Fluent Bit 1.9.0. Hopefully this helps resolve some of these broken connection and timeout errors. The solution is only meant to help kinesis, firehose, s3, and es networking, not cloudwatch networking, because it only resolves issues with async networking while cloudwatch currently relies on sync networking. We have plans to migrate cloudwatch to async networking, which may resolve some unrelated cloudwatch networking issues.
The solution is now featured in Fluent Bit version 1.9. It prioritizes internal Fluent Bit events that complete already-started tasks over starting new tasks, which helps keep delay times minimal by reducing the amount of concurrent work. Closing this issue for now. Please reopen if async network connection issues come up on firehose, streams, s3, or es.
Thank you @matthewfala for your update. I wonder if you have any ETA for a new AWS image with
Hi @scaleteam, we're waiting until we feel comfortable with the stability of this new major version to create an official aws-for-fluent-bit image. A few minor bug fixes were added to the patch. What do you mean by "not hold up"? Segfault, OOM, network issues?
Thank you @matthewfala for your reply.
Some errors are still expected due to actual broken pipe issues; however, the image is supposed to heavily reduce the number of errors. In some high-throughput cases we see the broken pipe errors being reduced by 96-100%.
Thank you @matthewfala. What is the approximate throughput that is considered high throughput? Also, even if we get those messages, will they be retried, considering we have set the
40 MB/s of 1 KB logs. Yes, messages will be retried, I think with some exponential backoff mechanism. If you set
You can use the metrics endpoint to monitor how many logs were retried, failed, and succeeded.
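For reference, the metrics endpoint mentioned here is served by Fluent Bit's built-in HTTP server, which has to be enabled in the service section; a minimal sketch (2020 is Fluent Bit's default monitoring port):

```
[SERVICE]
    # Expose /api/v1/metrics (and /api/v1/metrics/prometheus), which
    # report per-output retry, error, and success counters
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020
```

The per-output counters there (e.g. retries and errors) are what to watch to confirm retried batches eventually succeed.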
I don't see a reference to
Retry_Limit can be set for every output plugin. Please see https://docs.fluentbit.io/manual/administration/scheduling-and-retries#configuring-retries. Also, I mean we sent 40,000 1000-character logs per second.
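As a hedged sketch of the per-output retry setting being discussed (the stream name, region, and limit value below are placeholders, not taken from this thread):

```
[OUTPUT]
    Name         kinesis_streams
    Match        *
    region       us-west-2
    stream       my-log-stream
    # Re-queue a failed chunk up to 5 times with the scheduler's
    # exponential backoff; set to False for unlimited retries
    Retry_Limit  5
```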
Upstream issue: fluent/fluent-bit#4332
NOTE: if you are using cloudwatch_logs, please see: #293
We have multiple reports of increased rates of sporadic network connection issues in ALL of the latest AWS releases:
Currently, there are no reports that this is a high-severity or blocking bug, because these errors only happen occasionally and Fluent Bit can retry and usually succeeds.
We are currently evaluating a fix but do not have an ETA for release yet.
In the meantime, consider setting `auto_retry_requests` to true and checking out this: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
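As a rough sketch of where `auto_retry_requests` goes, assuming a kinesis_firehose output with placeholder stream and region values:

```
[OUTPUT]
    Name                 kinesis_firehose
    Match                *
    region               us-west-2
    delivery_stream      my-delivery-stream
    # Immediately retry a request once inside the plugin when the
    # connection drops, before falling back to normal scheduler retries
    auto_retry_requests  true
```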