kinesis_firehose: Crashing, log loss/duplication #3917
Comments
FYI @PettitWesley
@pranavmarla This is very surprising. Can you also please open an issue referencing this one at the AWS repo? That is more or less a requirement to get AWS engineers to look into an issue: https://github.com/aws/aws-for-fluent-bit
Note: the network backend was updated in v1.7.9, and the DNS backend was changed starting from v1.7.6.
Thanks for checking in @PettitWesley! I re-ran the test with the core Fluent Bit Docker image (v
@edsiper These crashes look like they might be core networking issues... do any of these reports look like other issues you've seen in other plugins?
Hello, a similar error occurred with
Configuration:
This problem is getting interesting. I tested with the exact same setup as described in this issue. I also got a similar warning message. However, Fluent Bit was able to send logs successfully and did not crash. I kept it running for around 20 minutes with different payloads. I tested with the same image
Hi @edsiper, I tested this with multiple workload sizes. It can easily handle a load of 2 MB to 5 MB/second. However, it starts crashing randomly when the load is increased to ~10 MB/second. I used the exact same config and data generator from this issue. My environment was Amazon Linux 2. I believe something is wrong with our core networking module.
I have the same issue.
Hi @edsiper, we are getting similar reports from our users with Splunk. The setup is running fBit
Any update/insight on this issue would be appreciated. Thanks.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@pranavmarla @hossain-rayhan Can this issue be closed now/soon? Have we fixed the issues in our latest releases?
Mostly we tested with v1.7.5. I haven't tested with the latest release yet. I also have another open issue, #4040. I will test and update on that issue.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@hossain-rayhan @pranavmarla Is it okay to let this one close automatically?
I have two other open issues tracking the broken pipe and connection errors with the Firehose plugin under high load. Some of the concerns with v1.8.+ might have been fixed upstream (as I don't see any segfault or crash). @pranavmarla can say whether he wants to keep it open and track.
Thanks @PettitWesley and @hossain-rayhan -- as long as the remaining issues are being tracked, it should be fine to let this close.
Cool. I will close it. (The stale label will be ignored because we commented). |
Bug Report
Describe the bug
We are currently doing performance testing, sending a burst of 25,000 logs from Fluent Bit to Kinesis Firehose (via the core kinesis_firehose plugin), and Fluent Bit seems to be consistently experiencing issues sending this many logs to Firehose, ranging from dropping logs to outright crashing -- worryingly, the issues get worse with newer versions of Fluent Bit. Specifically:

- 1.8.0+: Crashes within 20 seconds (segmentation fault); loses logs (only manages to send a fraction of the logs before crashing)
- 1.7.6+: Doesn't crash, but log delivery is inconsistent -- sometimes loses logs, sometimes sends more logs (i.e. sends the same log multiple times, presumably caused by Fluent Bit's retry attempts)
- 1.7.5: Doesn't crash and doesn't lose logs, but seems to always send more/duplicate logs

(See below for more details.)
Note that, if we switch to Amazon's Fluent Bit image (and use Amazon's firehose plugin instead of the core kinesis_firehose plugin), all these issues go away. Specifically, it never crashes, never loses logs, and never sends duplicates; instead, it always sends the exact number of logs that were generated.
So, the issue seems to be with the core kinesis_firehose plugin specifically.

To Reproduce
Our testing is being done on a large Ubuntu EC2 instance. Fluent Bit is present on that EC2, and sends logs to a Kinesis Firehose delivery stream in the same AWS account. To avoid proxy issues, we have created a VPC endpoint for Firehose, so that we can directly send logs from EC2 to Firehose.
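For reference, a Firehose interface endpoint like the one described above can be created with the AWS CLI roughly as follows. This is a sketch only: the region and all of the IDs below are placeholders, not values from our setup.

```sh
# Create an interface VPC endpoint for Kinesis Firehose so the EC2 instance
# can reach Firehose directly (no proxy). All IDs and the region are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.kinesis-firehose \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0
```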
We have a file called /data/perf-test/fakeData.txt containing fake data/logs, where each log is ~1,000 bytes in size.

We also have a script (/data/perf-test/runtest.sh) which essentially reads a certain number of logs per second from /data/perf-test/fakeData.txt and writes them to /data/perf-test/logFolder-fb/test.log, from where Fluent Bit tails them and sends them to Firehose. Thus, to have the above script generate 25,000 logs (5,000 logs/second * 5 seconds) for Fluent Bit to read, we run the following command:
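The exact invocation and the script body are not reproduced above, so the following is a hypothetical sketch. Assuming runtest.sh takes a logs-per-second count and a duration in seconds as its two arguments, the command would be something like:

```sh
/data/perf-test/runtest.sh 5000 5   # assumed interface: 5,000 logs/second for 5 seconds = 25,000 logs
```

And a minimal sketch of what such a generator script could look like under the same assumptions (the head/sleep approach here is illustrative, not taken from this report):

```sh
#!/bin/bash
# Hypothetical sketch of /data/perf-test/runtest.sh: append N fake log lines
# per second, for S seconds, to the file that Fluent Bit tails.
LOGS_PER_SEC="$1"   # e.g. 5000
DURATION_SEC="$2"   # e.g. 5
SRC=/data/perf-test/fakeData.txt
DST=/data/perf-test/logFolder-fb/test.log

for ((i = 0; i < DURATION_SEC; i++)); do
  # Append the first LOGS_PER_SEC lines of the fake-data file once per second
  # to approximate a steady burst.
  head -n "$LOGS_PER_SEC" "$SRC" >> "$DST"
  sleep 1
done
```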
Expected behavior
Since we generated 25,000 logs to a file being tailed by Fluent Bit, we expect Fluent Bit to send exactly 25,000 logs to Firehose. Instead, as mentioned above, depending on which version of (core) Fluent Bit we use, it either crashes, loses logs or sends more/duplicate logs.
If we switch to Amazon's Fluent Bit image and Amazon's firehose plugin (i.e. replace name kinesis_firehose in the above Fluent Bit config with name firehose), then all the issues go away and Fluent Bit behaves as expected -- it sends exactly 25,000 logs to Firehose.
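To make the swap concrete, here is a sketch of the two output stanzas; the Match pattern, region, and delivery stream name are placeholders rather than the values from our actual config:

```
# Core (C) plugin -- placeholder values:
[OUTPUT]
    Name            kinesis_firehose
    Match           *
    region          us-east-1
    delivery_stream my-delivery-stream

# Amazon (Go) plugin -- only the Name changes:
[OUTPUT]
    Name            firehose
    Match           *
    region          us-east-1
    delivery_stream my-delivery-stream
```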
Error Logs

Here are the logs generated by Fluent Bit, for some of the versions we tested:

- 1.8.3: Crashes
- 1.8.2: Crashes
- 1.7.6: Does not crash, but log delivery is inconsistent: sometimes loses logs, sometimes sends extra/duplicate logs
- 1.7.5: Does not crash, does not lose logs, but does send extra/duplicate logs
- 2.19.0 (containing core Fluent Bit v1.8.3) with Amazon firehose plugin: Works as expected -- no crashing, no log loss, no extra/duplicate logs

Your Environment
Core Fluent Bit Docker image versions tested:
1.8.3-debug
1.8.2-debug
1.8.0-debug
1.7.9-debug
1.7.8-debug
1.7.7-debug
1.7.6-debug
1.7.5-debug
Amazon Fluent Bit Docker image versions tested:
2.19.0 (contains Fluent Bit v1.8.3)