Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluent Bit latest version (v1.8.9) throws broken pipe and connection errors on high load #4332

Closed
hossain-rayhan opened this issue Nov 16, 2021 · 12 comments

Comments

@hossain-rayhan
Copy link
Contributor

hossain-rayhan commented Nov 16, 2021

Bug Report

Describe the bug
We are doing a performance test to send logs to Amazon Kinesis Firehose using the core kinesis_firehose plugin. We are sending logs at a rate of 20Mb/sec for 600 seconds where each record is ~1Kb. FluentBit versions v1.8.0 and higher seem to throw a lot of Broken pipe and network connectivity issues. However older versions work fine and can send all the logs to Kinesis Firehose without any issue. I can confirm v1.7.5 with an SSL error fix works fine.

  • v1.8.9: throws Broken Pipe and network connection errors.
  • v1.7.5: works fine.

To Reproduce
We are testing on an Amazon Linux 2 machine. Fluent Bit is present on that EC2, and sends logs to a Kinesis Firehose delivery stream in the same AWS account.

Fluent Bit installation: We cloned a specific branch from GitHub and build it following the README guideline. Then we ran it with our config file. We used the following commands:

  1. git clone https://github.com/fluent/fluent-bit.git
  2. cd fluent-bit/build
  3. cmake ..
  4. make
  5. bin/fluent-bit -c /data/fluent-bit.conf

fluent-bit.conf:

[SERVICE]
    Flush           1
    Daemon          off
    Log_Level       debug
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020
    storage.metrics On
[INPUT]
    Name              tail
    Path              /home/ec2-user/fluent-bit/local-run/input/*.log
    refresh_interval  2
    rotate_wait       5
    db                /home/ec2-user/fluent-bit/local-run/db/fluentbit-logs.db
    db.sync           normal
    db.locking        true
    buffer_chunk_size 40MB
    buffer_max_size   1GB
    skip_long_lines   on
    mem_buf_limit     1GB
[OUTPUT]
    name                kinesis_firehose
    match               *
    region              us-west-2
    delivery_stream     fluent-bit-perf-test
    workers             5
    Retry_Limit         5

Expected behavior
FluentBit should sent all data to Kinesis Firehose without any error

Your Environment
Version used: [v1.8.9, v1.7.5]
Environment name and version: Amazon EC2
Operating System and version: Amazon Linux 2
Filters and plugins: tail input plugin, kinesis_firehsoe output plugin.

Error Logs
Error logs from v1.8.9.

[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:37] [ warn] [engine] failed to flush chunk '7061-1636497561.773063216.flb', retry in 7 seconds: task_id=30, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [ warn] [engine] failed to flush chunk '7061-1636497561.809548140.flb', retry in 11 seconds: task_id=35, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [ warn] [engine] failed to flush chunk '7061-1636497562.855397523.flb', retry in 11 seconds: task_id=39, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [error] [upstream] connection #119 to firehose.us-west-2.amazonaws.com:443 timed out after 10 seconds
[2021/11/09 22:39:38] [error] [aws_client] connection initialization error
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:39] [ warn] [engine] failed to flush chunk '7061-1636497563.955488763.flb', retry in 6 seconds: task_id=54, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:40] [ warn] [engine] failed to flush chunk '7061-1636497564.990213728.flb', retry in 9 seconds: task_id=59, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:41] [ warn] [engine] failed to flush chunk '7061-1636497565.24677815.flb', retry in 10 seconds: task_id=64, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:42] [error] [upstream] connection #168 to firehose.us-west-2.amazonaws.com:443 timed out after 10 seconds
[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records

Reproduce Steps:

  1. Build Fluent Bit locally: Run cd fluent-bit/build and make
  2. Run Fluent Bit with the above config: bin/fluent-bit -c /home/ec2-user/fluent-bit/local-run/fluent-bit.conf
  3. In another terminal, run the load generator python script: python3 generate.py > /home/ec2-user/fluent-bit/local-run/input/input.log

generate.py:

import time
LARGE_TEXT="Katherine Stone sphillips@shields.netwuPmcdIIIlPKQddacCTDMHedOKrxhgUOTyVDUjmExZqqwRmGKSwakHiKDMTlQiochemist wuPmcdIIIlPKQddacCTDMHedOKrxhgUOTyVDUjmExZqqwRmGKSwakHiKDMTlQzSvmnNmSgsJkJmtpHDBkICOrKNiNGJYftCIgNuQopxZZMXxGGXLUyCNyuWhzCKUCuuXKhxotmyulQExiufWfjQdiDwbRDUctByhAZcJPrlbGlInbpYcwCAQeJJBOZOEnKlsAOqYtNAueXfAeXFzEtssxZUTVIFTjjlspeJiBggwYuAtwlXzSScLcQNkkFUtCpGZhdPVrpiyNlmdcKkqpIjQIVjRmnnKBvzOSSPvXHLzhOeRzApvmaJtJIkYYLMhftbLioTbnWpXGIzkMzWRVimRrCJkRaqtttLcOiaOekPQgYdrByRPIZMMwfTKdftfZPHKIHJeYryUljCVolxZFdYDihpRHHFJlEwvfViRouHYZPcUihkbnVSkQLGlGzLPHLodovHJjrVqifkdNssyspCGpGHFNSeLguqpWIxMWhJMLkLDizCtqAOzDccveDvowyLrRlEECVGjqrkFTIIIOntAEXgnheqVqLnJaFBBTWcdKdlhzeixnfRZgmorTXKxeHaDDGZAWPhIpiRArxXQkArJdBjCtGNuOoDdNBMgGTLKbbEnsXWUiuEZXELXpbOfJuBSnuaAtUOlTafuSgmoPgMUjBYeaIOplfSzqRNKLnZBzZrzoAmxnfvSosZbDefkeabOAvtrXNFHYdORBJyuzjtpURVmTRzMKjwJhImztWkGcFYEqRpEjfeWfLelRaUNTiLPINZhEfcZagnLPcZrjPAATOVLOiSpwooU F Engineer 3 105000 500 "
# Command: date; time python3 generate.py > /home/ec2-user/fluent-bit/local-run/input/input.log; date
UNIQUE_ID_COUNTER=10000000
i=0
while i < 300: # number of iteration or seconds: 5 minutes
    seconds_since_epoch = time.time()
    j=0
    while j < 20000: # log records generated in every iteration: 20k log records = ~20Mb/s
        print('%d_%.3f_%s' % (UNIQUE_ID_COUNTER, seconds_since_epoch, LARGE_TEXT))
        UNIQUE_ID_COUNTER=UNIQUE_ID_COUNTER+1
        j=j+1
    i=i+1
    time.sleep(1)
@leonardo-albertovich
Copy link
Collaborator

Hi @hossain-rayhan, I'll take a look at this but it'd be good if you could help me out a bit, is there any chance you can contact me in the fluent slack server? That would probably make the reproduction and RCA process way faster.

My username there is Leonardo Almiñana.

Cheers.

@hossain-rayhan
Copy link
Contributor Author

Pinged on slack. Let me know how I can help.

@hossain-rayhan
Copy link
Contributor Author

Following @edsiper suggestion, I also tested with the latest branch 1.8 and observed similar behavior. I can confirm it still happens with the latest codebase.

@hossain-rayhan
Copy link
Contributor Author

@leonardo-albertovich I added reproduce steps in the issue describption. Please check and let me know if it helps. I also tested with the latest branch 1.8 and still observing the same behavior.

@hossain-rayhan
Copy link
Contributor Author

One more observaion is, it works fine when the load is 5 Mb/s. However, when the load is 8Mb/s, it starts to throw the Broken Pipe and connection errors.

@leonardo-albertovich Did you get time to check the reproduce steps?

@lubingfeng
Copy link

@leonardo-albertovich @edsiper this issue is blocking some AWS customers from upgrading to 1.8.x and use multiline support feature. Can you expedite this?

@hossain-rayhan
Copy link
Contributor Author

I tested for 10 minutes as well. Still the same behavior.

Also I don't think it's Firehose specific issue becuse the older Fluent Bit v1.7.5 works fine even for a load up to 50Mb/s.

Also, how do define the debug message [debug] [socket] could not validate socket status for #135 (don't worry)

Error Logs:

[2021/11/23 22:04:35] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/23 22:04:35] [debug] [aws_client] firehose.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2021/11/23 22:04:35] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/23 22:04:35] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/23 22:04:35] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/23 22:04:35] [debug] [socket] could not validate socket status for #135 (don't worry)

@DeepspaceFreelancer
Copy link

Might be related that firehose autoscales, and in the beginning there is just not enough capacity to send it through.

As for why it works for an older version, I do not know. Somehow that one sends it differently or does not reports the broken pipes.

I am also seeing this, however there is not much data loss on the other side. I did capture the data in the firehose and mostly it did came through...

@DeepspaceFreelancer
Copy link

DeepspaceFreelancer commented Dec 1, 2021

I tested for 10 minutes as well. Still the same behavior.

Ok, I am seeing now that you responded to my hurried and then deleted comment.

One can:

  1. Search the logs of the last 5 minutes.
  2. Or see if the data despite the errors come through or not.

And sorry to just talk and not test more deterministically and substantially...

@PettitWesley
Copy link
Contributor

@matthewfala is working on this.

@matthewfala
Copy link
Contributor

We are working on a solution to this issue which you can track also here: aws/aws-for-fluent-bit#288.
Several solutions have been attempted to resolve the problem. And a potential solution is in testing.

If you would like to test the patch which may resolve your issues, please use the following image: 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

If you try out the patch, please let us know if problems are resolved.

There may be some instabilities with the patch, since it is still in testing. If you find any, please let me know and potentially post the debug logs so it can be resolved.

@matthewfala
Copy link
Contributor

The proposed solution above was accepted and merged into Fluent Bit 1.9 #4869. This should resolve broken pipe and connection errors on Kinesis Streams, Firehose, and S3 plugins as well as other, non-aws Fluent Bit plugins.

The solution prioritizes internal Fluent Bit events to complete already started tasks above starting new tasks. This helps keep delay times minimal by reducing the amount of concurrent work.

@PettitWesley This issue can be resolved now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants