
mbedtls upgrade from 2.24.0 to 2.26.0 causes significant performance regression in mbedtls_base64_encode #4110

Closed
krispraws opened this issue Sep 22, 2021 · 17 comments
Comments

@krispraws
Contributor

krispraws commented Sep 22, 2021

Bug Report

Describe the bug
This was discovered while trying to root-cause why my workflow, which reads log files with the tail input and sends them to kinesis.firehose, had consistent connection timeouts, slow DNS lookups, segmentation faults, and general performance issues when trying to process even 500 KB/s. The performance issues only showed up with Fluent Bit versions 1.8.1 and higher; I tested the same setup with v1.7.5 and was able to run it without errors. Some more details in: #4107

I profiled the current master branch code and found that the bulk of the time was spent in mbedtls_base64_cond_assign_uchar, which is called via firehose_api.c:process_event -> mbedtls_base64_encode -> mbedtls_base64_table_lookup -> mbedtls_base64_cond_assign_uchar.

The Firehose plugin uses mbedtls only to base64-encode the request to Firehose, not for certificate parsing or TLS functionality.

When the engine flushes a task to the firehose output plugin, the plugin reads the input data, converts it to JSON, and then base64-encodes it in the request for Kinesis Firehose. The plugin then tries to create a connection (using async I/O), which times out the majority of the time.

Some googling led me to Mbed-TLS/mbedtls#4814 and this PR for it: Mbed-TLS/mbedtls#4819

Downgrading mbedtls back to 2.24.0 leads to dramatically better performance, but that is only a workaround. Has anyone else seen similar performance issues? Do you have suggestions on the best solution?
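For context, a minimal sketch of the difference (illustrative only, not the exact mbedtls source): 2.24.0 resolved each output character with a single table index, while the constant-time lookup discussed in Mbed-TLS/mbedtls#4814 scans the whole 64-entry table and conditionally assigns, turning every encoded character into a 64-iteration loop:

#include <stddef.h>

/* Illustrative sketch only; not the exact mbedtls source. */
static const unsigned char base64_enc_map[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* 2.24.0-style lookup: one array access per output character. */
static unsigned char lookup_direct(unsigned char index)
{
    return base64_enc_map[index];
}

/* 2.26.0-style constant-time lookup: scan the whole table and conditionally
 * assign, so timing does not depend on the (potentially secret) index. */
static unsigned char lookup_const_time(unsigned char index)
{
    unsigned char result = 0;
    size_t i;

    for (i = 0; i < 64; i++) {
        /* mask is 0xFF when i == index, 0x00 otherwise */
        unsigned char mask = (unsigned char) -(unsigned char) (i == index);
        result |= base64_enc_map[i] & mask;
    }
    return result;
}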

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
    The logs mainly show a big burst of new connections created and then most of them timing out.
[2021/09/21 01:51:10] [debug] [output:kinesis_firehose:kinesis_firehose.0] Sending 500 records
[2021/09/21 01:51:10] [debug] [output:kinesis_firehose:kinesis_firehose.0] Sending log records to delivery stream PUT-S3-4vsGw
[2021/09/21 01:51:10] [error] [upstream] connection #-1 to firehose.us-east-1.amazonaws.com:443 timed out after 10 seconds
[2021/09/21 01:51:10] [error] [upstream] connection #-1 to firehose.us-east-1.amazonaws.com:443 timed out after 10 seconds
[2021/09/21 01:51:10] [error] [upstream] connection #-1 to firehose.us-east-1.amazonaws.com:443 timed out after 10 seconds
[2021/09/21 01:51:10] [error] [upstream] connection #-1 to firehose.us-east-1.amazonaws.com:443 timed out after 10 seconds
  • Steps to reproduce the problem:
    Any workflow that sends 500 or more 1 KB log records per second to Fluent Bit through a tail -> kinesis.firehose pipeline will trigger this.

Expected behavior
The performance of the plugin should be consistent (or improve) over Fluent Bit version upgrades for the same workload.

Screenshots

Screen Shot 2021-09-21 at 6 12 01 PM

Your Environment

  • Version used: 1.9.0 (master)
  • Configuration:
[SERVICE]
    Flush           1
    Daemon          off
    Log_Level       debug

[INPUT]
    Name              tail
    Path              /home/ec2-user/flb-data/input_data/*.log
    Path_Key          input_file  
    Offset_Key        input_offset  
    refresh_interval  2
    rotate_wait       5
    db                /home/ec2-user/flb-data/input_db/fluentbit-logs.db
    db.sync           normal
    db.locking        true
    buffer_chunk_size 1MB
    buffer_max_size   50MB
    skip_long_lines   on
    mem_buf_limit     199800KB

[OUTPUT]
    name              kinesis_firehose
    match             *
    region            us-east-1
    delivery_stream   <redacted>
    role_arn          <redacted>            
    Retry_Limit       3
  • Environment name and version (e.g. Kubernetes? What version?): EC2
  • Server type and version: t3.xlarge
  • Operating System and version: Amazon Linux 2
  • Filters and plugins: tail, kinesis.firehose

Additional context
The Firehose plugin uses mbedtls only to base64-encode the request to Firehose, not for certificate parsing or TLS functionality.

@krispraws
Contributor Author

I see mbedtls_base64_encode being used in multiple output plugins: https://github.com/fluent/fluent-bit/search?q=mbedtls_base64_encode

@edsiper
Member

edsiper commented Sep 22, 2021

thanks for pointing this out.

We should definitely get rid of mbedtls; it is actually only used for hashing-type functions. Since we use OpenSSL, I think it is fair enough to write our own wrappers.

@krispraws
Contributor Author

krispraws commented Sep 22, 2021

thanks for pointing this out.

We should definitely get rid of mbedtls; it is actually only used for hashing-type functions. Since we use OpenSSL, I think it is fair enough to write our own wrappers.

We (firehose) were using it for base64 encoding only. Do you have suggestions on a better library that all plugins can use for non-SSL-related base64 encoding? Or should we add an implementation and check it in as source?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 23, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@PettitWesley
Contributor

@edsiper Do you recommend using openssl for base64 encoding instead?

@edsiper
Member

edsiper commented Oct 29, 2021

@PettitWesley I am not an expert on OpenSSL, but from a quick look it seems like a very complex API for a simple task. But if that gives you a performance gain, I would say it is worth trying (we could wrap it).
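For what it's worth, the encode side looks fairly contained with OpenSSL's one-shot EVP_EncodeBlock(). A minimal sketch of such a wrapper; the name flb_wrap_base64_encode and its mbedtls-style signature are assumptions, not an existing Fluent Bit API:

#include <stddef.h>
#include <openssl/evp.h>

/* Hypothetical wrapper name and signature, mirroring mbedtls_base64_encode(). */
static int flb_wrap_base64_encode(unsigned char *dst, size_t dlen, size_t *olen,
                                  const unsigned char *src, size_t slen)
{
    /* base64 expands 3 input bytes to 4 output bytes; EVP_EncodeBlock()
     * also appends a NUL terminator. */
    size_t needed = 4 * ((slen + 2) / 3) + 1;

    if (dlen < needed) {
        *olen = needed;
        return -1;
    }

    *olen = (size_t) EVP_EncodeBlock(dst, src, (int) slen);
    return 0;
}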

@github-actions github-actions bot removed the Stale label Oct 30, 2021
@krispraws
Contributor Author

I am also not an OpenSSL expert, but I think using OpenSSL will cause similar issues in the future when OpenSSL is upgraded. The change in mbedtls was introduced to fix a security vulnerability with SSL. I think for the AWS output plugins, we should roll our own implementation, using the AWS SDK implementation as a reference.

@github-actions
Contributor

github-actions bot commented Dec 2, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 2, 2021
@PettitWesley PettitWesley removed the Stale label Dec 2, 2021
@PettitWesley PettitWesley self-assigned this Dec 2, 2021
@PettitWesley
Contributor

We still need to work on this

@matthewfala
Contributor

Here's a PR to rebrand mbedtls-2.24.0's performant base64 utility as flb_aws_base64_<encode/decode>
#4110

Once this is merged, we can remove the AWS-specific cherry-pick https://github.com/zhonghui12/fluent-bit.git custom-1.8.7 30fc630
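Assuming the new helpers keep the mbedtls_base64_encode() signature (dst buffer, dst length, output length, src, src length), which this thread does not confirm, plugin-side usage would look roughly like:

#include <fluent-bit/flb_mem.h>

/* Hypothetical usage sketch; flb_aws_base64_encode() is assumed to mirror
 * mbedtls_base64_encode(dst, dlen, olen, src, slen). */
static unsigned char *encode_record(const unsigned char *payload, size_t size,
                                    size_t *out_len)
{
    size_t buf_size = 4 * ((size + 2) / 3) + 1;  /* worst-case output + NUL */
    unsigned char *b64 = flb_malloc(buf_size);

    if (b64 == NULL) {
        return NULL;
    }
    if (flb_aws_base64_encode(b64, buf_size, out_len, payload, size) != 0) {
        flb_free(b64);
        return NULL;
    }
    return b64;
}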

@PettitWesley
Contributor

@matthewfala PR link is wrong

@PettitWesley
Contributor

@krispraws What tool is the screenshot of in your report?

@krispraws
Contributor Author

The tool in the screenshot is just a visualizer for the callgrind output. I used qcachegrind, but you can use anything.
The report is generated with:

valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf

@PettitWesley
Contributor

Need upstream review: #4422

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Apr 13, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.
