mbedtls upgrade from 2.24.0 to 2.26.0 causes significant performance regression in mbedtls_base64_encode #4110
Comments
I see.
Thanks for pointing this out. We should definitely get rid of mbedtls; it is actually only used for hashing-type functions. Since we use OpenSSL, I think it is fair enough to write our own wrappers.
We (firehose) were using it for base64 encoding only. Do you have suggestions on a better library that all plugins can use for non-SSL-related base64 encoding? Or should we add an implementation and check it in as source?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity. |
@edsiper Do you recommend using openssl for base64 encoding instead?
@PettitWesley I am not an expert on OpenSSL, but from a quick look it seems like a very complex API for a simple task. But if that gives you some performance gain, it is worth trying I would say (we could wrap it).
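For reference, a wrapper around OpenSSL's one-shot base64 routine could stay quite small. Below is a minimal sketch, assuming OpenSSL's EVP_EncodeBlock (which emits base64 without line breaks); the helper name and allocation strategy are hypothetical, not an existing Fluent Bit API:

```c
#include <stdlib.h>
#include <openssl/evp.h>

/* Hypothetical wrapper: encode src (slen bytes) into a newly allocated,
 * NUL-terminated base64 buffer. Sets *olen to the encoded length and
 * returns the buffer, or NULL on allocation failure. */
static unsigned char *b64_encode_openssl(const unsigned char *src, size_t slen,
                                         size_t *olen)
{
    /* EVP_EncodeBlock writes 4 output bytes per 3 input bytes plus a NUL */
    size_t dlen = 4 * ((slen + 2) / 3) + 1;
    unsigned char *dst = malloc(dlen);

    if (dst == NULL) {
        return NULL;
    }
    *olen = (size_t) EVP_EncodeBlock(dst, src, (int) slen);
    return dst;
}
```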
I am also not an OpenSSL expert, but I think using OpenSSL will cause similar issues in the future when OpenSSL is upgraded. The change in mbedtls was introduced to fix a security vulnerability with SSL. I think for the AWS output plugins, we should roll our own implementation using the AWS SDK implementation as a reference.
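If we do roll our own, the classic table-lookup encoder is short. A rough sketch follows (a plain, non-constant-time lookup, which should be acceptable for non-secret data such as log payloads; the function name is illustrative only, not the PR's actual code):

```c
#include <stddef.h>

/* Illustrative table-lookup base64 encoder (not constant-time). dst must
 * hold at least 4 * ((slen + 2) / 3) + 1 bytes. NUL-terminates the output
 * and returns the encoded length. */
static size_t b64_encode_plain(unsigned char *dst, const unsigned char *src,
                               size_t slen)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i, j = 0;

    for (i = 0; i + 2 < slen; i += 3) {
        dst[j++] = tbl[src[i] >> 2];
        dst[j++] = tbl[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
        dst[j++] = tbl[((src[i + 1] & 0x0f) << 2) | (src[i + 2] >> 6)];
        dst[j++] = tbl[src[i + 2] & 0x3f];
    }
    if (i < slen) {                          /* one or two bytes left over */
        dst[j++] = tbl[src[i] >> 2];
        if (i + 1 < slen) {
            dst[j++] = tbl[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
            dst[j++] = tbl[(src[i + 1] & 0x0f) << 2];
        }
        else {
            dst[j++] = tbl[(src[i] & 0x03) << 4];
            dst[j++] = '=';
        }
        dst[j++] = '=';
    }
    dst[j] = '\0';
    return j;
}
```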
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
We still need to work on this.
Here's a PR to rebrand mbedtls-2.24.0's performant base64 utility as flb_aws_base64_<encode/decode>. Once this is merged, we can remove the AWS-specific cherry-pick: https://github.com/zhonghui12/fluent-bit.git custom-1.8.7 30fc630
@matthewfala PR link is wrong
@krispraws What tool is the screenshot of in your report?
The tool in the screenshot is just a visualizer for the callgrind output. I used qcachegrind - you can use anything.
Need upstream review: #4422
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the
This issue was closed because it has been stalled for 5 days with no activity. |
Bug Report
Describe the bug
This was discovered while trying to root-cause why my workflow that reads log files using the tail input and the kinesis.firehose output had consistent connection timeouts, slow DNS lookups, segmentation faults and general performance issues when trying to process even 500 KB/s. The performance issues only showed up for Fluent Bit versions 1.8.1 or higher. I tested the same setup with v1.7.5 and was able to run it without errors. Some more details in: #4107

I profiled the current master branch code and found that the bulk of the time was spent in mbedtls_base64_cond_assign_uchar, which is called via firehose_api.c:process_event -> mbedtls_base64_encode -> mbedtls_base64_table_lookup -> mbedtls_base64_cond_assign_uchar.
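For context on where the cycles go: the mbedtls change makes the base64 table lookup constant-time, so each output byte scans the whole alphabet with branchless conditional assignments instead of doing a single indexed load. Roughly (a paraphrased illustration, not the actual mbedtls source):

```c
/* Paraphrased illustration of a constant-time table lookup: every encoded
 * byte walks the full 64-entry alphabet and uses a mask to pick the match,
 * so the cost is ~64 operations per output byte instead of one array index. */
static unsigned char ct_table_lookup(const unsigned char *table,
                                     size_t table_len, size_t index)
{
    unsigned char result = 0;
    size_t i;

    for (i = 0; i < table_len; i++) {
        /* mask is 0xFF when i == index and 0x00 otherwise (no branch) */
        unsigned char mask = (unsigned char) -(unsigned char) (i == index);
        result = (unsigned char) ((result & (unsigned char) ~mask) |
                                  (table[i] & mask));
    }
    return result;
}
```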
The Firehose plugin uses mbedtls to base64 encode the request to Firehose, not for certificate parsing or TLS functionality.
When the engine flushes a task to the firehose output plugin, the plugin reads the input data, converts it to JSON, and then base64 encodes it in the request for Kinesis Firehose. The firehose plugin then tries to create a connection (using async I/O) and it times out the majority of the time.
Some googling led me to Mbed-TLS/mbedtls#4814 and this PR for it: Mbed-TLS/mbedtls#4819
Downgrading the version back to 2.24.0 leads to dramatically better performance. But it is only a workaround. Has anyone else seen similar performance issues? Do you have suggestions on the best solution?
To Reproduce
The logs mainly show a big burst of new connections created and then most of them timing out.
Any workflow that sends 500 or more 1 KB log records per second to Fluent Bit using a tail -> kinesis.firehose pipeline will trigger this.

Expected behavior
The performance of the plugin should be consistent (or improve) over Fluent Bit version upgrades for the same workload.
Screenshots
Your Environment
tail, kinesis.firehose
Additional context
Firehose plugin uses mbedtls to base64 encode the request to Firehose and not for certificate parsing or TLS functionality.