decompression: worse throughput when using tower_http::decompression than manual impl with async-compression #520

Closed

@magurotuna (Contributor)

  • I have looked for existing issues (including closed) about this

Bug Report

I've seen situations where decompression throughput gets significantly worse when using `tower_http::decompression` compared to manually implementing similar logic with the `async-compression` crate.

Version

Platform

Apple silicon macOS

(Not 100% sure, but it should happen on Linux as well.)

Description

In Deno, we switched the internal implementation of `fetch` (the JavaScript API) from reqwest-based to hyper-util-based.

denoland/deno#24593

The hyper-util-based implementation uses `tower_http::decompression` to decompress the fetched data when necessary. Note that reqwest does not use tower_http.

After this change, we started to see degraded throughput, especially when the server serves large compressed data. Look at the following graph, which shows how long each Deno version takes to serve 2k requests, where it fetches compressed data from the upstream server and forwards it to the end client.

[Figure: performance of proxying compressed data in different versions of Deno]

v1.45.2 is from before we switched to the hyper-util-based fetch implementation. As of v1.45.3, when we landed the switch, throughput got 10x worse.

I then identified `tower_http::decompression` as the cause, and found that if we implement the decompression logic by directly using the `async-compression` crate, performance returns to what it was. (See denoland/deno#25800 for how the manual implementation with async-compression affects performance.)

You can find how I performed the benchmark at https://github.com/magurotuna/deno_fetch_decompression_throughput

magurotuna added a commit to magurotuna/tower-http that referenced this issue Sep 22, 2024
Currently, every time `WrapBody::poll_frame` is called, a new instance of
`BytesMut` is created with the default capacity, which is effectively
64 bytes. This results in a lot of memory allocations in certain
situations, making throughput significantly worse.

To reduce allocations, `WrapBody` now holds a `BytesMut` as a field,
with an initial capacity of 4096 bytes. This buffer is reused as much as
possible across multiple `poll_frame` calls; only when its capacity
reaches 0 is a new allocation of another 4096 bytes performed.

Fixes: tower-rs#520