Improved seekable format ingestion speed for small frame size #3544

Merged (1 commit) on Mar 10, 2023

Cyan4973 (Contributor) commented on Mar 10, 2023

As reported by @P-E-Meunier in #2662 (comment), seekable format ingestion can be particularly slow when the selected `FRAME_SIZE` is very small, especially in combination with the recent row_hash compression mode.
The specific scenario mentioned was pijul, using a frame size of 256 bytes and compression level 10.

This PR improves the situation by providing approximate parameter adaptation to the compression state.

Tested locally on an M1 laptop, ingestion of `enwik8` using the pijul parameters went from 35 sec. (before this PR) to 2.5 sec. (with this PR).
For the specific corner case of a file full of zeroes, the gain is even more pronounced, going from 45 sec. to 0.5 sec.

The benefits remain perceptible at other small frame sizes, for example 4 KB, where the `enwik8` ingestion test improves from 3.6 sec. to 1.8 sec., on top of a small compression ratio gain.

These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the row_hash compression method specifically.

The `seekable_compression` test program has also been updated to accept a compression level, which was needed to produce these performance results.
