Improved seekable format ingestion speed for small frame size #3544
As reported by @P-E-Meunier in #2662 (comment), seekable format ingestion speed can be particularly slow when the selected `FRAME_SIZE` is very small, especially in combination with the recent row_hash compression mode. The specific scenario mentioned was `pijul`, using frame sizes of 256 bytes and level 10.
This is improved in this PR by providing approximate parameter adaptation to the compression state.
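To illustrate why adapting parameters to a small frame size can matter, here is a minimal Python sketch. It is not the code from this PR: the `CParams` fields, the `MIN_WINDOW_LOG` floor, the clamping rules and the starting values are all assumptions chosen for illustration. The idea is only that tables sized for large inputs are far bigger than a 256-byte frame can ever use, so shrinking them reduces per-frame overhead.

```python
from dataclasses import dataclass, replace

# Assumed lower bound on the window size exponent (illustrative only).
MIN_WINDOW_LOG = 10


@dataclass(frozen=True)
class CParams:
    """Hypothetical subset of match-finder parameters, stored as log2 sizes."""
    window_log: int
    hash_log: int
    chain_log: int


def adapt_to_frame_size(params: CParams, frame_size: int) -> CParams:
    """Shrink table sizes so they are no larger than a small frame can use."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    # Smallest power-of-two exponent that covers the frame, with a floor.
    src_log = max(MIN_WINDOW_LOG, (frame_size - 1).bit_length())
    window_log = min(params.window_log, src_log)
    # Assumed caps: hash table at most twice the window, chain table at most the window.
    hash_log = min(params.hash_log, window_log + 1)
    chain_log = min(params.chain_log, window_log)
    return replace(params, window_log=window_log, hash_log=hash_log, chain_log=chain_log)


# Illustrative "high level" parameters, not the library's actual level 10 table.
generic = CParams(window_log=22, hash_log=22, chain_log=22)
small = adapt_to_frame_size(generic, 256)
print(small)
print("hash table entries:", 1 << generic.hash_log, "->", 1 << small.hash_log)
```

With these assumed values, the hash table shrinks from about 4 million entries to 2048 for a 256-byte frame, which gives a rough sense of how much per-frame setup work can be avoided when the same cost is paid for every frame.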
Tested locally on an M1 laptop, ingestion of `enwik8` using `pijul` parameters went from 35 sec. (before this PR) to 2.5 sec. (with this PR).
For the specific corner case of a file full of zeroes, the gain is even more pronounced, going from 45 sec. to 0.5 sec.
The benefits remain perceptible for other small frame sizes, for example 4 KB, where the `enwik8` ingestion test improves from 3.6 sec. to 1.8 sec., on top of a small compression ratio gain.
These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the row_hash compression method specifically.
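For a quick sense of scale, the timings quoted above can be turned into speedup factors and frame counts. This is only arithmetic on the reported numbers, assuming `enwik8` is 100,000,000 bytes and that "4 KB" means 4096 bytes:

```python
ENWIK8_BYTES = 100_000_000  # enwik8: first 10^8 bytes of an English Wikipedia dump

# (label, seconds before, seconds after), as reported in the description.
results = [
    ("enwik8, 256-byte frames, level 10", 35.0, 2.5),
    ("file full of zeroes", 45.0, 0.5),
    ("enwik8, 4 KB frames", 3.6, 1.8),
]

speedups = {label: before / after for label, before, after in results}
for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x faster")

# Number of frames produced for enwik8 (ceiling division).
frames_256 = -(-ENWIK8_BYTES // 256)
frames_4k = -(-ENWIK8_BYTES // 4096)
print("frames at 256 bytes:", frames_256)
print("frames at 4 KB:", frames_4k)
```

The 256-byte case splits `enwik8` into several hundred thousand frames, so any fixed per-frame cost is multiplied accordingly, which is consistent with the largest gains appearing at the smallest frame size.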
The `seekable_compression` test program has also been updated to allow compression level setting, in order to produce these performance results.