Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row hash tag space initialization speed regression #3528

Closed
yoniko opened this issue Mar 7, 2023 · 0 comments · Fixed by #3529 or #3533
Closed

Row hash tag space initialization speed regression #3528

yoniko opened this issue Mar 7, 2023 · 0 comments · Fixed by #3529 or #3533
Assignees

Comments

@yoniko
Copy link
Contributor

yoniko commented Mar 7, 2023

This issue will be used to track the work that has been started in #2971 and #3426 and further work.

Context
Row hash is a fast SIMD-based hash used by various strategies in Zstd.
Other than the normal hash entries it requires an additional space for tags that are hash based and allow further filtration of entries in a bucket.

When streaming data of unknown size (for example, using ZSTD_compressStream) we don't have a good way to choose a hashlog and so we pick a large one. This, in turn, makes it so we need to initialize a large tag space.
This creates a noticeable regression when compressing small inputs.

A few attempts have been made to fix this, #2971 just removes the initialization but is problematic with Valgrind and might introduce another regression due to the consecutive compressions getting "false positives" from previous compressions' tags.

#3426 expands on #2971 and introduces memory regions that have been initialized at least once (thus not triggering Valgrind) and salts the hash to avoid collisions. However, it was rather complex.

Intended solution
Break down #3426 into multiple PRs and possibly remove some of the functionality introduced there.
This is a grandfather issue so we can tie the broken-down PRs together.

@yoniko yoniko self-assigned this Mar 7, 2023
yoniko added a commit to yoniko/zstd that referenced this issue Mar 7, 2023
This helps to avoid regressions where consecutive compressions use the same tag space with similar data (running `zstd -b5e7 enwik8 -B128K` reproduces this regression).
yoniko added a commit to yoniko/zstd that referenced this issue Mar 7, 2023
This helps to avoid regressions where consecutive compressions use the same tag space with similar data (running `zstd -b5e7 enwik8 -B128K` reproduces this regression).
yoniko added a commit to yoniko/zstd that referenced this issue Mar 8, 2023
This helps to avoid regressions where consecutive compressions use the same tag space with similar data (running `zstd -b5e7 enwik8 -B128K` reproduces this regression).
This was linked to pull requests Mar 8, 2023
yoniko added a commit to yoniko/zstd that referenced this issue Mar 9, 2023
This helps to avoid regressions where consecutive compressions use the same tag space with similar data (running `zstd -b5e7 enwik8 -B128K` reproduces this regression).
yoniko added a commit that referenced this issue Mar 13, 2023
- Adds memory type that is guaranteed to have been initialized at least once in the workspace's lifetime.
- Changes tag space in row hash to be based on init once memory.
yoniko added a commit that referenced this issue Mar 13, 2023
Part 2 of #3528

Adds hash salt that helps to avoid regressions where consecutive compressions use the same tag space with similar data (running zstd -b5e7 enwik8 -B128K reproduces this regression).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant