Row hash tag space initialization speed regression #3528

yoniko · 2023-03-07T19:15:03Z

This issue tracks the work started in #2971 and #3426, along with any follow-up work.

Context
Row hash is a fast SIMD-based hash used by various strategies in Zstd.
In addition to the normal hash entries, it requires extra space for tags, which are derived from the hash and allow entries within a bucket to be filtered further.
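
To illustrate the tag filtering described above, here is a minimal scalar sketch. It is not zstd's actual code: the row width, the tag width, and the names `row_index`, `row_tag`, and `match_mask` are assumptions made for illustration, and the loop stands in for what a SIMD byte compare would do in one step.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_ENTRIES 16  /* assumed row (bucket) width, for illustration only */
#define TAG_BITS 8      /* assumed tag width, for illustration only */

/* Upper bits of the hash select the row. */
static uint32_t row_index(uint32_t hash)
{
    return hash >> TAG_BITS;
}

/* Low TAG_BITS bits of the hash become the tag stored next to the entry. */
static uint8_t row_tag(uint32_t hash)
{
    return (uint8_t)(hash & ((1u << TAG_BITS) - 1u));
}

/* Returns a bitmask with bit i set when tags[i] equals the wanted tag.
 * Only entries whose bit is set need a full comparison afterwards. */
static uint32_t match_mask(const uint8_t tags[ROW_ENTRIES], uint8_t wanted)
{
    uint32_t mask = 0;
    assert(tags != 0);
    for (int i = 0; i < ROW_ENTRIES; i++) {
        if (tags[i] == wanted) mask |= 1u << i;
    }
    return mask;
}
```

Note that a zero-filled tag row matches a wanted tag of 0 in every slot, so whatever the tag space contains before the first insertion directly affects how many candidates get checked.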

When streaming data of unknown size (for example, using `ZSTD_compressStream`), we don't have a good way to choose a hashlog, so we pick a large one. This, in turn, means we need to initialize a large tag space.
This creates a noticeable regression when compressing small inputs.
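
A rough way to see why small inputs suffer: the amount of tag space to clear depends on the hashlog, not on the input size. The helper below is a hypothetical sketch, and the number of bytes per tag is left as a parameter because this issue does not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of tag space for a table with 2^hashLog entries,
 * assuming bytesPerTag bytes of tag per entry (an assumption). */
static size_t tag_space_bytes(unsigned hashLog, size_t bytesPerTag)
{
    assert(hashLog < 8 * sizeof(size_t));
    return ((size_t)1 << hashLog) * bytesPerTag;
}
```

Under these assumptions, a hashlog of 17 with one byte per tag means clearing 128 KiB before compressing, even if the input is only a few hundred bytes.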

A few attempts have been made to fix this. #2971 simply removes the initialization, but this is problematic with Valgrind and might introduce another regression, because consecutive compressions can get "false positives" from the tags left behind by previous compressions.

#3426 expands on #2971: it introduces memory regions that are guaranteed to have been initialized at least once (and therefore do not trigger Valgrind), and it salts the hash to avoid collisions with stale tags. However, it was rather complex.
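
As a sketch of the salting idea, one possible construction is to XOR a per-compression salt into a multiplicative hash before truncating it. This is an assumed construction for illustration; the function name `salted_hash4`, the constant, and the way the salt is mixed in are not confirmed by this issue.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative hash of a 4-byte value, XORed with a salt and then
 * truncated to the top hashBits bits. With a different salt per
 * compression, the same input bytes map to different hashes (and so
 * different tags), which makes stale tags unlikely to match. */
static uint32_t salted_hash4(uint32_t value, unsigned hashBits, uint32_t salt)
{
    assert(hashBits >= 1 && hashBits <= 32);
    return ((value * 2654435761u) ^ salt) >> (32 - hashBits);
}
```

Because only the top `hashBits` bits survive the shift, the salt needs entropy in its upper bits for this particular construction to have any effect.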

Intended solution
Break down #3426 into multiple PRs and possibly remove some of the functionality introduced there.
This is an umbrella issue that ties the broken-down PRs together.



- Adds memory type that is guaranteed to have been initialized at least once in the workspace's lifetime.
- Changes tag space in row hash to be based on init once memory.
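
The "initialized at least once" idea can be sketched as a region that remembers how many of its bytes have ever been written, and zeroes only the part that has not. The type `InitOnceRegion` and the function `reserve_init_once` are hypothetical names for illustration, not the workspace API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char* base;  /* start of the init-once region */
    size_t capacity;      /* total bytes available */
    size_t initialized;   /* bytes [0, initialized) were written at least once */
} InitOnceRegion;

/* Reserves `size` bytes from the start of the region. Only bytes that
 * were never written before are zeroed; older bytes keep whatever the
 * previous user left there. Returns NULL if `size` exceeds capacity. */
static void* reserve_init_once(InitOnceRegion* r, size_t size)
{
    assert(r != NULL && r->base != NULL);
    if (size > r->capacity) return NULL;
    if (size > r->initialized) {
        memset(r->base + r->initialized, 0, size - r->initialized);
        r->initialized = size;
    }
    return r->base;
}
```

In this sketch, repeated reservations of the same size cost nothing after the first one, and no byte is ever read before it has been written. The price is that stale content survives between uses, which is the situation the hash salt is meant to handle.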

Part 2 of #3528. Adds a hash salt that helps to avoid regressions where consecutive compressions use the same tag space with similar data (running `zstd -b5e7 enwik8 -B128K` reproduces this regression).

yoniko self-assigned this Mar 7, 2023

yoniko mentioned this issue Mar 7, 2023

Introduce salt into row hash (#3528 part 2) #3533

Merged

yoniko mentioned this issue Mar 8, 2023

Add init once memory (#3528) #3529

Merged

This was linked to pull requests Mar 8, 2023

Add init once memory (#3528) #3529

Merged

Introduce salt into row hash (#3528 part 2) #3533

Merged

yoniko mentioned this issue Mar 10, 2023

Reduce RowHash's tag space size by x2 #3543

Merged

yoniko closed this as completed in #3529 Mar 13, 2023

yoniko added a commit that referenced this issue Mar 13, 2023

Add init once memory (#3528) (#3529)

9420bce

- Adds memory type that is guaranteed to have been initialized at least once in the workspace's lifetime.
- Changes tag space in row hash to be based on init once memory.

yoniko mentioned this issue Mar 13, 2023

[WIP] fix #2966 part 2 : do not initialize tag space #2971

Closed
