
Software pipeline for ZSTD_compressBlock_fast_extDict (+4-9% compression speed) #3114

Merged: 8 commits merged into facebook:dev on May 5, 2022

Conversation

@embg (Contributor) commented Apr 19, 2022

Summary

Using techniques similar to #2749 and #3086, this PR improves level-1 compression speed by 4-9% on a dataset of HTML headers between 8 and 16KB (the dataset has ratio 3.03 without a dictionary and 4.92 with a 110KB dictionary). The dictionary compression speed improvements hold across gcc/clang and a variety of dictionary temperatures and sizes. Additionally, compression speed for level-1 streaming compression (i.e. zstd --single-thread silesia.tar) is improved by 4-5%.

Compression ratio is also slightly improved. For example, the compressed size of silesia.tar is reduced by 0.1%.
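For readers unfamiliar with the technique, below is a minimal sketch of what software pipelining means in a hash-based matchfinder loop. This is an illustration under assumptions, not the actual zstd_fast.c code: the names (toy_hash, pipelined_probe, TABLE_LOG) are hypothetical, and the real extDict loop keeps more state in flight and handles extDict index-space arithmetic.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_LOG 12   /* hypothetical table size; zstd's is configurable */

/* Toy 4-byte hash; zstd uses ZSTD_hashPtr() with a tunable match length. */
static uint32_t toy_hash(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - TABLE_LOG);
}

/* Software-pipelined probe loop: two positions (ip0, ip1) are in flight,
 * so computing the hash for ip1 overlaps the hash-table load for ip0,
 * hiding load-to-use latency instead of serializing each iteration. */
static void pipelined_probe(const uint8_t* src, size_t srcSize,
                            uint32_t* table, const uint8_t* base)
{
    if (srcSize < 9) return;               /* leave room for 4-byte loads */
    const uint8_t* ip0 = src;
    const uint8_t* ip1 = src + 1;
    const uint8_t* const ilimit = src + srcSize - 8;
    uint32_t h0 = toy_hash(ip0);
    while (ip1 < ilimit) {
        uint32_t const h1   = toy_hash(ip1);  /* stage for ip1 ...          */
        uint32_t const idx0 = table[h0];      /* ... overlaps stage for ip0 */
        table[h0] = (uint32_t)(ip0 - base);
        if (idx0 != 0 && memcmp(base + idx0, ip0, 4) == 0) {
            /* candidate match at ip0: a real matchfinder extends it,
             * emits a sequence, and restarts the pipeline afterwards */
        }
        ip0 = ip1;
        ip1 = ip0 + 1;   /* the real loop advances by a variable step */
        h0  = h1;
    }
}
```

The point is that the hash computation for the next position overlaps the table load for the current one, so consecutive iterations are not serialized on memory latency.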

Preliminary results

I have added final measurements below; the preliminary measurements are still here if you are interested:

Dictionary compression

I used the experimental setup described in #3086 (see the section titled "Final Results"). On the 4KB dictionary, speed improvements are 4-5% across all temperatures and compilers. On the larger dictionaries (more realistic), speed improvements are 6-9% across all temperatures and compilers.

Dictionary compression speed improvements at level 1 (preliminary)
/Users/embg/results0415/html_16K/cold/4K
	gcc11: 5.4% (± 0.0%)
	clang12: 4.3% (± 0.1%)

/Users/embg/results0415/html_16K/cold/32K
	gcc11: 8.5% (± 0.0%)
	clang12: 6.2% (± 0.1%)

/Users/embg/results0415/html_16K/cold/110K
	gcc11: 9.1% (± 0.1%)
	clang12: 6.6% (± 0.1%)

/Users/embg/results0415/html_16K/warm/4K
	gcc11: 5.6% (± 0.1%)
	clang12: 4.5% (± 0.1%)

/Users/embg/results0415/html_16K/warm/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.4% (± 0.2%)

/Users/embg/results0415/html_16K/warm/110K
	gcc11: 8.2% (± 0.9%)
	clang12: 6.9% (± 1.0%)

/Users/embg/results0415/html_16K/hot/4K
	gcc11: 5.8% (± 0.1%)
	clang12: 4.6% (± 0.1%)

/Users/embg/results0415/html_16K/hot/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.1% (± 0.1%)

/Users/embg/results0415/html_16K/hot/110K
	gcc11: 7.9% (± 0.1%)
	clang12: 5.7% (± 0.1%)

There were small improvements in compression ratio at all dictionary sizes. The compressed size of the full html16K dataset was reduced as follows: 0.18% with the 110KB dictionary, 0.22% with the 32KB dictionary, and 0.52% with the 4KB dictionary.

Streaming compression

Compression speed was measured via time ./zstd_bin -1 --single-thread silesia.tar. I timed each binary a few times interleaved and took the best time for each one. I tested the same compilers as for dictionary compression (gcc11 and clang12) and used the same setup (same CPU, core isolation, turbo disabled).

| Compiler | 460780f (dev) | 3536262 (this PR) | Improvement |
|----------|---------------|-------------------|-------------|
| gcc11    | 1.49 s user   | 1.41 s user       | 5.4%        |
| clang12  | 1.47 s user   | 1.41 s user       | 4.1%        |

The compressed size of silesia.tar is reduced by 0.17%.

Final results

Dictionary compression

I used the experimental setup described in #3086 (see the section titled "Final Results"). The final dictionary perf numbers are very close to the preliminary numbers, except that the clang win is about 1% higher in most scenarios. The gcc numbers are almost identical.

My follow-up commits on this PR did not touch ratio, so it is unchanged. See the Preliminary results section for a discussion of the ratio impact.

Dictionary compression speed improvements at level 1 (final)
results0502/html_16K/cold/4K
	gcc11: 5.5% (± 0.1%)
	clang12: 5.4% (± 0.1%)

results0502/html_16K/cold/32K
	gcc11: 8.5% (± 0.1%)
	clang12: 6.7% (± 0.1%)

results0502/html_16K/cold/110K
	gcc11: 9.2% (± 0.1%)
	clang12: 7.2% (± 0.1%)

results0502/html_16K/warm/4K
	gcc11: 5.6% (± 0.1%)
	clang12: 5.6% (± 0.1%)

results0502/html_16K/warm/32K
	gcc11: 8.5% (± 0.2%)
	clang12: 6.7% (± 0.2%)

results0502/html_16K/warm/110K
	gcc11: 8.5% (± 0.7%)
	clang12: 7.8% (± 1.3%)

results0502/html_16K/hot/4K
	gcc11: 5.7% (± 0.1%)
	clang12: 5.5% (± 0.0%)

results0502/html_16K/hot/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.3% (± 0.2%)

results0502/html_16K/hot/110K
	gcc11: 8.0% (± 0.1%)
	clang12: 5.9% (± 0.1%)

Streaming compression

User time numbers are for zstd <LEVEL> --single-thread /mnt/ramdisk/silesia.tar -o /dev/null

I used core isolation and disabled turbo. The benchmarking method was roughly "run many, many times interleaved; take the lowest number seen at least twice for each binary". Speed variation was around 3% due to background load on the server.

Everything is speed- and ratio-neutral or positive except --fast=3, which trades a 2.7% speed regression for a 2.4% ratio improvement (a very good tradeoff IMO). Also worth noting: --fast=2 gets a significant ratio improvement at no speed cost, and --fast=1 gets a significant speed improvement with no meaningful change in ratio.

| Level    | 460780 gcc11         | ac371b gcc11         | 460780 clang12       | ac371b clang12       |
|----------|----------------------|----------------------|----------------------|----------------------|
| -1       | 1.49 s, 34.69% ratio | 1.41 s, 34.63% ratio | 1.45 s, 34.69% ratio | 1.39 s, 34.63% ratio |
| --fast=1 | 1.35 s, 41.10% ratio | 1.29 s, 41.04% ratio | 1.33 s, 41.10% ratio | 1.27 s, 41.04% ratio |
| --fast=2 | 1.18 s, 43.56% ratio | 1.19 s, 42.68% ratio | 1.17 s, 43.56% ratio | 1.17 s, 42.68% ratio |
| --fast=3 | 1.08 s, 45.77% ratio | 1.11 s, 44.66% ratio | 1.06 s, 45.77% ratio | 1.10 s, 44.66% ratio |

What didn't work

  • A simpler pipeline with only two positions (ip0/ip1). This is what I tried initially, following my DMS pipeline in Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) #3086. The gcc wins were well over 5%, but clang speed regressed by a few percent. I tried a couple of approaches to fix the clang regression (documented here), but none of them worked. In the end, I rewrote the pipeline to be closer to @felixhandte's noDict implementation, which fixed the clang regression.
  • Optimizing the backwards match length calculation. The current implementation extends the match backwards byte by byte, which causes a ton of branch mispredictions whenever the backwards match length is non-zero. I rewrote this code to look back 8 bytes at a time, reducing mispredictions, but the added overhead seems to have slightly outweighed the benefit from better branch prediction, with speed regressions around 1% (see the sketch after this list). My commits are saved here.
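To make the second bullet concrete, here is a hedged sketch of the two backward-extension strategies. The helper names are hypothetical and the boundary handling is simplified relative to the real match path in zstd_fast.c; the word-at-a-time variant assumes the gcc/clang __builtin_clzll intrinsic and little-endian byte order.

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte backward extension (the approach that was kept): one
 * hard-to-predict branch per matching byte when the match extends back. */
static size_t extend_back_bytewise(const uint8_t* ip, const uint8_t* match,
                                   const uint8_t* istart, const uint8_t* mstart)
{
    size_t back = 0;
    while (ip > istart && match > mstart && ip[-1] == match[-1]) {
        ip--; match--; back++;
    }
    return back;
}

/* Word-at-a-time variant (the approach that was tried and rejected):
 * compare 8 bytes per step; on a mismatch, count the leading zero bits of
 * the XOR to locate the first differing byte when scanning backwards
 * (little-endian: the byte nearest ip sits in the most significant lane). */
static size_t extend_back_wordwise(const uint8_t* ip, const uint8_t* match,
                                   const uint8_t* istart, const uint8_t* mstart)
{
    size_t back = 0;
    while (ip - istart >= 8 && match - mstart >= 8) {
        uint64_t a, b;
        memcpy(&a, ip - 8, 8);
        memcpy(&b, match - 8, 8);
        uint64_t const x = a ^ b;
        if (x != 0)
            return back + (size_t)(__builtin_clzll(x) >> 3);
        ip -= 8; match -= 8; back += 8;
    }
    /* fewer than 8 bytes of headroom left: finish byte by byte */
    return back + extend_back_bytewise(ip, match, istart, mstart);
}
```

Per the measurements described above, the extra setup in the second version cost slightly more than the branch mispredictions it saved.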

Future work

Compression speed could be optimized for cold dictionaries by prefetching the dictionary content starting from dictEnd. The hash table is warm because this is extDict, but in a cold-dictionary situation the dictionary content itself is still cold. I did not pursue this optimization because my code does not have cold-dictionary gating, and I am not sure how prevalent cold extDict compression is; depending on its prevalence, this might be worth pursuing. A rough sketch of the idea follows.
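A hedged sketch of what that prefetch could look like (hypothetical helper; __builtin_prefetch is the gcc/clang intrinsic, and the gating question from the paragraph above is left open):

```c
#include <stddef.h>

#define CACHE_LINE 64

/* Hypothetical cold-dictionary prefetch: walk the extDict content backwards
 * from dictEnd, one prefetch per cache line, so the highest-index (most
 * recently referenced) dictionary bytes arrive first. Only worthwhile when
 * the dictionary content is actually cold, hence the need for gating. */
static void prefetch_dict_content(const char* dictStart, const char* dictEnd)
{
    const char* p = dictEnd;
    while ((size_t)(p - dictStart) >= CACHE_LINE) {
        p -= CACHE_LINE;
        __builtin_prefetch(p, 0 /* read */, 1 /* low temporal locality */);
    }
}
```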

@Cyan4973 (Contributor) commented Apr 19, 2022

Great results @embg !

You'll just have to update the regression tests using the new (improved) compression ratio of this PR.

I also have a few minor questions left, but that shouldn't change the outcome much. This PR is in good shape.

@embg (Contributor, Author) commented Apr 19, 2022

> Great results @embg !
>
> You'll just have to update the regression tests using the new (improved) compression ratio of this PR.
>
> I also have a few minor questions left, but that shouldn't change the outcome much. This PR is in good shape.

Thanks! @felixhandte and I looked at the regression test numbers, and it seems like the negative levels might have some ratio regressions (I haven't looked closely yet to see whether they are serious). This is consistent with what Felix encountered while doing the noDict pipeline. I am going to produce graphs like Felix did in #2921 to see whether the negative-level regression is an acceptable speed-ratio tradeoff (that's how Felix was able to land the noDict pipeline).

@embg (Contributor, Author) commented Apr 20, 2022

Added a commit addressing comments so far:

  • Optimizes the repcode predicate (checks for invalid offsets before entering the loop)
  • Adds an optimized variant for the case hasStep == 0 (see the sketch after this list)
  • Fixes some cosmetic nits (scope of mval, etc.)
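For context on the hasStep == 0 variant: zstd_fast.c stamps out specialized copies of the generic matchfinder via macro-generated wrappers (ZSTD_GEN_FAST_FN), so a compile-time-constant step lets the compiler fold the step logic away. Below is a simplified sketch of the pattern, with hypothetical names and a trivial loop body standing in for the real matchfinder:

```c
#include <stddef.h>

typedef struct ToyMatchState ToyMatchState;  /* stand-in for ZSTD_matchState_t */

/* Generic loop, parameterized on mls and step. When called with literal
 * arguments from a tiny wrapper, the compiler inlines it and constant-folds
 * the step logic, so the hasStep == 0 variant pays no dispatch cost. */
static size_t fast_extDict_generic(ToyMatchState* ms, const void* src,
                                   size_t srcSize, int mls, int step)
{
    size_t pos = 0;
    size_t const advance = (step == 0) ? 1 : (size_t)step;  /* step==0: default fixed advance */
    (void)ms; (void)mls;
    while (pos + 8 < srcSize) {
        /* ... hash, probe, match, emit ... */
        pos += advance;
    }
    return pos;
}

/* Stamp out one specialized wrapper per (mls, step) pair, in the spirit of
 * zstd's ZSTD_GEN_FAST_FN macro in zstd_fast.c. */
#define GEN_FAST_FN(mls, step)                                         \
    static size_t fast_extDict_##mls##_##step(ToyMatchState* ms,       \
            const void* src, size_t srcSize)                           \
    {                                                                  \
        return fast_extDict_generic(ms, src, srcSize, mls, step);      \
    }

GEN_FAST_FN(4, 0)   /* the common case: hasStep == 0 */
GEN_FAST_FN(4, 7)   /* example variant with a caller-specified step */
```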

Comments I didn't address:

  • Adding an UNLIKELY() macro -- I will try this when I pull the repcode check out into a documented helper function (separate PR, June or July)
  • Didn't add comments explaining the change in the repcode predicate to exclude match0 == prefixStart. I will explain this in the helper function, which will document this predicate across all dictionary matchfinders.

Remaining steps to land:

  • Address any remaining comments (@felixhandte @Cyan4973 are there any?)
  • Re-run previous measurements
  • Measure speed-ratio tradeoff at negative levels, and if it is acceptable update the regression test csv file

@Cyan4973 (Contributor) commented:

> Comments I didn't address:
>
> • Adding an UNLIKELY() macro -- I will try this when I pull the repcode check out into a documented helper function (separate PR, June or July)

My bad, I thought this macro already existed within the zstd code base and one could just employ it. But I grepped and only found it in the xxhash sub-module, not in zstd proper.

No need to add that as part of this PR.

@Cyan4973 (Contributor) commented:

Only waiting for the documentation on negative compression levels, and then an update of the regression csv file.

Other than that, this PR seems in good shape and good to go.

@felixhandte (Contributor) left a comment:

Overall this looks solid! (I guess it's not surprising that I would be fine with a refactor that mirrors one I wrote...)

As discussed, please provide final benchmark numbers.

@embg (Contributor, Author) commented Apr 26, 2022

Addressed @felixhandte's comments in 518cb83. The main update is hardcoding the repcode safety check in the variant where hasStep == 0 (note that those two things are not directly connected, but they are both the "common scenario", so they should share one variant). See my measurements here justifying the variant.

Will do full measurements and update the regression csv file tomorrow.

Edit: "directly" -> "not directly", typo

@embg (Contributor, Author) commented Apr 27, 2022

It seems like my last commit (with the full implementation of the repcode variant plus cosmetic changes) regressed perf pretty significantly:

GCC
./largeNbDicts_353626_gcc11
148.7
148.5
148.6

./largeNbDicts_2820ef_gcc11
148.7
148.3
148.4

./largeNbDicts_809f65_gcc11
148.2
148.3
148.3

./largeNbDicts_518cb8_gcc11
146.4
146.5
146.4
Clang
./largeNbDicts_353626_clang12
150.5
150.5
150.6

./largeNbDicts_2820ef_clang12
150.6
150.7
150.6

./largeNbDicts_809f65_clang12
150.4
150.2
150.2

./largeNbDicts_518cb8_clang12
149.3
149.3
149.4

I will revert that now and put up a new commit tomorrow removing the hasStep variant (since the cost of the variant was only justified using numbers for hasStep + repcode).

@embg force-pushed the fast_extdict_pipeline2 branch from ca50e4f to ac371be on April 28, 2022
@Cyan4973 (Contributor) commented May 4, 2022

I'm trying to understand the evolution of performance for negative compression levels, and there seems to be something off.

When looking at the updated regression test figures, a picture emerges for the compression ratio of negative compression levels:

  • --fast=1 is similar (though not identical), typically very slightly worse (does not matter)
  • the new --fast=3 has a compression ratio roughly on par with the old --fast=5, hence "worse", which doesn't matter if the speed of the new --fast=3 increases to become similar to the old --fast=5
  • the new --fast=5 is much worse. There is no "old" --fast level to compare it to, but the gap is substantial. Here also, it's not a problem if speed increases by a similar amount.

The problem is that the report at the top of this PR tells a completely different story. I quote:

> Everything is speed- and ratio-neutral or positive except --fast=3, which trades a 2.7% speed regression for a 2.4% ratio improvement

So, that's a completely different picture.

How do we reconcile?

@terrelln (Contributor) commented May 4, 2022

I only see that regression in results.csv for the github files. If you look at silesia, you see a neutral-to-positive ratio.

Since the GitHub dataset is so compressible, it could be very sensitive to small changes in the positions searched. If you look at github.tar at level -5 vs. level -5 with a dictionary, we can see that the dictionary still wins a little. I think what is likely happening is that we have slightly perturbed the positions we search, and suddenly we're mostly searching "useless" positions instead of "useful" ones. Though that is just a theory.

We should just be sure to measure compression ratio on several datasets to make sure that the github example is an outlier, and not the general case.

@embg (Contributor, Author) commented May 5, 2022

> How do we reconcile?

@Cyan4973 Sorry for the miscommunication -- that quote is referring to the table below it, which only covers silesia (following the same ratio/speed tradeoff estimation method as #2921). I'll add a discussion of results.csv to the PR summary later today.

html16K and silesia are both ratio-positive at all levels (silesia we can see in results.csv; html16K I measured myself). github is also positive at level 1, but has a major regression at level -5 (and minor regressions at levels -1 and -2).

Why is github special? As @terrelln pointed out, it has extremely long matches, making it very sensitive to hash table collisions. Especially in dictionary compression, additional hash table writes can overwrite long matches in the dictionary, trading them for much less valuable indices in the prefix.

And this PR does add an additional hash table write, following @felixhandte's noDict PR (sketched below). I did a quick experiment deleting that line of code, and the github dataset went from a major regression to a major ratio win. However, github.tar still regresses, though less than before.
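To make the discussion concrete, here is a hedged sketch of the kind of write in question; the exact position written in the real code differs, and what matters is the eviction side effect:

```c
#include <stdint.h>

/* Illustrative only: an extra hash-table insertion in the match path,
 * in the style of the noDict pipeline. Each write overwrites whatever
 * index previously lived in that slot; when a dictionary is attached,
 * the evicted index may point at a long dictionary match, so on highly
 * compressible data (like the github set) the fresh prefix index can
 * cost more ratio than it gains. */
static void extra_insert(uint32_t* hashTable, uint32_t h,
                         const uint8_t* ip, const uint8_t* base)
{
    hashTable[h] = (uint32_t)(ip - base);
}
```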

Here are three results.csv files corresponding to dev, this PR with the extra write, and this PR without the extra write: https://gist.github.com/embg/064f088a0bf7e5585f6114f2ef5219d2 (all three files are in one gist). I used a regex to filter extraneous lines: grep -E '(level 1|level \-[0-9]+)( |,)' results.csv.

Deleting that write is unlikely to affect performance (since it's outside the hot loop), so this is just a question of which CSV file you prefer :). Just let me know if you want me to delete the extra hash table write and I'll quickly redo the perf measurements; otherwise I'll land as-is.

@embg (Contributor, Author) commented May 5, 2022

@Cyan4973 Here are the numbers for html16K at negative compression levels with a 32K dictionary (this is for the code as-is, not the modification I mentioned in my previous comment):

  • --fast=1: 25517077 bytes opt, 25554192 bytes dev
  • --fast=3: 27725969 bytes opt, 31154011 bytes dev
  • --fast=5: 30530273 bytes opt, 33781945 bytes dev

This is an 8-9% ratio win at levels -3 and -5. html16K is more representative of real-world data since it has a more realistic compression ratio (3-4x at level 1, depending on dictionary size, IIRC), so I would give this win more weight than the github loss.

Based on these numbers my vote would be to land the current version, but I understand if you prefer to focus on mitigating the github ratio regression instead (by deleting that extra hash table write). Just let me know which approach you prefer!

@Cyan4973 (Contributor) commented May 5, 2022

OK, that's fine. I think we are ready to accept that github.tar is a fairly specific scenario, more impacted by the randomness of searches at negative compression levels, and not representative of other use cases.

There is a remaining issue in a FreeBSD test, but it seems unrelated to this PR.

@felixhandte (Contributor) left a comment:
Looks good to ship to me!

@embg (Contributor, Author) commented May 5, 2022

Thanks @felixhandte @Cyan4973 for the detailed reviews! I will fix the FreeBSD infra failure sometime next week. That job passed on all commits before I updated results.csv, so the failure is clearly unrelated; therefore I'll go ahead and merge.

@embg embg merged commit 7915c11 into facebook:dev May 5, 2022
@embg embg changed the title Software pipeline for ZSTD_compressBlock_fast_extDict Software pipeline for ZSTD_compressBlock_fast_extDict (+4-9% compression speed) Jun 23, 2022
@nadavrot commented Sep 7, 2022

Nice!
