
Software pipeline for ZSTD_compressBlock_fast_extDict (+4-9% compression speed) #3114

Merged: 8 commits merged into facebook:dev on May 5, 2022

Conversation

@embg (Contributor) commented Apr 19, 2022

Summary

Using techniques similar to #2749 and #3086, this PR improves level-1 compression speed by 4-9% on a dataset of HTML headers between 8 and 16KB (the dataset has ratio 3.03 without a dictionary and 4.92 with a 110KB dictionary). The dictionary compression speed improvements hold across gcc/clang and a variety of dictionary temperatures and sizes. Additionally, compression speed for level-1 streaming compression (i.e. zstd --single-thread silesia.tar) is improved by 4-5%.

Compression ratio is also slightly improved. For example, the compressed size of silesia.tar is reduced by 0.1%.
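For readers unfamiliar with the technique, below is a minimal sketch of what software pipelining means in a hash-based matchfinder loop. This is an illustration under assumptions, not the actual zstd_fast.c code: the names (toy_hash, pipelined_probe, TABLE_LOG) are hypothetical, and the real extDict loop keeps more state in flight and handles extDict index-space arithmetic.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_LOG 12   /* hypothetical table size; zstd's is configurable */

/* Toy 4-byte hash; zstd uses ZSTD_hashPtr() with a tunable match length. */
static uint32_t toy_hash(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - TABLE_LOG);
}

/* Software-pipelined probe loop: two positions (ip0, ip1) are in flight,
 * so computing the hash for ip1 overlaps the hash-table load for ip0,
 * hiding load-to-use latency instead of serializing each iteration. */
static void pipelined_probe(const uint8_t* src, size_t srcSize,
                            uint32_t* table, const uint8_t* base)
{
    if (srcSize < 9) return;               /* leave room for 4-byte loads */
    const uint8_t* ip0 = src;
    const uint8_t* ip1 = src + 1;
    const uint8_t* const ilimit = src + srcSize - 8;
    uint32_t h0 = toy_hash(ip0);
    while (ip1 < ilimit) {
        uint32_t const h1   = toy_hash(ip1);  /* stage for ip1 ...          */
        uint32_t const idx0 = table[h0];      /* ... overlaps stage for ip0 */
        table[h0] = (uint32_t)(ip0 - base);
        if (idx0 != 0 && memcmp(base + idx0, ip0, 4) == 0) {
            /* candidate match at ip0: a real matchfinder extends it,
             * emits a sequence, and restarts the pipeline afterwards */
        }
        ip0 = ip1;
        ip1 = ip0 + 1;   /* the real loop advances by a variable step */
        h0  = h1;
    }
}
```

The point is that the hash computation for the next position overlaps the table load for the current one, so consecutive iterations are not serialized on memory latency.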

Preliminary results

I have added final measurements below; the preliminary measurements are still here if you are interested:

Dictionary compression

I used the experimental setup described in #3086 (see the section titled "Final Results"). On the 4KB dictionary, speed improvements are 4-5% across all temperatures and compilers. On the larger dictionaries (more realistic), speed improvements are 6-9% across all temperatures and compilers.

Dictionary compression speed improvements at level 1 (preliminary)
/Users/embg/results0415/html_16K/cold/4K
	gcc11: 5.4% (± 0.0%)
	clang12: 4.3% (± 0.1%)

/Users/embg/results0415/html_16K/cold/32K
	gcc11: 8.5% (± 0.0%)
	clang12: 6.2% (± 0.1%)

/Users/embg/results0415/html_16K/cold/110K
	gcc11: 9.1% (± 0.1%)
	clang12: 6.6% (± 0.1%)

/Users/embg/results0415/html_16K/warm/4K
	gcc11: 5.6% (± 0.1%)
	clang12: 4.5% (± 0.1%)

/Users/embg/results0415/html_16K/warm/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.4% (± 0.2%)

/Users/embg/results0415/html_16K/warm/110K
	gcc11: 8.2% (± 0.9%)
	clang12: 6.9% (± 1.0%)

/Users/embg/results0415/html_16K/hot/4K
	gcc11: 5.8% (± 0.1%)
	clang12: 4.6% (± 0.1%)

/Users/embg/results0415/html_16K/hot/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.1% (± 0.1%)

/Users/embg/results0415/html_16K/hot/110K
	gcc11: 7.9% (± 0.1%)
	clang12: 5.7% (± 0.1%)

There were small improvements in compression ratio at all dictionary sizes. The compressed size of the full html16K dataset was reduced as follows: 0.18% with the 110KB dictionary, 0.22% with the 32KB dictionary, and 0.52% with the 4KB dictionary.

Streaming compression

Compression speed was measured via time ./zstd_bin -1 --single-thread silesia.tar. I timed each binary a few times interleaved and took the best time for each one. I tested the same compilers as for dictionary compression (gcc11 and clang12) and used the same setup (same CPU, core isolation, turbo disabled).

| Compiler | 460780f (dev) | 3536262 (this PR) | Improvement |
|----------|---------------|-------------------|-------------|
| gcc11    | 1.49 s user   | 1.41 s user       | 5.4%        |
| clang12  | 1.47 s user   | 1.41 s user       | 4.1%        |

The compressed size of silesia.tar is reduced by 0.17%.

Final results

Dictionary compression

I used the experimental setup described in #3086 (see the section titled "Final Results"). The final dictionary perf numbers are very close to the preliminary numbers, except that the clang win is about 1% higher in most scenarios. The gcc numbers are almost identical.

My follow-up commits on this PR did not touch ratio, so it is unchanged. See the Preliminary results section for a discussion of the ratio impact.

Dictionary compression speed improvements at level 1 (final)
results0502/html_16K/cold/4K
	gcc11: 5.5% (± 0.1%)
	clang12: 5.4% (± 0.1%)

results0502/html_16K/cold/32K
	gcc11: 8.5% (± 0.1%)
	clang12: 6.7% (± 0.1%)

results0502/html_16K/cold/110K
	gcc11: 9.2% (± 0.1%)
	clang12: 7.2% (± 0.1%)

results0502/html_16K/warm/4K
	gcc11: 5.6% (± 0.1%)
	clang12: 5.6% (± 0.1%)

results0502/html_16K/warm/32K
	gcc11: 8.5% (± 0.2%)
	clang12: 6.7% (± 0.2%)

results0502/html_16K/warm/110K
	gcc11: 8.5% (± 0.7%)
	clang12: 7.8% (± 1.3%)

results0502/html_16K/hot/4K
	gcc11: 5.7% (± 0.1%)
	clang12: 5.5% (± 0.0%)

results0502/html_16K/hot/32K
	gcc11: 8.4% (± 0.1%)
	clang12: 6.3% (± 0.2%)

results0502/html_16K/hot/110K
	gcc11: 8.0% (± 0.1%)
	clang12: 5.9% (± 0.1%)

Streaming compression

User time numbers are for zstd <LEVEL> --single-thread /mnt/ramdisk/silesia.tar -o /dev/null

I used core isolation and disabled turbo. The benchmarking method was roughly "run many, many times interleaved; take the lowest number seen at least twice for each binary". Speed variation was around 3% due to background load on the server.

Everything is speed- and ratio-neutral or positive except --fast=3, which trades a 2.7% speed regression for a 2.4% ratio improvement (a very good tradeoff IMO). Also worth noting: --fast=2 gets a significant ratio improvement at no speed cost, and --fast=1 gets a significant speed improvement with no meaningful change in ratio.

| Level    | 460780 gcc11         | ac371b gcc11         | 460780 clang12       | ac371b clang12       |
|----------|----------------------|----------------------|----------------------|----------------------|
| -1       | 1.49 s, 34.69% ratio | 1.41 s, 34.63% ratio | 1.45 s, 34.69% ratio | 1.39 s, 34.63% ratio |
| --fast=1 | 1.35 s, 41.10% ratio | 1.29 s, 41.04% ratio | 1.33 s, 41.10% ratio | 1.27 s, 41.04% ratio |
| --fast=2 | 1.18 s, 43.56% ratio | 1.19 s, 42.68% ratio | 1.17 s, 43.56% ratio | 1.17 s, 42.68% ratio |
| --fast=3 | 1.08 s, 45.77% ratio | 1.11 s, 44.66% ratio | 1.06 s, 45.77% ratio | 1.10 s, 44.66% ratio |

What didn't work

  • A simpler pipeline with only two positions (ip0/ip1). This is what I tried initially, following my DMS pipeline in Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) #3086. The gcc wins were well over 5%, but clang speed regressed by a few percent. I tried a couple of approaches to fix the clang regression (documented here), but none of them worked. In the end, I rewrote the pipeline to be closer to @felixhandte's noDict implementation, which fixed the clang regression.
  • Optimizing the backwards match length calculation. The current implementation extends the match backwards byte by byte, which causes a ton of branch mispredictions whenever the backwards match length is non-zero. I rewrote this code to look back 8 bytes at a time, reducing mispredictions, but the added overhead seems to have slightly outweighed the benefit from better branch prediction, with speed regressions around 1% (see the sketch after this list). My commits are saved here.
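To make the second bullet concrete, here is a hedged sketch of the two backward-extension strategies. The helper names are hypothetical and the boundary handling is simplified relative to the real match path in zstd_fast.c; the word-at-a-time variant assumes the gcc/clang __builtin_clzll intrinsic and little-endian byte order.

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte backward extension (the approach that was kept): one
 * hard-to-predict branch per matching byte when the match extends back. */
static size_t extend_back_bytewise(const uint8_t* ip, const uint8_t* match,
                                   const uint8_t* istart, const uint8_t* mstart)
{
    size_t back = 0;
    while (ip > istart && match > mstart && ip[-1] == match[-1]) {
        ip--; match--; back++;
    }
    return back;
}

/* Word-at-a-time variant (the approach that was tried and rejected):
 * compare 8 bytes per step; on a mismatch, count the leading zero bits of
 * the XOR to locate the first differing byte when scanning backwards
 * (little-endian: the byte nearest ip sits in the most significant lane). */
static size_t extend_back_wordwise(const uint8_t* ip, const uint8_t* match,
                                   const uint8_t* istart, const uint8_t* mstart)
{
    size_t back = 0;
    while (ip - istart >= 8 && match - mstart >= 8) {
        uint64_t a, b;
        memcpy(&a, ip - 8, 8);
        memcpy(&b, match - 8, 8);
        uint64_t const x = a ^ b;
        if (x != 0)
            return back + (size_t)(__builtin_clzll(x) >> 3);
        ip -= 8; match -= 8; back += 8;
    }
    /* fewer than 8 bytes of headroom left: finish byte by byte */
    return back + extend_back_bytewise(ip, match, istart, mstart);
}
```

Per the measurements described above, the extra setup in the second version cost slightly more than the branch mispredictions it saved.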

Future work

Compression speed could be optimized for cold dictionaries by prefetching the dictionary content starting from dictEnd. The hash table is warm because this is extDict, but in a cold-dictionary situation the dictionary content itself is still cold. I did not pursue this optimization because my code does not have cold-dictionary gating, and I am not sure how prevalent cold extDict compression is; depending on its prevalence, this might be worth pursuing. A rough sketch of the idea follows.
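A hedged sketch of what that prefetch could look like (hypothetical helper; __builtin_prefetch is the gcc/clang intrinsic, and the gating question from the paragraph above is left open):

```c
#include <stddef.h>

#define CACHE_LINE 64

/* Hypothetical cold-dictionary prefetch: walk the extDict content backwards
 * from dictEnd, one prefetch per cache line, so the highest-index (most
 * recently referenced) dictionary bytes arrive first. Only worthwhile when
 * the dictionary content is actually cold, hence the need for gating. */
static void prefetch_dict_content(const char* dictStart, const char* dictEnd)
{
    const char* p = dictEnd;
    while ((size_t)(p - dictStart) >= CACHE_LINE) {
        p -= CACHE_LINE;
        __builtin_prefetch(p, 0 /* read */, 1 /* low temporal locality */);
    }
}
```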

@Cyan4973 (Contributor) commented Apr 19, 2022

Great results @embg !

You'll just have to update the regression tests using the new (improved) compression ratio of this PR.

I also have a few minor questions left, but that shouldn't change the outcome much. This PR is in good shape.

@embg (Contributor, Author) commented Apr 19, 2022

> Great results @embg !
>
> You'll just have to update the regression tests using the new (improved) compression ratio of this PR.
>
> I also have a few minor questions left, but that shouldn't change the outcome much. This PR is in good shape.

Thanks! @felixhandte and I looked at the regression test numbers, and it seems like the negative levels might have some ratio regressions (I haven't looked closely yet to see whether they are serious). This is consistent with what Felix encountered while doing the noDict pipeline. I am going to produce graphs like Felix did in #2921 to see whether the negative-level regression is an acceptable speed-ratio tradeoff (that's how Felix was able to land the noDict pipeline).

@embg (Contributor, Author) commented Apr 20, 2022

Added a commit addressing comments so far:

  • Optimizes the repcode predicate (checks for invalid offsets before entering the loop)
  • Adds an optimized variant for the case hasStep == 0 (see the sketch after this list)
  • Fixes some cosmetic nits (scope of mval, etc.)
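For context on the hasStep == 0 variant: zstd_fast.c stamps out specialized copies of the generic matchfinder via macro-generated wrappers (ZSTD_GEN_FAST_FN), so a compile-time-constant step lets the compiler fold the step logic away. Below is a simplified sketch of the pattern, with hypothetical names and a trivial loop body standing in for the real matchfinder:

```c
#include <stddef.h>

typedef struct ToyMatchState ToyMatchState;  /* stand-in for ZSTD_matchState_t */

/* Generic loop, parameterized on mls and step. When called with literal
 * arguments from a tiny wrapper, the compiler inlines it and constant-folds
 * the step logic, so the hasStep == 0 variant pays no dispatch cost. */
static size_t fast_extDict_generic(ToyMatchState* ms, const void* src,
                                   size_t srcSize, int mls, int step)
{
    size_t pos = 0;
    size_t const advance = (step == 0) ? 1 : (size_t)step;  /* step==0: default fixed advance */
    (void)ms; (void)mls;
    while (pos + 8 < srcSize) {
        /* ... hash, probe, match, emit ... */
        pos += advance;
    }
    return pos;
}

/* Stamp out one specialized wrapper per (mls, step) pair, in the spirit of
 * zstd's ZSTD_GEN_FAST_FN macro in zstd_fast.c. */
#define GEN_FAST_FN(mls, step)                                         \
    static size_t fast_extDict_##mls##_##step(ToyMatchState* ms,       \
            const void* src, size_t srcSize)                           \
    {                                                                  \
        return fast_extDict_generic(ms, src, srcSize, mls, step);      \
    }

GEN_FAST_FN(4, 0)   /* the common case: hasStep == 0 */
GEN_FAST_FN(4, 7)   /* example variant with a caller-specified step */
```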

Comments I didn't address:

  • Adding an UNLIKELY() macro -- I will try this when I pull the repcode check out into a documented helper function (separate PR, June or July)
  • Didn't add comments explaining the change in the repcode predicate to exclude match0 == prefixStart. I will explain this in the helper function, which will document this predicate across all dictionary matchfinders.

Remaining steps to land:

  • Address any remaining comments (@felixhandte @Cyan4973 are there any?)
  • Re-run previous measurements
  • Measure speed-ratio tradeoff at negative levels, and if it is acceptable update the regression test csv file

@Cyan4973 (Contributor) commented:

> Comments I didn't address:
>
> • Adding an UNLIKELY() macro -- I will try this when I pull the repcode check out into a documented helper function (separate PR, June or July)

My bad, I thought this macro already existed within the zstd code base and one could just employ it. But I grepped and only found it in the xxhash sub-module, not in zstd proper.

No need to add that as part of this PR.

@Cyan4973 (Contributor) commented:

Only waiting for the documentation on negative compression levels, and then an update of the regression csv file.

Other than that, this PR seems in good shape and good to go.

@felixhandte (Contributor) left a comment:

Overall this looks solid! (I guess it's not surprising that I would be fine with a refactor that mirrors one I wrote...)

As discussed, please provide final benchmark numbers.

@embg (Contributor, Author) commented Apr 26, 2022

Addressed @felixhandte's comments in 518cb83. The main update is hardcoding the repcode safety check in the variant where hasStep == 0 (note that those two things are not directly connected, but they are both the "common scenario", so they should share one variant). See my measurements here justifying the variant.

Will do full measurements and update the regression csv file tomorrow.

Edit: "directly" -> "not directly", typo

@embg (Contributor, Author) commented Apr 27, 2022

It seems like my last commit (with the full implementation of the repcode variant plus cosmetic changes) regressed perf pretty significantly:

GCC
./largeNbDicts_353626_gcc11
148.7
148.5
148.6

./largeNbDicts_2820ef_gcc11
148.7
148.3
148.4

./largeNbDicts_809f65_gcc11
148.2
148.3
148.3

./largeNbDicts_518cb8_gcc11
146.4
146.5
146.4
Clang
./largeNbDicts_353626_clang12
150.5
150.5
150.6

./largeNbDicts_2820ef_clang12
150.6
150.7
150.6

./largeNbDicts_809f65_clang12
150.4
150.2
150.2

./largeNbDicts_518cb8_clang12
149.3
149.3
149.4

I will revert that now and put up a new commit tomorrow removing the hasStep variant (since the cost of the variant was only justified using numbers for hasStep + repcode).

@embg force-pushed the fast_extdict_pipeline2 branch from ca50e4f to ac371be on April 28, 2022
@Cyan4973 (Contributor) commented May 4, 2022

I'm trying to understand the evolution of performance for negative compression levels, and there seems to be something off.

When looking at the updated regression test figures, a picture emerges for the compression ratio of negative compression levels:

  • --fast=1 is similar (though not identical), typically very slightly worse (does not matter)
  • the new --fast=3 has a compression ratio roughly on par with the old --fast=5, hence "worse", which doesn't matter if the speed of the new --fast=3 increases to become similar to the old --fast=5
  • the new --fast=5 is much worse. There is no "old" --fast level to compare it to, but the gap is substantial. Here also, it's not a problem if speed increases by a similar amount.

The problem is that the report at the top of this PR tells a completely different story. I quote:

> Everything is speed- and ratio-neutral or positive except --fast=3, which trades a 2.7% speed regression for a 2.4% ratio improvement

So, that's a completely different picture.

How do we reconcile?

@terrelln (Contributor) commented May 4, 2022

I only see that regression in results.csv for the github files. If you look at silesia, you see a neutral-to-positive ratio.

Since the GitHub dataset is so compressible, it could be very sensitive to small changes in the positions searched. If you look at github.tar at level -5 vs. level -5 with a dictionary, we can see that the dictionary still wins a little. I think what is likely happening is that we have slightly perturbed the positions we search, and suddenly we're mostly searching "useless" positions instead of "useful" ones. Though that is just a theory.

We should just be sure to measure compression ratio on several datasets to make sure that the github example is an outlier, and not the general case.

@embg (Contributor, Author) commented May 5, 2022

> How do we reconcile?

@Cyan4973 Sorry for the miscommunication -- that quote is referring to the table below it, which only covers silesia (following the same ratio/speed tradeoff estimation method as #2921). I'll add a discussion of results.csv to the PR summary later today.

html16K and silesia are both ratio-positive at all levels (silesia we can see in results.csv; html16K I measured myself). github is also positive at level 1, but has a major regression at level -5 (and minor regressions at levels -1 and -2).

Why is github special? As @terrelln pointed out, it has extremely long matches, making it very sensitive to hash table collisions. Especially in dictionary compression, additional hash table writes can overwrite long matches in the dictionary, trading them for much less valuable indices in the prefix.

And this PR does add an additional hash table write, following @felixhandte's noDict PR (sketched below). I did a quick experiment deleting that line of code, and the github dataset went from a major regression to a major ratio win. However, github.tar still regresses, though less than before.
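To make the discussion concrete, here is a hedged sketch of the kind of write in question; the exact position written in the real code differs, and what matters is the eviction side effect:

```c
#include <stdint.h>

/* Illustrative only: an extra hash-table insertion in the match path,
 * in the style of the noDict pipeline. Each write overwrites whatever
 * index previously lived in that slot; when a dictionary is attached,
 * the evicted index may point at a long dictionary match, so on highly
 * compressible data (like the github set) the fresh prefix index can
 * cost more ratio than it gains. */
static void extra_insert(uint32_t* hashTable, uint32_t h,
                         const uint8_t* ip, const uint8_t* base)
{
    hashTable[h] = (uint32_t)(ip - base);
}
```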

Here are three results.csv files corresponding to dev, this PR with the extra write, and this PR without the extra write: https://gist.github.com/embg/064f088a0bf7e5585f6114f2ef5219d2 (all three files are in one gist). I used a regex to filter extraneous lines: grep -E '(level 1|level \-[0-9]+)( |,)' results.csv.

Deleting that write is unlikely to affect performance (since it's outside the hot loop), so this is just a question of which CSV file you prefer :). Just let me know if you want me to delete the extra hash table write and I'll quickly redo the perf measurements; otherwise I'll land as-is.

@embg (Contributor, Author) commented May 5, 2022

@Cyan4973 Here are the numbers for html16K at negative compression levels with a 32K dictionary (this is for the code as-is, not the modification I mentioned in my previous comment):

  • --fast=1: 25517077 bytes opt, 25554192 bytes dev
  • --fast=3: 27725969 bytes opt, 31154011 bytes dev
  • --fast=5: 30530273 bytes opt, 33781945 bytes dev

This is an 8-9% ratio win at levels -3 and -5. html16K is more representative of real-world data since it has a more realistic compression ratio (3-4x at level 1, depending on dictionary size, IIRC), so I would give this win more weight than the github loss.

Based on these numbers my vote would be to land the current version, but I understand if you prefer to focus on mitigating the github ratio regression instead (by deleting that extra hash table write). Just let me know which approach you prefer!

@Cyan4973 (Contributor) commented May 5, 2022

OK, that's fine. I think we are ready to accept that github.tar is a fairly specific scenario, more impacted by the randomness of searches at negative compression levels, and not representative of other use cases.

There is a remaining issue in a FreeBSD test, but it seems unrelated to this PR.

@felixhandte (Contributor) left a comment:
Looks good to ship to me!

@embg (Contributor, Author) commented May 5, 2022

Thanks @felixhandte @Cyan4973 for the detailed reviews! I will fix the FreeBSD infra failure sometime next week. That job passed on all commits before I updated results.csv, so the failure is clearly unrelated; therefore I'll go ahead and merge.

@embg embg merged commit 7915c11 into facebook:dev May 5, 2022
@embg embg changed the title Software pipeline for ZSTD_compressBlock_fast_extDict Software pipeline for ZSTD_compressBlock_fast_extDict (+4-9% compression speed) Jun 23, 2022
@nadavrot commented Sep 7, 2022

Nice!
