Software pipeline for ZSTD_compressBlock_fast_extDict (+4-9% compression speed) #3114
Conversation
Great results @embg! You'll just have to update the regression tests using the new (improved) compression ratio of this PR. I also have a few minor questions left, but that shouldn't change the outcome much. This PR is in good shape.
Thanks! @felixhandte and I looked at the regression test numbers and it seems like the negative levels might have some ratio regressions (haven't looked closely yet to see if they are serious or not). This is consistent with what Felix encountered while doing the noDict pipeline. I am going to produce graphs like Felix did in #2921 to see if the regression for negative levels is an acceptable speed-ratio tradeoff (that's how Felix was able to land the noDict pipeline).
Added a commit addressing comments so far:
Comments I didn't address:
Remaining steps to land:
My bad, no need to add that as part of this PR.
Only waiting for the documentation on negative compression levels, and then an update of the regression tests. Other than that, this PR seems in good shape and good to go.
Overall this looks solid! (I guess it's not surprising I would be fine with a refactor that mirrors one I wrote...)
As discussed, please provide final benchmark numbers.
Addressed @felixhandte's comments in 518cb83. The main update is hardcoding the repcode safety check in the variant where [...]. Will do full measurements and update the regression csv file tomorrow. Edit: "directly" -> "not directly", typo
It seems like my last commit (with the full implementation of the repcode variant + cosmetic stuff) regressed perf pretty significantly:
[GCC and Clang benchmark tables]
I will revert that now and put up a new commit tomorrow removing the hasStep variant (since the cost of the variant was only justified using numbers for hasStep + repcode).
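For context on what removing a variant means here, below is a minimal sketch of the compile-time specialization pattern these variants rely on. All names (`compressBlock_generic`, `hasStep`, `checkRepcode`) are invented for illustration; this is not the zstd source, which (if I recall correctly) generates its variants via force-inlined templates and wrapper macros. The point is that every extra flag doubles the number of emitted loops, so a variant has to pay for itself in benchmarks:

```c
#include <stddef.h>

/* Illustrative sketch only -- invented names, not the real zstd source.
 * The generic body takes variant flags as parameters; because each thin
 * wrapper passes compile-time constants, the compiler folds the branches
 * away and emits one specialized loop per variant. */
static size_t compressBlock_generic(const unsigned char* src, size_t srcSize,
                                    int hasStep, int checkRepcode)
{
    size_t ip = 0, work = 0;
    const size_t step = 2;            /* stand-in for the computed step */
    while (ip + 4 <= srcSize) {
        if (checkRepcode) {
            /* repcode safety check would sit here; when checkRepcode is a
             * constant 0, this branch disappears entirely */
        }
        work += src[ip];              /* stand-in for real match finding */
        ip += hasStep ? step : 1;     /* folds to a constant increment */
    }
    return work;
}

/* One wrapper per variant; the constants propagate into the inlined body. */
size_t compressBlock_noStep(const unsigned char* s, size_t n)
{
    return compressBlock_generic(s, n, /*hasStep*/0, /*checkRepcode*/1);
}
size_t compressBlock_hasStep(const unsigned char* s, size_t n)
{
    return compressBlock_generic(s, n, /*hasStep*/1, /*checkRepcode*/1);
}
```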
I'm trying to understand the evolution of performance for negative compression levels when looking at the updated regression test figures. Problem is, when looking at the report at the top of this PR, the story described is completely different. So, that's a completely different picture. How do we reconcile?
I only see that regression in the github set. Since GitHub is so compressible, it could be very sensitive to small changes to the positions searched. If you look at [...]. We should just be sure to measure compression ratio on several datasets to make sure that the github example is an outlier, and not the general case.
@Cyan4973 Sorry for the miscommunication -- that quote is referring to the table below it, which only covers silesia (following the same ratio/speed tradeoff estimation method as #2921). I'll add a discussion of [...].
html16K and silesia are both ratio-positive at all levels (silesia we can see in results.csv; html16K I measured myself). github is also positive at level 1, but has a major regression at level -5 (plus minor regressions at levels -1 and -2).
Why is github special? As @terrelln pointed out, it has extremely long matches, making it very sensitive to hash table collisions. Especially for dictionary compression, adding additional hash table writes can overwrite long matches in the dictionary, trading them for much less valuable indices in the prefix. And in this PR we do add an additional hash table write, following @felixhandte's noDict PR.
I did a quick experiment deleting that line of code, and the github dataset goes from a major regression to a major ratio win. However, github.tar is still a regression, though smaller than before. Here are three [...].
Deleting that write is unlikely to affect performance (since it's outside the hot loop), so this is just a question of which CSV file you prefer :). Just let me know if you want me to delete that extra hash table write and I'll quickly redo perf measurements; otherwise I'll land as-is.
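To make the mechanism concrete, here is a hedged sketch of the kind of extra hash-table write being discussed. Everything in it (`hash4`, `insertAfterMatch`, the position choices) is invented for illustration and is not the zstd code; the point is only that each write into a small, collision-prone table can evict an index that pointed at a long dictionary match:

```c
#include <stdint.h>
#include <string.h>

#define HASH_LOG  15
#define HASH_SIZE (1u << HASH_LOG)

/* Toy 4-byte multiplicative hash, the same flavor fast match finders use. */
static uint32_t hash4(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* After emitting a match of length mLength at position cur, insert table
 * entries so future searches can find nearby matches. Each slot holds one
 * index, so a write can evict whatever was there before -- including an
 * index into the dictionary that would have yielded a long match. The
 * second write below plays the role of the "additional" write discussed
 * above (the exact position choice is illustrative). Caller must ensure
 * cur + mLength + 2 bytes are readable from base. */
static void insertAfterMatch(uint32_t* hashTable, const uint8_t* base,
                             size_t cur, size_t mLength)
{
    hashTable[hash4(base + cur)] = (uint32_t)cur;
    hashTable[hash4(base + cur + mLength - 2)] = (uint32_t)(cur + mLength - 2);
}
```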
@Cyan4973 Here are the numbers for html16K at negative compression levels with a 32K dictionary (this is for the code as-is, not the modification I mentioned in my previous comment):
This is an 8-9% ratio win at levels -3 and -5. html16K is more representative of real-world data since it has a more realistic compression ratio (3-4x at level 1 depending on the dictionary size IIRC), so I would give this win more weight than the github loss. Based on these numbers my vote would be to land the current version, but I understand if you prefer to focus on mitigating the github ratio regression instead (by deleting that extra hash table write). Just let me know which approach you prefer! |
This is an 8-9% ratio win at levels -3 and -5. html16K is more representative of real-world data since it has a more realistic compression ratio (3-4x at level 1, depending on the dictionary size, IIRC), so I would give this win more weight than the github loss. Based on these numbers, my vote would be to land the current version, but I understand if you prefer to focus on mitigating the github ratio regression instead (by deleting that extra hash table write). Just let me know which approach you prefer!
OK, that's fine. I think we are ready to accept that [...]. There is a remaining issue in a FreeBSD test, but it seems unrelated to this PR.
Looks good to ship to me!
Thanks @felixhandte @Cyan4973 for the detailed reviews! I will fix the FreeBSD infra failure sometime next week. That job passed on all commits before I updated [...].
Nice!
Summary
Using similar techniques as #2749 and #3086, this PR improves level-1 compression speed by 4-9% on a dataset of HTML headers between 8KB and 16KB (the dataset has ratio 3.03 without a dictionary, 4.92 with a 110KB dictionary). Dictionary compression speed improvements hold across gcc/clang and a variety of dictionary temperatures and sizes. Additionally, compression speed for level-1 streaming compression (i.e. `zstd --single-thread silesia.tar`) is improved by 4-5%. Compression ratio is also slightly improved; for example, the compressed size of silesia.tar is reduced by 0.1%.
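The "software pipeline" in the title refers to restructuring the match-finder loop so the hash and table load for the next position are issued before the current candidate is consumed, overlapping memory latency with useful work. Below is a minimal sketch of that loop shape under simplifying assumptions (a zero-initialized table of `1 << TABLE_LOG` entries, counting 4-byte matches instead of emitting sequences); `countMatches` and the other names are invented, and this is not the actual `ZSTD_compressBlock_fast_extDict` code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_LOG 15  /* table has (1 << TABLE_LOG) uint32_t entries */

static uint32_t hashPos(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - TABLE_LOG);
}

/* Pipelined loop shape: the table load for position ip+1 is issued while
 * the candidate for position ip is still being checked, so the two loads
 * overlap in flight instead of serializing. Note the next load is issued
 * before the current write -- a real implementation must tolerate that
 * stale-read hazard. table must be zero-initialized. */
static size_t countMatches(const uint8_t* src, size_t srcSize, uint32_t* table)
{
    size_t matches = 0, ip = 0;
    if (srcSize < 8) return 0;
    uint32_t h = hashPos(src);                   /* prime stage 0 up front */
    uint32_t cand = table[h];
    while (ip + 8 <= srcSize) {
        uint32_t hNext = hashPos(src + ip + 1);  /* next hash, issued early */
        uint32_t candNext = table[hNext];        /* next load already in flight */
        table[h] = (uint32_t)ip;                 /* record current position */
        if (cand != ip && memcmp(src + cand, src + ip, 4) == 0)
            matches++;                           /* consume current candidate */
        h = hNext;
        cand = candNext;                         /* rotate the pipeline */
        ip += 1;
    }
    return matches;
}
```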
Preliminary results
I have added final measurements below; the preliminary measurements are still here if you are interested:
Dictionary compression
I used the experimental setup described in #3086 (see the section titled "Final Results"). On the 4KB dictionary, speed improvements are 4-5% across all temperatures and compilers. On the larger dictionaries (more realistic), speed improvements are 6-9% across all temperatures and compilers.
Dictionary compression speed improvements at level 1 (preliminary)
There were small improvements in compression ratio at all dictionary sizes. The compressed size of the full html16K dataset was reduced as follows: 0.18% with the 110KB dictionary, 0.22% with the 32KB dictionary, and 0.52% with the 4KB dictionary.
Streaming compression
Compression speed was measured via `time ./zstd_bin -1 --single-thread silesia.tar`. I timed each binary a few times interleaved and took the best time for each one. I tested the same compilers as for dictionary compression (gcc11 and clang12) and used the same setup (same CPU, core isolation, turbo disabled). The compressed size of silesia.tar is reduced by 0.17%.
Final results
Dictionary compression
I used the experimental setup described in #3086 (see the section titled "Final Results"). The final dictionary perf numbers are very close to the preliminary numbers, except the clang win is about 1% higher now in most scenarios; gcc numbers are almost identical.
My follow-up commits on this PR did not touch ratio, so that is unchanged. See the Preliminary Results section for a discussion of the ratio impacts.
Dictionary compression speed improvements at level 1 (final)
Streaming compression
User time numbers are for `zstd <LEVEL> --single-thread /mnt/ramdisk/silesia.tar -o /dev/null`.
I used core isolation and disabled turbo. Benchmarking method was roughly “run many many times interleaved, take the lowest number that I’ve seen at least twice for each binary”. Speed variations were around 3% due to background load on the server.
Everything is speed AND ratio neutral/positive except fast level 3, which trades a 2.7% speed regression for a 2.4% ratio improvement (a very good tradeoff IMO). Also worth noting: fast level 2 has a significant ratio improvement at no cost in speed, and fast level 1 has a significant speed improvement but no meaningful change in ratio.
What didn't work
Future work
Compression speed could be optimized for cold dictionaries by prefetching the dictionary content starting from dictEnd. The hash table is warm because this is extDict, but in a cold-dictionary situation the dictionary content is still cold. I did not pursue this optimization because my code does not have cold-dictionary gating. I am not sure how prevalent cold extDict compression is; depending on its prevalence, this might be worth pursuing.
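A sketch of what that could look like, assuming GCC/Clang's `__builtin_prefetch` and a caller-chosen byte budget; `prefetchDictTail` is an invented name and this is not code from the PR:

```c
#include <stddef.h>

#define CACHELINE 64

/* Walk backward from dictEnd issuing prefetches, on the theory that extDict
 * searches tend to hit the tail of the dictionary first. maxBytes caps the
 * amount prefetched; a real version would also be gated on a cold-dictionary
 * heuristic so warm callers don't pay for the extra instructions. */
static void prefetchDictTail(const char* dictStart, const char* dictEnd,
                             size_t maxBytes)
{
    const char* limit = ((size_t)(dictEnd - dictStart) > maxBytes)
                            ? dictEnd - maxBytes
                            : dictStart;
    const char* p = dictEnd;
    while (p > limit) {
        p = ((size_t)(p - limit) > CACHELINE) ? p - CACHELINE : limit;
        __builtin_prefetch(p, /* rw= */ 0, /* locality= */ 1);
    }
}
```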