[huf] Add generic C versions of the fast decoding loops #3449

terrelln · 2023-01-24T01:05:16Z

Add generic C versions of the fast decoding loops to serve architectures that don't have an assembly implementation. Also allow selecting the C decoding loop over the assembly decoding loop through a zstd decompression parameter ZSTD_d_disableHuffmanAssembly.

I benchmarked on my Intel i9-9900K and my MacBook Air with an M1 processor. The benchmark command forces zstd to compress without any matches, using only literals compression, so it measures only Huffman decompression speed:

zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar

The new fast C decoding loops outperform the previous default implementation in every configuration measured, but they don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer from the same stability problems that we've seen in the past, which the assembly version does not have. So even though clang gets close to the assembly on x86-64, the C loops still have stability issues.

| Arch    | Function       | Compiler     | Default (MB/s) | Assembly (MB/s) | Fast C (MB/s) |
|---------|----------------|--------------|----------------|-----------------|---------------|
| x86-64  | decompress 4X1 | gcc-12.2.0   | 1029.6         | 1308.1          | 1208.1        |
| x86-64  | decompress 4X1 | clang-14.0.6 | 1019.3         | 1305.6          | 1276.3        |
| x86-64  | decompress 4X2 | gcc-12.2.0   | 1348.5         | 1657.0          | 1374.1        |
| x86-64  | decompress 4X2 | clang-14.0.6 | 1027.6         | 1659.9          | 1468.1        |
| aarch64 | decompress 4X1 | clang-12.0.5 | 1081.0         | N/A             | 1234.9        |
| aarch64 | decompress 4X2 | clang-12.0.5 | 1270.0         | N/A             | 1516.6        |
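
To make the relative gains in the table easier to read, here is a minimal sketch that turns the measured MB/s figures into percentages. The struct, the array, and the helper names (`HufBenchRow`, `kRows`, `percent_change`, `print_summary`) are hypothetical and exist only for this illustration; they are not part of zstd. The numbers are copied from the table, with a negative value standing in for N/A.

```c
#include <stdio.h>

/* One row of the benchmark table above. Speeds are in MB/s.
 * A negative asm_mbps marks "N/A" (no assembly implementation). */
typedef struct {
    const char *arch;
    const char *function;
    const char *compiler;
    double default_mbps;
    double asm_mbps;
    double fast_c_mbps;
} HufBenchRow;

/* Figures copied verbatim from the table in the PR description. */
static const HufBenchRow kRows[] = {
    { "x86-64",  "4X1", "gcc-12.2.0",   1029.6, 1308.1, 1208.1 },
    { "x86-64",  "4X1", "clang-14.0.6", 1019.3, 1305.6, 1276.3 },
    { "x86-64",  "4X2", "gcc-12.2.0",   1348.5, 1657.0, 1374.1 },
    { "x86-64",  "4X2", "clang-14.0.6", 1027.6, 1659.9, 1468.1 },
    { "aarch64", "4X1", "clang-12.0.5", 1081.0,   -1.0, 1234.9 },
    { "aarch64", "4X2", "clang-12.0.5", 1270.0,   -1.0, 1516.6 },
};

/* Percent change of `candidate` relative to `baseline`. */
static double percent_change(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Print, for each row, how the fast C loop compares with the default
 * loop and, where an assembly figure exists, with the assembly loop. */
static void print_summary(void)
{
    size_t i;
    for (i = 0; i < sizeof(kRows) / sizeof(kRows[0]); i++) {
        const HufBenchRow *r = &kRows[i];
        printf("%-8s %s %-13s fast C vs default: %+.1f%%",
               r->arch, r->function, r->compiler,
               percent_change(r->default_mbps, r->fast_c_mbps));
        if (r->asm_mbps > 0.0) {
            printf(", fast C vs assembly: %+.1f%%",
                   percent_change(r->asm_mbps, r->fast_c_mbps));
        }
        printf("\n");
    }
}
```

Calling `print_summary()` from a `main` function prints one line per table row. A positive percentage means the fast C loop is faster than the column it is compared against, and a negative one means it is slower.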


Before calling a dictionary good, make sure that it can compress an input. If v0.7.3 rejects v0.7.3's dictionary, fall back to the v1.0 dictionary. It is not the version test's job to test this, because we cannot fix this code.

facebook-github-bot added the CLA Signed label Jan 24, 2023

terrelln force-pushed the 2023-01-13-fast-huffman-c branch 3 times, most recently from 46bbeff to 57732f7 Compare January 24, 2023 01:52

terrelln added 2 commits January 23, 2023 18:01

Cyan4973 approved these changes Jan 25, 2023


terrelln merged commit 321490c into facebook:dev Jan 25, 2023

This was referenced Jan 25, 2023

Huffman assembly is slower than no asm on Zen 2 #3278

Open

Add fast huf_dec with generic C and tuned aarch64 assembly #3155

Closed

Process five symbols per stream per iteration on AArch64. #3299

Closed

Cyan4973 mentioned this pull request Feb 9, 2023

release v1.5.4 #3487

Merged

embg mentioned this pull request Feb 10, 2023

Fix all MSVC warnings #3495

Merged

iksaif mentioned this pull request Sep 18, 2023

Possible performance regressions on some CPUs after #3449 (C fast loops) #3762

Closed
