Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[huf] Add generic C versions of the fast decoding loops #3449

Merged
merged 2 commits into from
Jan 25, 2023

Conversation

terrelln
Copy link
Contributor

@terrelln terrelln commented Jan 24, 2023

Add generic C versions of the fast decoding loops to serve architectures that don't have an assembly implementation. Also allow selecting the C decoding loop over the assembly decoding loop through a zstd decompression parameter ZSTD_d_disableHuffmanAssembly.

I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor. The benchmark command forces zstd to compress without any matches, using only literals compression, and measures only Huffman decompression speed:

zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar

The new fast decoding loops outperform the previous implementation uniformly, but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer from the same stability problems that we've seen in the past, where the assembly version doesn't. So even though clang gets close to assembly on x86-64, it still has stability issues.

Arch Function Compiler Default (MB/s) Assembly (MB/s) Fast C (MB/s)
x86-64 decompress 4X1 gcc-12.2.0 1029.6 1308.1 1208.1
x86-64 decompress 4X1 clang-14.0.6 1019.3 1305.6 1276.3
x86-64 decompress 4X2 gcc-12.2.0 1348.5 1657.0 1374.1
x86-64 decompress 4X2 clang-14.0.6 1027.6 1659.9 1468.1
aarch64 decompress 4X1 clang-12.0.5 1081.0 N/A 1234.9
aarch64 decompress 4X2 clang-12.0.5 1270.0 N/A 1516.6

@terrelln terrelln force-pushed the 2023-01-13-fast-huffman-c branch 3 times, most recently from 46bbeff to 57732f7 Compare January 24, 2023 01:52
Add generic C versions of the fast decoding loops to serve architectures
that don't have an assembly implementation. Also allow selecting the C
decoding loop over the assembly decoding loop through a zstd
decompression parameter `ZSTD_d_disableHuffmanAssembly`.

I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor.
The benchmark command forces zstd to compress without any matches, using
only literals compression, and measures only Huffman decompression speed:

```
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```

The new fast decoding loops outperform the previous implementation uniformly,
but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer
from the same stability problems that we've seen in the past, where the assembly
version doesn't. So even though clang gets close to assembly on x86-64, it still
has stability issues.

| Arch    | Function       | Compiler     | Default (MB/s) | Assembly (MB/s) | Fast (MB/s) |
|---------|----------------|--------------|----------------|-----------------|-------------|
| x86-64  | decompress 4X1 | gcc-12.2.0   |         1029.6 |          1308.1 |      1208.1 |
| x86-64  | decompress 4X1 | clang-14.0.6 |         1019.3 |          1305.6 |      1276.3 |
| x86-64  | decompress 4X2 | gcc-12.2.0   |         1348.5 |          1657.0 |      1374.1 |
| x86-64  | decompress 4X2 | clang-14.0.6 |         1027.6 |          1659.9 |      1468.1 |
| aarch64 | decompress 4X1 | clang-12.0.5 |         1081.0 |             N/A |      1234.9 |
| aarch64 | decompress 4X2 | clang-12.0.5 |         1270.0 |             N/A |      1516.6 |
Before calling a dictionary good, make sure that it can compress an
input. If v0.7.3 rejects v0.7.3's dictionary, fall back to the v1.0
dictionary. This is not the job of the verison test to test it, because
we cannot fix this code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants