
Assembler implementation for arm64 #51

Merged: cespare merged 1 commit into cespare:main from greatroar:asm-arm64 on Dec 4, 2022

Conversation

@greatroar (Contributor)

Fixes #45, but I haven't yet benchmarked this on actual ARM hardware. On amd64 with qemu, I get the following results after cherry-picking #50 (I have a version that includes a variant of #42 as well):

name                 old speed      new speed      delta
Sum64/4B-8            104MB/s ± 1%   142MB/s ± 1%  +36.00%  (p=0.000 n=10+10)
Sum64/100B-8         1.51GB/s ± 2%  1.97GB/s ± 1%  +30.42%  (p=0.000 n=10+10)
Sum64/4KB-8          4.66GB/s ± 1%  5.48GB/s ± 1%  +17.56%  (p=0.000 n=9+9)
Sum64/10MB-8         4.79GB/s ± 0%  5.45GB/s ± 0%  +13.78%  (p=0.000 n=10+9)
Sum64String/4B-8      108MB/s ± 1%   149MB/s ± 1%  +38.39%  (p=0.000 n=10+8)
Sum64String/100B-8   1.59GB/s ± 2%  2.01GB/s ± 1%  +25.84%  (p=0.000 n=9+9)
Sum64String/4KB-8    4.67GB/s ± 1%  5.54GB/s ± 1%  +18.76%  (p=0.000 n=9+10)
Sum64String/10MB-8   4.78GB/s ± 1%  5.46GB/s ± 0%  +14.08%  (p=0.000 n=9+10)
DigestBytes/4B-8     37.4MB/s ± 1%  37.7MB/s ± 1%     ~     (p=0.093 n=10+10)
DigestBytes/100B-8    634MB/s ± 0%   657MB/s ± 0%   +3.57%  (p=0.000 n=10+9)
DigestBytes/4KB-8    4.33GB/s ± 0%  4.87GB/s ± 1%  +12.50%  (p=0.000 n=10+9)
DigestBytes/10MB-8   4.90GB/s ± 0%  5.47GB/s ± 1%  +11.54%  (p=0.000 n=10+10)
DigestString/4B-8    31.5MB/s ± 1%  31.6MB/s ± 1%     ~     (p=0.447 n=10+10)
DigestString/100B-8   561MB/s ± 1%   576MB/s ± 1%   +2.67%  (p=0.000 n=10+9)
DigestString/4KB-8   4.22GB/s ± 1%  4.72GB/s ± 0%  +11.77%  (p=0.000 n=9+9)
DigestString/10MB-8  4.90GB/s ± 0%  5.47GB/s ± 0%  +11.56%  (p=0.000 n=9+9)

I've yet to try whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there's a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.
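For illustration (not code from this PR), the pure-Go path relies on the same assumption: the compiler recognizes the pattern below and, on arm64, emits a plain 8-byte load even though the address may be unaligned.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	b := []byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	// An 8-byte read at offset 1 is in general unaligned; Go compiles
	// this to a single load on arm64 rather than assembling it bytewise,
	// so it too assumes the alignment-check flag is clear.
	x := binary.LittleEndian.Uint64(b[1:9])
	fmt.Printf("%#x\n", x)
}
```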

@cespare (Owner) commented Nov 20, 2020

Awesome! It might take me a while to find the time to review this because I'm not too familiar with arm64 yet.

I haven't yet benchmarked this on actual ARM hardware.

I don't have physical access to hardware, but I can test on graviton2 instances in EC2 later if that helps.

@greatroar force-pushed the asm-arm64 branch 2 times, most recently from 8145830 to f6daa84 on August 23, 2021
@cespare (Owner) commented Sep 9, 2021

@greatroar sorry for the delay. I forgot about this for a while. I started taking a look but it will take me a bit to get up to speed on arm assembly. I'm pretty swamped with work atm so it may be a bit before I finish reviewing. Feel free to ping me if you haven't heard back in a couple of weeks.

@greatroar (Contributor, Author)

Rebased to fix the merge commit. I've also finally found time to run this on actual hardware, a Raspberry Pi 4B. Benchmark results:

name                 old speed      new speed      delta
Sum64/4B-4            180MB/s ± 0%   251MB/s ± 0%  +39.13%  (p=0.000 n=10+10)
Sum64/100B-4          994MB/s ± 0%  1135MB/s ± 0%  +14.25%  (p=0.000 n=10+9)
Sum64/4KB-4          1.92GB/s ± 0%  1.93GB/s ± 0%   +0.43%  (p=0.000 n=10+10)
Sum64/10MB-4         1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.754 n=10+10)
Sum64String/4B-4      133MB/s ± 4%   228MB/s ± 0%  +71.37%  (p=0.000 n=10+9)
Sum64String/100B-4    949MB/s ± 0%  1103MB/s ± 0%  +16.17%  (p=0.000 n=10+10)
Sum64String/4KB-4    1.92GB/s ± 0%  1.93GB/s ± 0%   +0.40%  (p=0.000 n=9+8)
Sum64String/10MB-4   1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.146 n=10+8)
DigestBytes/4B-4     61.9MB/s ± 0%  61.9MB/s ± 0%     ~     (p=0.158 n=10+9)
DigestBytes/100B-4    695MB/s ± 0%   719MB/s ± 0%   +3.37%  (p=0.000 n=10+10)
DigestBytes/4KB-4    1.89GB/s ± 0%  1.90GB/s ± 0%   +0.43%  (p=0.000 n=9+10)
DigestBytes/10MB-4   1.88GB/s ± 0%  1.89GB/s ± 0%   +0.92%  (p=0.000 n=10+9)
DigestString/4B-4    58.9MB/s ± 0%  58.5MB/s ± 1%   -0.60%  (p=0.032 n=8+10)
DigestString/100B-4   669MB/s ± 0%   696MB/s ± 1%   +4.05%  (p=0.000 n=10+10)
DigestString/4KB-4   1.89GB/s ± 0%  1.89GB/s ± 0%   +0.34%  (p=0.000 n=10+10)
DigestString/10MB-4  1.88GB/s ± 0%  1.89GB/s ± 0%   +0.90%  (p=0.000 n=10+10)

I'll admit this is disappointing, but there is a speedup for small inputs. I find it hard to explain why the speedup stops at 4KiB. This CPU has a 32KiB L1 cache, so it shouldn't be running at L2 speed for that input size.

@cespare (Owner) commented Oct 30, 2021

Hey @greatroar, sorry I still haven't had time to review the code. It's still on my radar.

I did have a few minutes to spin up an EC2 instance to try the benchmarks on a beefier arm64 chip (I used a c6g.metal instance). The results are similar to your Raspberry Pi results but even more pronounced:

name                  old speed      new speed      delta
Sum64/4B-64            435MB/s ± 1%   680MB/s ± 0%  +56.47%  (p=0.000 n=10+10)
Sum64/100B-64         2.01GB/s ± 0%  2.07GB/s ± 0%   +3.17%  (p=0.000 n=9+10)
Sum64/4KB-64          3.28GB/s ± 0%  3.28GB/s ± 0%   -0.03%  (p=0.000 n=8+8)
Sum64/10MB-64         3.33GB/s ± 0%  3.33GB/s ± 0%   -0.01%  (p=0.000 n=9+9)
Sum64String/4B-64      373MB/s ± 0%   501MB/s ± 2%  +34.29%  (p=0.000 n=8+10)
Sum64String/100B-64   1.99GB/s ± 0%  2.08GB/s ± 0%   +4.89%  (p=0.000 n=9+8)
Sum64String/4KB-64    3.28GB/s ± 0%  3.28GB/s ± 0%   +0.11%  (p=0.000 n=10+8)
Sum64String/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%   -0.01%  (p=0.006 n=10+9)
DigestBytes/4B-64      179MB/s ± 0%   179MB/s ± 0%   -0.07%  (p=0.001 n=7+7)
DigestBytes/100B-64   1.63GB/s ± 0%  1.64GB/s ± 0%   +0.53%  (p=0.000 n=9+8)
DigestBytes/4KB-64    3.25GB/s ± 0%  3.25GB/s ± 0%     ~     (p=0.138 n=10+10)
DigestBytes/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.051 n=10+8)
DigestString/4B-64     139MB/s ± 3%   137MB/s ± 2%     ~     (p=0.497 n=10+9)
DigestString/100B-64  1.52GB/s ± 1%  1.53GB/s ± 1%   +1.03%  (p=0.001 n=10+10)
DigestString/4KB-64   3.24GB/s ± 0%  3.23GB/s ± 0%   -0.07%  (p=0.000 n=9+10)
DigestString/10MB-64  3.33GB/s ± 0%  3.33GB/s ± 0%   -0.00%  (p=0.006 n=9+10)

It's suspicious that the speed is so precisely similar for larger block sizes. It seems we must be hitting a hard bottleneck of some kind, or else the generated code on the main path must be almost the same as the handwritten code.

@greatroar (Contributor, Author)

The generated code is very similar, the main difference being the use of LDP/STP to transfer 16-byte blocks in the handwritten code. That doesn't seem to buy much. The reason the handwritten code is faster on smaller inputs may be that it does everything in registers and loads the primes faster, while the compiled code spills some values to the stack.
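As a point of reference, here's roughly what that paired-load pattern looks like in Go's arm64 assembler syntax; the register aliases (p for the input pointer, x1..x4 for the lanes) are illustrative, not necessarily the names this PR uses:

```
// Load one 32-byte block as four 64-bit lanes with two paired loads,
// assuming #define aliases such as p=R0, x1=R1, x2=R2, x3=R3, x4=R4.
LDP.P 32(p), (x1, x2)  // x1, x2 = b[0:8], b[8:16]; then p += 32
LDP   -16(p), (x3, x4) // x3, x4 = b[16:24], b[24:32]
```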

@Imanmahmoodi

😃

@lizthegrey

Please let me know if you need help testing this; I'm always happy to volunteer to make things run faster on arm64.

klauspost pushed a commit to klauspost/compress that referenced this pull request Jan 9, 2022
see cespare/xxhash#51

```
benchmark                                                    old ns/op      new ns/op      delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               14740529       14093294       -4.39%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           3003737        3008035        +0.14%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            70052218       70885931       +1.19%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              33424884       34714144       +3.86%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            9694846        9563735        -1.35%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             12322865       12782681       +3.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                7290319        7134523        -2.14%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          825251         816683         -1.04%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          462903         486823         +5.17%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                1333159120     1114784390     -16.38%
BenchmarkDecoder_DecoderSmall/html.zst-16                    3117950        3095118        -0.73%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           272085         271508         -0.21%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  1411294        1407632        -0.26%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              370870         367499         -0.91%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               4721330        4718339        -0.06%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 3517766        3487756        -0.85%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               1186672        1180367        -0.53%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                1498383        1502922        +0.30%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   748113         742537         -0.75%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             99132          98206          -0.93%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             53805          53209          -1.11%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   4041818        4028347        -0.33%
BenchmarkDecoder_DecodeAll/html.zst-16                       387794         383309         -1.16%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              34390          34296          -0.27%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          89510          88785          -0.81%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      23315          23128          -0.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       306437         325176         +6.12%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         226433         222589         -1.70%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       74612          74182          -0.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        95066          94304          -0.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           47498          46946          -1.16%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     6291           6237           -0.86%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     3498           3453           -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           295850         318916         +7.80%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               24340          24196          -0.59%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      2220           2199           -0.95%

benchmark                                                    old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               100.03       104.63       1.05x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           315.84       315.39       1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            55.03        54.38        0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              102.14       98.35        0.96x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            103.30       104.71       1.01x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             98.74        95.18        0.96x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                449.47       459.29       1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          992.67       1003.08      1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          2127.32      2022.80      0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                4.21         5.04         1.20x
BenchmarkDecoder_DecoderSmall/html.zst-16                    262.74       264.67       1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           119.84       120.10       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  130.60       130.94       1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              319.76       322.69       1.01x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               102.06       102.13       1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 121.31       122.36       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               105.49       106.05       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                101.50       101.20       1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   547.51       551.62       1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             1032.97      1042.70      1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             2287.77      2313.39      1.01x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   173.71       174.29       1.00x
BenchmarkDecoder_DecodeAll/html.zst-16                       264.06       267.15       1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              118.52       118.85       1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          2059.22      2076.03      1.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      5086.37      5127.46      1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1572.47      1481.85      0.94x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1884.68      1917.22      1.02x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       1677.73      1687.46      1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        1599.82      1612.75      1.01x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           8623.46      8724.94      1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     16276.14     16418.01     1.01x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     35194.18     35648.95     1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2373.12      2201.48      0.93x
BenchmarkDecoder_DecodeAllParallel/html.zst-16               4207.06      4232.02      1.01x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      1835.78      1853.29      1.01x

benchmark                                                    old allocs     new allocs     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               1              3              +200.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           1              1              +0.00%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            4              5              +25.00%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              1              1              +0.00%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            1              1              +0.00%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             1              1              +0.00%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                1              1              +0.00%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                52             50             -3.85%
BenchmarkDecoder_DecoderSmall/html.zst-16                    1              1              +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           1              1              +0.00%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0              0              +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 0              0              +0.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                0              0              +0.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0              0              +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0              0              +0.00%

benchmark                                                    old bytes     new bytes     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               89288         113942        +27.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           10888         12311         +13.07%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            981385        1492955       +52.13%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              385552        367000        -4.81%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            30752         30503         -0.81%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             108837        56868         -47.75%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                75923         97336         +28.20%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          48            48            +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1964          1909          -2.80%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                25456880      25432640      -0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16                    48            48            +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           51            48            -5.88%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0             0             +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              5             0             -100.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               2             0             -100.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 26            0             -100.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0             0             +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                10            0             -100.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0             0             +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             1             0             -100.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0             0             +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0             0             +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          234           231           -1.28%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      38            38            +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1991          2287          +14.87%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1501          1298          -13.52%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       132           130           -1.52%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        214           220           +2.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           259           258           -0.39%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     9             9             +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     6             6             +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2868          3358          +17.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               34            34            +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0             0             +0.00%
```
@cespare (Owner) commented Jan 10, 2022

@lizthegrey the problem is more that the testing has shown that the improvement is marginal.

@lizthegrey commented Jan 10, 2022

We see a 7% improvement in time spent in xxhash from adopting this within the @klauspost zstd library. https://share.getcloudapp.com/GGuEkRnR

IMO this is worth merging.

@cespare (Owner) commented Jan 11, 2022

@lizthegrey it would be helpful to know how the results you've screenshotted correspond to the microbenchmarks we've been discussing above. Do the zstd benchmarks run xxhash over mostly very short strings, for instance?

I took another look at the benchmarks using Go 1.18beta1. (I also added a 16-byte comparison on main.) Here are the results, again using a c6g.metal EC2 instance:

name                  old speed      new speed      delta
Sum64/4B-64            469MB/s ± 0%   672MB/s ± 0%  +43.24%  (p=0.000 n=8+10)
Sum64/16B-64          1.12GB/s ± 0%  1.61GB/s ± 0%  +43.24%  (p=0.000 n=9+10)
Sum64/100B-64         2.04GB/s ± 0%  2.07GB/s ± 0%   +1.67%  (p=0.000 n=10+10)
Sum64/4KB-64          3.28GB/s ± 0%  3.28GB/s ± 0%   -0.09%  (p=0.000 n=10+10)
Sum64/10MB-64         3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.094 n=10+9)
Sum64String/4B-64      441MB/s ± 0%   499MB/s ± 0%  +13.33%  (p=0.000 n=10+7)
Sum64String/16B-64    1.08GB/s ± 1%  1.43GB/s ± 0%  +32.17%  (p=0.000 n=10+10)
Sum64String/100B-64   2.02GB/s ± 0%  2.08GB/s ± 0%   +3.12%  (p=0.000 n=9+10)
Sum64String/4KB-64    3.28GB/s ± 0%  3.28GB/s ± 0%   +0.00%  (p=0.028 n=9+9)
Sum64String/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.148 n=10+10)
DigestBytes/4B-64      207MB/s ± 1%   206MB/s ± 0%   -0.42%  (p=0.000 n=10+9)
DigestBytes/16B-64     669MB/s ± 0%   666MB/s ± 0%   -0.52%  (p=0.000 n=10+8)
DigestBytes/100B-64   1.85GB/s ± 0%  1.80GB/s ± 0%   -2.84%  (p=0.000 n=10+10)
DigestBytes/4KB-64    3.27GB/s ± 0%  3.26GB/s ± 0%   -0.09%  (p=0.000 n=10+10)
DigestBytes/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.403 n=10+10)
DigestString/4B-64     200MB/s ± 0%   199MB/s ± 0%   -0.24%  (p=0.000 n=8+10)
DigestString/16B-64    645MB/s ± 0%   646MB/s ± 0%     ~     (p=0.065 n=9+10)
DigestString/100B-64  1.82GB/s ± 0%  1.71GB/s ± 1%   -6.20%  (p=0.000 n=10+10)
DigestString/4KB-64   3.27GB/s ± 0%  3.26GB/s ± 0%   -0.30%  (p=0.000 n=8+10)
DigestString/10MB-64  3.33GB/s ± 0%  3.33GB/s ± 0%   +0.01%  (p=0.002 n=10+9)

We still see that the asm is significantly faster only for small strings (4 and 16 bytes) when using the one-shot hashing functions, and the advantage has narrowed a little due to compiler improvements for arm64 in Go 1.18. There is also a measurable slowdown from the asm in a couple of the benchmarks.

For context, here is a comparison on amd64 (using a c5.metal instance) between purego and asm:

name                  old speed      new speed       delta
Sum64/4B-96            882MB/s ± 1%   1065MB/s ± 1%  +20.74%  (p=0.000 n=9+9)
Sum64/16B-96          2.41GB/s ± 0%   3.01GB/s ± 2%  +25.25%  (p=0.000 n=10+10)
Sum64/100B-96         5.66GB/s ± 1%   6.60GB/s ± 1%  +16.58%  (p=0.000 n=8+9)
Sum64/4KB-96          10.0GB/s ± 1%   14.5GB/s ± 0%  +45.92%  (p=0.000 n=10+8)
Sum64/10MB-96         10.2GB/s ± 1%   15.1GB/s ± 0%  +47.33%  (p=0.000 n=9+9)
Sum64String/4B-96      838MB/s ± 0%    902MB/s ± 0%   +7.64%  (p=0.000 n=8+9)
Sum64String/16B-96    2.32GB/s ± 1%   2.70GB/s ± 1%  +16.43%  (p=0.000 n=9+10)
Sum64String/100B-96   5.59GB/s ± 0%   6.21GB/s ± 3%  +11.24%  (p=0.000 n=8+10)
Sum64String/4KB-96    10.0GB/s ± 0%   14.5GB/s ± 0%  +45.67%  (p=0.000 n=7+10)
Sum64String/10MB-96   10.3GB/s ± 0%   15.1GB/s ± 1%  +47.09%  (p=0.000 n=9+9)
DigestBytes/4B-96      391MB/s ± 1%    393MB/s ± 0%   +0.51%  (p=0.004 n=9+9)
DigestBytes/16B-96    1.29GB/s ± 1%   1.29GB/s ± 1%     ~     (p=0.780 n=9+10)
DigestBytes/100B-96   3.94GB/s ± 1%   4.37GB/s ± 2%  +10.83%  (p=0.000 n=8+10)
DigestBytes/4KB-96    9.79GB/s ± 1%  14.08GB/s ± 0%  +43.82%  (p=0.000 n=9+9)
DigestBytes/10MB-96   10.3GB/s ± 0%   15.1GB/s ± 1%  +47.08%  (p=0.000 n=10+10)
DigestString/4B-96     382MB/s ± 1%    385MB/s ± 1%   +0.82%  (p=0.008 n=8+9)
DigestString/16B-96   1.26GB/s ± 2%   1.27GB/s ± 1%     ~     (p=0.079 n=9+10)
DigestString/100B-96  3.92GB/s ± 2%   4.14GB/s ± 1%   +5.53%  (p=0.000 n=10+10)
DigestString/4KB-96   9.79GB/s ± 1%  14.00GB/s ± 0%  +43.04%  (p=0.000 n=8+9)
DigestString/10MB-96  10.3GB/s ± 1%   15.1GB/s ± 1%  +47.40%  (p=0.000 n=9+9)

You can see that the assembly version offers a clear benefit over pure Go across the board, with the largest improvement going to the biggest inputs.


I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash function itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure-Go with each passing release given the large improvements to arm64 that keep happening in the compiler.

That said, I would like to sit down and understand the implementation and see if there are ways to improve it, particularly the handling of full blocks. I'll try to make some time for that soon.

@AWSjswinney

I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash function itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure-Go with each passing release given the large improvements to arm64 that keep happening in the compiler.

This is a valid concern. Let me add another data point. Graviton2 and the Raspberry Pi 4 may not see significant improvement because they are most likely bottlenecking on the floating-point/SIMD pipelines. Other CPUs, such as Graviton3, will see improvement from this change.

I tested this code as written on a c7g.2xlarge with Go 1.17.

name                 old time/op    new time/op    delta
Sum64/4B-8             5.13ns ± 0%    3.09ns ± 0%  -39.82%  (p=0.000 n=10+10)
Sum64/100B-8           20.8ns ± 0%    12.4ns ± 0%  -40.43%  (p=0.000 n=10+9)
Sum64/4KB-8             351ns ± 0%     248ns ± 0%  -29.39%  (p=0.000 n=10+10)
Sum64/10MB-8            850µs ± 0%     602µs ± 0%  -29.22%  (p=0.000 n=10+10)
Sum64String/4B-8       5.39ns ± 0%    3.61ns ± 0%  -33.00%  (p=0.000 n=10+10)
Sum64String/100B-8     21.2ns ± 0%    13.1ns ± 0%  -38.17%  (p=0.000 n=10+10)
Sum64String/4KB-8       351ns ± 0%     248ns ± 0%  -29.22%  (p=0.000 n=10+10)
Sum64String/10MB-8      850µs ± 0%     603µs ± 0%  -29.10%  (p=0.000 n=10+10)
DigestBytes/4B-8       12.6ns ± 0%    12.6ns ± 0%   +0.35%  (p=0.000 n=10+10)
DigestBytes/100B-8     27.4ns ± 0%    24.6ns ± 0%  -10.00%  (p=0.000 n=9+10)
DigestBytes/4KB-8       355ns ± 0%     258ns ± 0%  -27.48%  (p=0.000 n=10+10)
DigestBytes/10MB-8      850µs ± 0%     602µs ± 0%  -29.18%  (p=0.000 n=10+9)
DigestString/4B-8      15.9ns ± 0%    15.9ns ± 0%   +0.10%  (p=0.041 n=10+10)
DigestString/100B-8    27.8ns ± 0%    25.9ns ± 0%   -6.95%  (p=0.000 n=10+9)
DigestString/4KB-8      356ns ± 0%     259ns ± 0%  -27.15%  (p=0.000 n=9+10)
DigestString/10MB-8     849µs ± 0%     602µs ± 0%  -29.03%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
Sum64/4B-8            779MB/s ± 0%  1295MB/s ± 0%  +66.18%  (p=0.000 n=10+10)
Sum64/100B-8         4.80GB/s ± 0%  8.06GB/s ± 0%  +67.86%  (p=0.000 n=10+9)
Sum64/4KB-8          11.4GB/s ± 0%  16.2GB/s ± 0%  +41.62%  (p=0.000 n=9+10)
Sum64/10MB-8         11.8GB/s ± 0%  16.6GB/s ± 0%  +41.29%  (p=0.000 n=10+10)
Sum64String/4B-8      742MB/s ± 0%  1108MB/s ± 0%  +49.26%  (p=0.000 n=10+10)
Sum64String/100B-8   4.73GB/s ± 0%  7.64GB/s ± 0%  +61.72%  (p=0.000 n=9+10)
Sum64String/4KB-8    11.4GB/s ± 0%  16.1GB/s ± 0%  +41.27%  (p=0.000 n=9+10)
Sum64String/10MB-8   11.8GB/s ± 0%  16.6GB/s ± 0%  +41.04%  (p=0.000 n=10+10)
DigestBytes/4B-8      319MB/s ± 0%   318MB/s ± 0%   -0.35%  (p=0.000 n=10+10)
DigestBytes/100B-8   3.65GB/s ± 0%  4.06GB/s ± 0%  +11.11%  (p=0.000 n=9+10)
DigestBytes/4KB-8    11.3GB/s ± 0%  15.5GB/s ± 0%  +37.90%  (p=0.000 n=10+10)
DigestBytes/10MB-8   11.8GB/s ± 0%  16.6GB/s ± 0%  +41.20%  (p=0.000 n=10+9)
DigestString/4B-8     252MB/s ± 0%   252MB/s ± 0%     ~     (p=0.072 n=10+10)
DigestString/100B-8  3.60GB/s ± 0%  3.87GB/s ± 0%   +7.46%  (p=0.000 n=9+9)
DigestString/4KB-8   11.2GB/s ± 0%  15.4GB/s ± 0%  +37.27%  (p=0.000 n=10+10)
DigestString/10MB-8  11.8GB/s ± 0%  16.6GB/s ± 0%  +40.90%  (p=0.000 n=10+10)

This instance type is currently in customer preview, so it's not widely available, but once it is, this code will definitely be worthwhile.

@AWSjswinney

I've yet to try whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there's a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.

There aren't NEON instructions that support 64-bit-in, 64-bit-out multiply-adds, so using these scalar instructions as written is probably the best we can do for now. However, SVE does include instructions like this, so it's possible a newer implementation could see even better performance on CPUs with SVE support.

@cespare (Owner) commented Jan 25, 2022

@AWSjswinney very nice! That's helpful and does give a lot more motivation for this change. (Also, the absolute numbers from that chip look great.)

@lizthegrey

SVE

The challenge is going to be detecting whether the CPU supports SVE, as well as Go implementing support for the SVE opcodes. Until then I think we have to default to compatibility and just go with what's supported by default in Go. See golang/go#44734.

There's no GOARM64= yet, only GOARM= and GOAMD64=, so adding that might also be helpful for letting Go know which variant to use.

@cespare (Owner) commented Jan 27, 2022

The challenge is going to be detecting whether the CPU supports SVE

Some googling led me to this page, which says:

Support for the execution of SVE instructions in userspace can also be detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS instruction, and checking that the value of the SVE field is nonzero.

We could probably do that at startup. But a question is whether adding a (predictable) branch to every hash function call is too slow.
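As a rough sketch of such a dispatch (hypothetical names throughout: detectSVE, hasSVE, sum64SVE, and sum64Scalar are stand-ins, not code from this PR):

```go
package xxhash

// hasSVE would be set once at startup by a feature probe (MRS-based or
// HWCAP-based, as discussed below); this placeholder always says no.
var hasSVE = detectSVE()

func detectSVE() bool { return false } // placeholder probe

func sum64SVE(b []byte) uint64    { return sum64Scalar(b) } // stand-in
func sum64Scalar(b []byte) uint64 { return uint64(len(b)) } // stand-in

// Sum64 pays one well-predicted branch per call to select the fast path.
func Sum64(b []byte) uint64 {
	if hasSVE {
		return sum64SVE(b)
	}
	return sum64Scalar(b)
}
```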

as well as Go implementing support for the SVE opcodes.

Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.

There's not yet GOARM64=, only GOARM= and GOAMD64= so that step might also be helpful for letting Go know which variant to use.

As far as I know those env variables are only useful for telling the compiler what instructions it's allowed to use. A missing piece here is having build tags to select certain CPU features. That's tracked by golang/go#45454. So we'd need both new GOARM64 variants that allow for selecting for some kind of baseline feature level that includes SVE and golang/go#45454 to make that available as a build tag. With all of that we'd be able to avoid runtime feature detection.
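To sketch what that could eventually look like (purely hypothetical: the "sve" feature tag below does not exist today, and neither does a GOARM64 baseline that implies it):

```go
// Hypothetical file split once golang/go#45454 lands:
//
// xxhash_arm64_sve.s:
//go:build arm64 && sve && !purego
//
// xxhash_arm64_base.s:
//go:build arm64 && !sve && !purego
```

The selection would then happen entirely at build time, with no runtime feature check in either binary.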

@klauspost

Support for the execution of SVE instructions in userspace can also be detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS instruction, and checking that the value of the SVE field is nonzero.

Careful. This is a protected instruction, and requires the OS to intercept the call and handle it. Not all operating systems do that, so your program can crash on init.

In cpuid the detection is implemented in a rather complex way. There may be simpler methods, but this is what I found, not being any sort of ARM expert.

On Linux, we first attempt it by hooking into the internal runtime CPU detection, which the runtime populates with data that isn't exposed publicly. AFAICT there is no other way to get this information:

//go:linkname hwcap internal/cpu.HWCap
var hwcap uint

If this isn't set, /proc/self/auxv is read (it's not available on Android, though, so it must fail gracefully).

If the CPUID flag isn't set by any of the above, we do not read MIDR_EL1, since it may crash.

For macOS, I have not found any method for determining features, so just a base set is enabled.
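For concreteness, a minimal sketch of the /proc/self/auxv fallback described above (Linux/arm64 only; the AT_HWCAP key and HWCAP_SVE bit are taken from the Linux uapi headers, to the best of my knowledge):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	atHWCAP  = 16      // AT_HWCAP: auxv key for the hwcap bitmask
	hwcapSVE = 1 << 22 // HWCAP_SVE bit on arm64 Linux
)

func main() {
	buf, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		fmt.Println("auxv unavailable, failing gracefully:", err)
		return
	}
	// auxv is a flat array of (key, value) pairs of native 64-bit words.
	for i := 0; i+16 <= len(buf); i += 16 {
		key := binary.LittleEndian.Uint64(buf[i:])
		val := binary.LittleEndian.Uint64(buf[i+8:])
		if key == atHWCAP {
			fmt.Println("SVE supported:", val&hwcapSVE != 0)
			return
		}
	}
	fmt.Println("AT_HWCAP not found")
}
```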

@lizthegrey

Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.

That has a bit of a cost in terms of readability, but it'll work in the short term...

In any event: there's some benefit for some real-world use cases and no evidence of significant harm. As @AWSjswinney says, it's bottlenecking on floating-point math on all but the newest hardware, and there are significant speedups on the newest hardware. Should we go ahead and merge this as it is?

@AWSjswinney

Adding an SVE implementation would be speculative at this point, and it wouldn't be beneficial until CPUs with SVE are widely available. By that time, the Go assembler should support all of the necessary SVE instructions. I haven't looked at the amd64 implementation much, but I suspect something similar may be possible for that architecture with SSE or AVX. It's just an idea for the future.

I definitely think it makes sense to merge as is for now.

@lizthegrey

Ping @cespare, any chance we can merge this so we can keep iterating on it in the future? The timing differences are basically noise when the bottleneck is the multiply units on older hardware, and it shows substantial benefit when that bottleneck is removed by the newest generation of hardware.

@cespare (Owner) commented Feb 23, 2022

@lizthegrey this remains on my TODO list. I need to understand and review the code.

@lizthegrey commented May 24, 2022

c7g (Graviton3) is now generally available. Profiling shows this assembly code running at least 17% faster than on Graviton2.

https://flamegraph.com/share/18a16815-db9e-11ec-b50b-a2e3bd482a0d

@snadampal

Any hope of getting this reviewed and merged soon?

@cespare (Owner) left a comment

Hey @greatroar, awesome work on this! I finally had a chance to sit down with this code (and an aarch64 manual) and wrap my head around it. I learned a lot! This code is really nice.

I have several questions for you about things which seem like they can be simplified. They mostly seem like things you probably did for performance reasons, but in my tests they didn't make a difference, so I want to understand your reasoning in order to decide whether to apply these simplifications.

I'm hoping you can answer my questions, but after that I'm happy to merge this and make any of the changes I mentioned myself as a follow-up. (My WIP branch for that is arm64-merge.)

For the record, here are the benchmark results I'm currently looking at, comparing the pure Go version against your arm64 asm version (plus my tweaks, which make approximately no difference).

A c6g instance (graviton2):

name                 old speed      new speed      delta
Sum64/4B-4            476MB/s ± 0%   676MB/s ± 0%  +41.98%  (p=0.001 n=7+7)
Sum64/16B-4          1.12GB/s ± 0%  1.61GB/s ± 0%  +43.63%  (p=0.001 n=7+7)
Sum64/100B-4         2.04GB/s ± 0%  2.07GB/s ± 0%   +1.38%  (p=0.001 n=7+7)
Sum64/4KB-4          3.28GB/s ± 0%  3.27GB/s ± 0%   -0.09%  (p=0.001 n=7+6)
Sum64/10MB-4         3.32GB/s ± 0%  3.32GB/s ± 0%     ~     (p=0.078 n=7+7)
Sum64String/4B-4      449MB/s ± 0%   491MB/s ± 0%   +9.34%  (p=0.001 n=7+6)
Sum64String/16B-4    1.09GB/s ± 0%  1.38GB/s ± 0%  +26.67%  (p=0.001 n=7+7)
Sum64String/100B-4   2.03GB/s ± 0%  2.08GB/s ± 0%   +2.63%  (p=0.001 n=7+7)
Sum64String/4KB-4    3.27GB/s ± 0%  3.27GB/s ± 0%   -0.01%  (p=0.001 n=6+7)
Sum64String/10MB-4   3.32GB/s ± 0%  3.32GB/s ± 0%     ~     (p=0.073 n=6+7)
DigestBytes/4B-4      210MB/s ± 0%   209MB/s ± 0%   -0.36%  (p=0.001 n=7+6)
DigestBytes/16B-4     673MB/s ± 0%   671MB/s ± 0%   -0.41%  (p=0.001 n=7+7)
DigestBytes/100B-4   1.84GB/s ± 0%  1.80GB/s ± 0%   -2.53%  (p=0.001 n=7+7)
DigestBytes/4KB-4    3.26GB/s ± 0%  3.26GB/s ± 0%   -0.07%  (p=0.001 n=7+7)
DigestBytes/10MB-4   3.32GB/s ± 0%  3.32GB/s ± 0%     ~     (p=0.805 n=7+7)
DigestString/4B-4     202MB/s ± 0%   201MB/s ± 0%   -0.24%  (p=0.001 n=7+7)
DigestString/16B-4    652MB/s ± 0%   649MB/s ± 0%   -0.48%  (p=0.001 n=7+7)
DigestString/100B-4  1.81GB/s ± 0%  1.71GB/s ± 0%   -5.52%  (p=0.001 n=7+7)
DigestString/4KB-4   3.26GB/s ± 0%  3.25GB/s ± 0%   -0.31%  (p=0.001 n=7+7)
DigestString/10MB-4  3.32GB/s ± 0%  3.32GB/s ± 0%   +0.01%  (p=0.036 n=7+6)

A c7g instance (graviton3):

name                 old speed      new speed      delta
Sum64/4B-4            876MB/s ± 0%  1154MB/s ± 0%  +31.78%  (p=0.001 n=7+7)
Sum64/16B-4          1.98GB/s ± 0%  3.79GB/s ± 0%  +91.25%  (p=0.001 n=7+7)
Sum64/100B-4         4.99GB/s ± 0%  7.99GB/s ± 0%  +60.11%  (p=0.001 n=7+6)
Sum64/4KB-4          11.0GB/s ± 0%  16.2GB/s ± 0%  +46.58%  (p=0.004 n=5+6)
Sum64/10MB-4         11.2GB/s ± 1%  17.0GB/s ± 3%  +51.34%  (p=0.001 n=7+7)
Sum64String/4B-4      844MB/s ± 0%  1075MB/s ± 0%  +27.37%  (p=0.001 n=6+7)
Sum64String/16B-4    1.94GB/s ± 0%  3.52GB/s ± 0%  +81.44%  (p=0.001 n=7+7)
Sum64String/100B-4   4.90GB/s ± 0%  7.47GB/s ± 1%  +52.43%  (p=0.001 n=7+7)
Sum64String/4KB-4    11.0GB/s ± 0%  16.1GB/s ± 0%  +46.25%  (p=0.001 n=6+7)
Sum64String/10MB-4   11.2GB/s ± 1%  17.2GB/s ± 1%  +53.87%  (p=0.001 n=7+7)
DigestBytes/4B-4      381MB/s ± 0%   376MB/s ± 0%   -1.21%  (p=0.001 n=7+7)
DigestBytes/16B-4    1.23GB/s ± 0%  1.24GB/s ± 0%   +0.64%  (p=0.002 n=6+6)
DigestBytes/100B-4   4.22GB/s ± 0%  4.30GB/s ± 0%   +1.96%  (p=0.003 n=5+7)
DigestBytes/4KB-4    10.1GB/s ± 0%  15.6GB/s ± 0%  +54.83%  (p=0.001 n=7+7)
DigestBytes/10MB-4   10.3GB/s ± 1%  16.6GB/s ± 1%  +61.03%  (p=0.001 n=7+6)
DigestString/4B-4     339MB/s ± 0%   336MB/s ± 0%   -0.93%  (p=0.008 n=5+5)
DigestString/16B-4   1.21GB/s ± 0%  1.23GB/s ± 0%   +1.87%  (p=0.001 n=7+6)
DigestString/100B-4  4.13GB/s ± 0%  4.16GB/s ± 0%   +0.59%  (p=0.001 n=7+7)
DigestString/4KB-4   10.1GB/s ± 0%  15.5GB/s ± 0%  +54.32%  (p=0.001 n=7+6)
DigestString/10MB-4  10.4GB/s ± 0%  16.5GB/s ± 1%  +59.01%  (p=0.001 n=6+7)

(Five review threads on xxhash_arm64.s were resolved, one of them marked outdated; the remaining thread follows.)

round0(x1)
ROR $64-27, h
EOR x1 @> 64-27, h, h
@cespare (Owner)

This is another place where it seems like you have a slightly convoluted sequence, which I'm guessing is a performance optimization? To my mind, the obvious way to write these two lines would be

EOR x1, h
ROR $64-27, h

(and the same for many similar pairs of ROR/EOR below).

Same as before, my benchmarks showed essentially no difference with my simplified version. Were you seeing noticeable improvements from sequencing it this way?
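For reference, both sequences compute h = rol27(h ^ x1): rotation distributes over XOR, so applying the rotation before or after the EOR gives the same result. A quick check in Go, with arbitrary sample values:

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	h, x1 := uint64(0x9e3779b185ebca87), uint64(0xdeadbeefcafef00d)
	// Original: rotate h, then XOR in x1 rotated via the EOR operand.
	// (ROR $64-27 is a right-rotate by 37, i.e. a left-rotate by 27.)
	a := bits.RotateLeft64(h, 27) ^ bits.RotateLeft64(x1, 27)
	// Simplified: XOR first, then rotate the combined value.
	b := bits.RotateLeft64(h^x1, 27)
	fmt.Println(a == b) // true
}
```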

@greatroar (Contributor, Author)

Actually, this is

ROR $64-27, x1
EOR x1, h

except that it doesn't write to x1, but that doesn't matter here.

I don't recall benchmarking this specifically, but I've found that code density in general can matter a lot on smaller devices.

@cespare (Owner)

Actually, this is

ROR $64-27, x1
EOR x1, h

I don't think that's correct. I double-checked that my two instructions do the same thing as the original (effectively h = rol27(h ^ x1)), but mine seems simpler because it does the XOR first, so it doesn't need the rotated-register operand on the EOR instruction. Your two instructions above seem to be doing something different.

I've found that code density in general can matter a lot on smaller devices.

What do you mean by code density here? It's two instructions either way, right?

@greatroar (Contributor, Author)

You're right. Sorry, I thought you meant only the last instruction.

@cespare (Owner)

After doing more thorough benchmarking I found that ordering the EOR after the ROR, with a rotated register, was a small but measurable amount faster for some input sizes on some of my test machines. I ended up leaving that in.

Benchmark results on Raspberry Pi 4B, Linux, Go 1.17.1:

name                 old speed      new speed      delta
Sum64/4B-4            180MB/s ± 0%   251MB/s ± 0%  +39.13%  (p=0.000 n=10+10)
Sum64/100B-4          994MB/s ± 0%  1135MB/s ± 0%  +14.25%  (p=0.000 n=10+9)
Sum64/4KB-4          1.92GB/s ± 0%  1.93GB/s ± 0%   +0.43%  (p=0.000 n=10+10)
Sum64/10MB-4         1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.754 n=10+10)
Sum64String/4B-4      133MB/s ± 4%   228MB/s ± 0%  +71.37%  (p=0.000 n=10+9)
Sum64String/100B-4    949MB/s ± 0%  1103MB/s ± 0%  +16.17%  (p=0.000 n=10+10)
Sum64String/4KB-4    1.92GB/s ± 0%  1.93GB/s ± 0%   +0.40%  (p=0.000 n=9+8)
Sum64String/10MB-4   1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.146 n=10+8)
DigestBytes/4B-4     61.9MB/s ± 0%  61.9MB/s ± 0%     ~     (p=0.158 n=10+9)
DigestBytes/100B-4    695MB/s ± 0%   719MB/s ± 0%   +3.37%  (p=0.000 n=10+10)
DigestBytes/4KB-4    1.89GB/s ± 0%  1.90GB/s ± 0%   +0.43%  (p=0.000 n=9+10)
DigestBytes/10MB-4   1.88GB/s ± 0%  1.89GB/s ± 0%   +0.92%  (p=0.000 n=10+9)
DigestString/4B-4    58.9MB/s ± 0%  58.5MB/s ± 1%   -0.60%  (p=0.032 n=8+10)
DigestString/100B-4   669MB/s ± 0%   696MB/s ± 1%   +4.05%  (p=0.000 n=10+10)
DigestString/4KB-4   1.89GB/s ± 0%  1.89GB/s ± 0%   +0.34%  (p=0.000 n=10+10)
DigestString/10MB-4  1.88GB/s ± 0%  1.89GB/s ± 0%   +0.90%  (p=0.000 n=10+10)
@cespare merged commit 3211b29 into cespare:main on Dec 4, 2022
@cespare (Owner) left a comment

Thanks very much for this contribution @greatroar! I merged this and released v2.2.0.



klauspost pushed a commit to klauspost/compress that referenced this pull request Dec 4, 2022
* [zstd] sync xxhash with final accepted patch upstream

Syncs with cespare/xxhash#51

* asmfmt
@greatroar deleted the asm-arm64 branch on December 4, 2022