Assembler implementation for arm64 #51
Conversation
Awesome! It might take me a while to find the time to review this because I'm not too familiar with arm64 yet.
I don't have physical access to hardware, but I can test on graviton2 instances in EC2 later if that helps.
Force-pushed from 8145830 to f6daa84
@greatroar sorry for the delay. I forgot about this for a while. I started taking a look but it will take me a bit to get up to speed on arm assembly. I'm pretty swamped with work atm so it may be a bit before I finish reviewing. Feel free to ping me if you haven't heard back in a couple of weeks.
Rebased to fix the merge commit. I've also finally found time to run this on actual hardware, a Raspberry Pi 4B. Benchmark results:
I'll admit this is disappointing, but there is a speedup for small inputs. I find it hard to explain why the speedup stops at 4KiB. This CPU has a 32KiB L1 cache, so it shouldn't be running at L2 speed for that input size.
Hey @greatroar, sorry I still haven't had time to review the code. It's still on my radar. I did have a few minutes to spin up an EC2 instance to try the benchmarks on a beefier arm64 chip (I used a c6g.metal instance). The results are similar to your Raspberry Pi results but even more pronounced:
It's suspicious that the speeds are so nearly identical for larger block sizes. It seems like we must be hitting a hard bottleneck of some kind, or else the generated code on the main path must be almost the same as the handwritten code.
The generated code is very similar, the main difference being the use of LDP/STP to transfer 16-byte blocks in the handwritten code. That doesn't seem to buy much. The reason it's faster on smaller inputs may be that the handwritten code does everything in registers, while the compiled code spills some stuff on the stack, and it loads the primes faster.
😃
Please let me know if you need help testing this; I'm always happy to volunteer to make things run faster on arm64.
see cespare/xxhash#51

```
benchmark old ns/op new ns/op delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 14740529 14093294 -4.39%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 3003737 3008035 +0.14%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 70052218 70885931 +1.19%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 33424884 34714144 +3.86%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 9694846 9563735 -1.35%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 12322865 12782681 +3.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 7290319 7134523 -2.14%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 825251 816683 -1.04%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 462903 486823 +5.17%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 1333159120 1114784390 -16.38%
BenchmarkDecoder_DecoderSmall/html.zst-16 3117950 3095118 -0.73%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 272085 271508 -0.21%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 1411294 1407632 -0.26%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 370870 367499 -0.91%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 4721330 4718339 -0.06%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 3517766 3487756 -0.85%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 1186672 1180367 -0.53%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 1498383 1502922 +0.30%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 748113 742537 -0.75%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 99132 98206 -0.93%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 53805 53209 -1.11%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 4041818 4028347 -0.33%
BenchmarkDecoder_DecodeAll/html.zst-16 387794 383309 -1.16%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 34390 34296 -0.27%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 89510 88785 -0.81%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 23315 23128 -0.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 306437 325176 +6.12%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 226433 222589 -1.70%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 74612 74182 -0.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 95066 94304 -0.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 47498 46946 -1.16%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 6291 6237 -0.86%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 3498 3453 -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 295850 318916 +7.80%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 24340 24196 -0.59%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 2220 2199 -0.95%

benchmark old MB/s new MB/s speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 100.03 104.63 1.05x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 315.84 315.39 1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 55.03 54.38 0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 102.14 98.35 0.96x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 103.30 104.71 1.01x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 98.74 95.18 0.96x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 449.47 459.29 1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 992.67 1003.08 1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 2127.32 2022.80 0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 4.21 5.04 1.20x
BenchmarkDecoder_DecoderSmall/html.zst-16 262.74 264.67 1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 119.84 120.10 1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 130.60 130.94 1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 319.76 322.69 1.01x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 102.06 102.13 1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 121.31 122.36 1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 105.49 106.05 1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 101.50 101.20 1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 547.51 551.62 1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 1032.97 1042.70 1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 2287.77 2313.39 1.01x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 173.71 174.29 1.00x
BenchmarkDecoder_DecodeAll/html.zst-16 264.06 267.15 1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 118.52 118.85 1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 2059.22 2076.03 1.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 5086.37 5127.46 1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 1572.47 1481.85 0.94x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 1884.68 1917.22 1.02x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 1677.73 1687.46 1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 1599.82 1612.75 1.01x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 8623.46 8724.94 1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 16276.14 16418.01 1.01x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 35194.18 35648.95 1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 2373.12 2201.48 0.93x
BenchmarkDecoder_DecodeAllParallel/html.zst-16 4207.06 4232.02 1.01x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 1835.78 1853.29 1.01x

benchmark old allocs new allocs delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 1 3 +200.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 4 5 +25.00%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 52 50 -3.85%
BenchmarkDecoder_DecoderSmall/html.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 1 1 +0.00%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 0 0 +0.00%

benchmark old bytes new bytes delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 89288 113942 +27.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 10888 12311 +13.07%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 981385 1492955 +52.13%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 385552 367000 -4.81%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 30752 30503 -0.81%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 108837 56868 -47.75%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 75923 97336 +28.20%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 48 48 +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 1964 1909 -2.80%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 25456880 25432640 -0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16 48 48 +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 51 48 -5.88%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 5 0 -100.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 2 0 -100.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 26 0 -100.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 10 0 -100.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 1 0 -100.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 234 231 -1.28%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 38 38 +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 1991 2287 +14.87%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 1501 1298 -13.52%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 132 130 -1.52%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 214 220 +2.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 259 258 -0.39%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 9 9 +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 6 6 +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 2868 3358 +17.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 34 34 +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 0 0 +0.00%
```
@lizthegrey the problem is more that testing has shown the improvement to be marginal.
We see a 7% improvement in time spent in xxhash from adopting this within the @klauspost zstd library (https://share.getcloudapp.com/GGuEkRnR). IMO this is worth merging.
@lizthegrey it would be helpful to know how the results you've screenshotted correspond to the microbenchmarks we've been discussing above. Do the zstd benchmarks run xxhash over mostly very short strings, for instance? I took another look at the benchmarks using Go 1.18beta1. (I also added a 16-byte comparison on main.) Here are the results, again using a c6g.metal EC2 instance:
We still see that the asm is significantly faster only for small strings (4 and 16 bytes) when using the one-shot hashing functions, and the advantage has narrowed a little due to compiler improvements for arm64 in Go 1.18. There is also a measurable slowdown from the asm in a couple of the benchmarks. For context, here is a comparison on amd64 (using a c5.metal instance) between purego and asm:
You can see that the assembly version offers a clear benefit over pure Go across the board, with the largest improvement going to the biggest inputs. I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash function itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure Go with each passing release, given the large arm64 improvements that keep landing in the compiler. That said, I would like to sit down and understand the implementation and see if there are ways to improve it, particularly the handling of full blocks. I'll try to make some time for that soon.
This is a valid concern. Let me add another data point. Graviton 2 and the Raspberry Pi 4 may not see significant improvement because they are most likely bottlenecking on the floating-point/SIMD pipelines. Other CPUs, such as Graviton 3, will see improvement from this change. I tested this code as written on a c7g.2xlarge with Go 1.17.
This instance type is currently in customer preview, so it's not widely available, but once it is, this code will definitely be worthwhile.
There aren't NEON instructions that support 64-bit-in, 64-bit-out multiply-adds, so using these scalar instructions as written is probably the best we can do for now. However, SVE does include instructions like this, so it's possible a newer implementation could see even better performance on CPUs with SVE support.
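For reference, the per-lane step that needs those full 64-bit multiplies looks like this in Go (a sketch of the standard xxHash64 round, not code from this PR):

```
package main

import (
	"fmt"
	"math/bits"
)

const (
	prime64_1 = 0x9E3779B185EBCA87
	prime64_2 = 0xC2B2AE3D27D4EB4F
)

// round is the standard xxHash64 per-lane step: a full 64-bit multiply-add
// followed by a rotate and another 64-bit multiply. NEON has no 64-bit-lane
// equivalent of these multiplies, which is why the main loop stays scalar.
func round(acc, input uint64) uint64 {
	acc += input * prime64_2
	acc = bits.RotateLeft64(acc, 31)
	return acc * prime64_1
}

func main() {
	var acc uint64 = 0x60EA27EEADC0B5D6 // v1's starting value for seed 0 (prime64_1 + prime64_2, wrapped)
	fmt.Printf("%#x\n", round(acc, 0x0123456789ABCDEF))
}
```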
@AWSjswinney very nice! That's helpful and does give a lot more motivation for this change. (Also, the absolute numbers from that chip look great.)
The challenge is going to be detecting whether the CPU supports SVE, as well as Go implementing support for the SVE opcodes. Until then I think we have to default to the compatible path and just go with what's supported by default in Go; see golang/go#44734. There's no GOARM64= yet, only GOARM= and GOAMD64=, so adding that might also be helpful for letting Go know which variant to use.
Some googling led me to this page, which says:
We could probably do that at startup. But a question is whether adding a (predictable) branch to every hash function call is too slow.
Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.
As far as I know those environment variables are only useful for telling the compiler which instructions it's allowed to use. A missing piece here is having build tags to select certain CPU features; that's tracked by golang/go#45454. So we'd need both new GOARM64 variants that allow selecting some kind of baseline feature level that includes SVE, and golang/go#45454 to make that available as a build tag. With all of that we'd be able to avoid runtime feature detection.
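Until such build tags exist, the startup-detection-plus-branch idea from above would look roughly like this (just a sketch: sum64SVE and sum64Scalar are made-up names, and golang.org/x/sys/cpu is only one way to read the feature flag):

```
// Sketch only: sum64SVE and sum64Scalar are hypothetical names, and
// golang.org/x/sys/cpu is just one way to get at the kernel-reported feature bits.
package xxhash

import "golang.org/x/sys/cpu"

// Resolved once at package init; each call then pays only a single,
// well-predicted branch.
var useSVE = cpu.ARM64.HasSVE

// Sum64 dispatches to the SVE routine when the CPU supports it.
func Sum64(b []byte) uint64 {
	if useSVE {
		return sum64SVE(b)
	}
	return sum64Scalar(b)
}

// Placeholder bodies so the sketch compiles; in practice these would be the
// assembly routines.
func sum64SVE(b []byte) uint64    { return 0 }
func sum64Scalar(b []byte) uint64 { return 0 }
```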
Careful. This is a protected instruction, and it requires the OS to intercept the call and handle it. Not all operating systems do that, so your program can crash on init. In cpuid the detection is implemented in a rather complex way. There may be simpler methods, but this is what I found - not being any sort of arm expert. On Linux, we first attempt it by hooking into the internal runtime CPU detection, which the runtime populates with data that isn't exposed publicly. AFAICT there is no other way to get this information:

If the CPUID flag isn't set by any of the above, we do not read MIDR_EL1, since it may crash. For OSX, I have not found any method for determining features, so just a base set is enabled.
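For illustration only (this is not the actual cpuid code): on Linux the kernel-reported HWCAPs can also be read from the auxiliary vector, which avoids touching MIDR_EL1 entirely. A minimal sketch, assuming AT_HWCAP (tag 16) and the arm64 HWCAP_SVE bit (1 << 22):

```
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	atHWCAP  = 16      // AT_HWCAP auxv tag
	hwcapSVE = 1 << 22 // HWCAP_SVE bit on arm64 Linux
)

func hasSVE() bool {
	buf, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		return false // can't tell; fall back to the baseline feature set
	}
	// auxv is a sequence of (tag, value) uint64 pairs, terminated by AT_NULL.
	for i := 0; i+16 <= len(buf); i += 16 {
		tag := binary.LittleEndian.Uint64(buf[i:])
		val := binary.LittleEndian.Uint64(buf[i+8:])
		if tag == atHWCAP {
			return val&hwcapSVE != 0
		}
	}
	return false
}

func main() {
	fmt.Println("SVE supported:", hasSVE())
}
```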
That has a bit of a cost in terms of readability, but it'll work in the short term. In any event: there's some benefit for some real-world use cases, no evidence of significant harm, and, as @AWSjswinney says, it's bottlenecking on floating-point math on all but the newest hardware while showing significant speedups on the newest hardware. Should we go ahead and merge this as it is?
Adding an SVE implementation would be speculative at this point, and it wouldn't be beneficial until CPUs with SVE are widely available. By that time, the Go assembler should support all of the necessary SVE instructions. I haven't looked at the amd64 implementation much, but I suspect something similar may be possible for that architecture with SSE or AVX. It's just an idea for the future. I definitely think it makes sense to merge as is for now.
Ping @cespare, any chance we can merge this so we can keep iterating on it in the future? The timing differences are basically noise when the bottleneck is the multiply units on older hardware, and there's a substantial benefit once that bottleneck is removed on the newest-generation hardware.
@lizthegrey this remains on my TODO list. I need to understand and review the code.
c7g (Graviton 3) is now generally available. Profiling shows that this assembly code runs at least 17% faster there than on Graviton 2. https://flamegraph.com/share/18a16815-db9e-11ec-b50b-a2e3bd482a0d
Any hope to get this reviewed and merged soon? |
Hey @greatroar, awesome work on this! I finally had a chance to sit down with this code (and an aarch64 manual) and wrap my head around it. I learned a lot! This code is really nice.
I have several questions for you about things which seem like they can be simplified. They mostly seem like things you probably did for performance reasons but in my tests they didn't make a difference so I want to understand your reasoning in order to decide whether to apply these simplifications.
I'm hoping you can answer my questions but after that I'm happy to merge this and make any of the changes I mentioned myself as a follow-up. (My WIP branch for that is arm64-merge.)
For the record here are the benchmark results I'm currently looking at, comparing the pure Go version against your arm64 asm version (plus my tweaks, which make approximately no difference).
A c6g instance (graviton2):
name old speed new speed delta
Sum64/4B-4 476MB/s ± 0% 676MB/s ± 0% +41.98% (p=0.001 n=7+7)
Sum64/16B-4 1.12GB/s ± 0% 1.61GB/s ± 0% +43.63% (p=0.001 n=7+7)
Sum64/100B-4 2.04GB/s ± 0% 2.07GB/s ± 0% +1.38% (p=0.001 n=7+7)
Sum64/4KB-4 3.28GB/s ± 0% 3.27GB/s ± 0% -0.09% (p=0.001 n=7+6)
Sum64/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.078 n=7+7)
Sum64String/4B-4 449MB/s ± 0% 491MB/s ± 0% +9.34% (p=0.001 n=7+6)
Sum64String/16B-4 1.09GB/s ± 0% 1.38GB/s ± 0% +26.67% (p=0.001 n=7+7)
Sum64String/100B-4 2.03GB/s ± 0% 2.08GB/s ± 0% +2.63% (p=0.001 n=7+7)
Sum64String/4KB-4 3.27GB/s ± 0% 3.27GB/s ± 0% -0.01% (p=0.001 n=6+7)
Sum64String/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.073 n=6+7)
DigestBytes/4B-4 210MB/s ± 0% 209MB/s ± 0% -0.36% (p=0.001 n=7+6)
DigestBytes/16B-4 673MB/s ± 0% 671MB/s ± 0% -0.41% (p=0.001 n=7+7)
DigestBytes/100B-4 1.84GB/s ± 0% 1.80GB/s ± 0% -2.53% (p=0.001 n=7+7)
DigestBytes/4KB-4 3.26GB/s ± 0% 3.26GB/s ± 0% -0.07% (p=0.001 n=7+7)
DigestBytes/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.805 n=7+7)
DigestString/4B-4 202MB/s ± 0% 201MB/s ± 0% -0.24% (p=0.001 n=7+7)
DigestString/16B-4 652MB/s ± 0% 649MB/s ± 0% -0.48% (p=0.001 n=7+7)
DigestString/100B-4 1.81GB/s ± 0% 1.71GB/s ± 0% -5.52% (p=0.001 n=7+7)
DigestString/4KB-4 3.26GB/s ± 0% 3.25GB/s ± 0% -0.31% (p=0.001 n=7+7)
DigestString/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% +0.01% (p=0.036 n=7+6)
A c7g instance (graviton3):
name old speed new speed delta
Sum64/4B-4 876MB/s ± 0% 1154MB/s ± 0% +31.78% (p=0.001 n=7+7)
Sum64/16B-4 1.98GB/s ± 0% 3.79GB/s ± 0% +91.25% (p=0.001 n=7+7)
Sum64/100B-4 4.99GB/s ± 0% 7.99GB/s ± 0% +60.11% (p=0.001 n=7+6)
Sum64/4KB-4 11.0GB/s ± 0% 16.2GB/s ± 0% +46.58% (p=0.004 n=5+6)
Sum64/10MB-4 11.2GB/s ± 1% 17.0GB/s ± 3% +51.34% (p=0.001 n=7+7)
Sum64String/4B-4 844MB/s ± 0% 1075MB/s ± 0% +27.37% (p=0.001 n=6+7)
Sum64String/16B-4 1.94GB/s ± 0% 3.52GB/s ± 0% +81.44% (p=0.001 n=7+7)
Sum64String/100B-4 4.90GB/s ± 0% 7.47GB/s ± 1% +52.43% (p=0.001 n=7+7)
Sum64String/4KB-4 11.0GB/s ± 0% 16.1GB/s ± 0% +46.25% (p=0.001 n=6+7)
Sum64String/10MB-4 11.2GB/s ± 1% 17.2GB/s ± 1% +53.87% (p=0.001 n=7+7)
DigestBytes/4B-4 381MB/s ± 0% 376MB/s ± 0% -1.21% (p=0.001 n=7+7)
DigestBytes/16B-4 1.23GB/s ± 0% 1.24GB/s ± 0% +0.64% (p=0.002 n=6+6)
DigestBytes/100B-4 4.22GB/s ± 0% 4.30GB/s ± 0% +1.96% (p=0.003 n=5+7)
DigestBytes/4KB-4 10.1GB/s ± 0% 15.6GB/s ± 0% +54.83% (p=0.001 n=7+7)
DigestBytes/10MB-4 10.3GB/s ± 1% 16.6GB/s ± 1% +61.03% (p=0.001 n=7+6)
DigestString/4B-4 339MB/s ± 0% 336MB/s ± 0% -0.93% (p=0.008 n=5+5)
DigestString/16B-4 1.21GB/s ± 0% 1.23GB/s ± 0% +1.87% (p=0.001 n=7+6)
DigestString/100B-4 4.13GB/s ± 0% 4.16GB/s ± 0% +0.59% (p=0.001 n=7+7)
DigestString/4KB-4 10.1GB/s ± 0% 15.5GB/s ± 0% +54.32% (p=0.001 n=7+6)
DigestString/10MB-4 10.4GB/s ± 0% 16.5GB/s ± 1% +59.01% (p=0.001 n=6+7)
round0(x1)
ROR $64-27, h
EOR x1 @> 64-27, h, h
This is another place where it seems like you have a slightly convoluted sequence which I'm guessing is a performance optimization? To my mind, the obvious way to write these two lines would be
EOR x1, h
ROR $64-27, h
(and the same for many similar pairs of ROR/EOR below).
Same as before, my benchmarks showed essentially no difference with my simplified version. Were you seeing noticeable improvements from sequencing it this way?
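For reference, both orderings compute the same update; in Go terms (a throwaway sketch):

```
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	h := uint64(0x27D4EB2F165667C5)  // arbitrary accumulator value for the demo
	x1 := uint64(0x0123456789ABCDEF) // a lane that has already been through round0

	// Original order: ROR $64-27, h; EOR x1 @> 64-27, h, h
	a := bits.RotateLeft64(h, 27) ^ bits.RotateLeft64(x1, 27)

	// Proposed order: EOR x1, h; ROR $64-27, h
	b := bits.RotateLeft64(h^x1, 27)

	fmt.Println(a == b) // true: both are h = rol27(h ^ x1)
}
```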
Actually, this is
ROR $64-27, x1
EOR x1, h
except that it doesn't write to x1, but that doesn't matter here.
I don't recall benchmarking this specifically, but I've found that code density in general can matter a lot on smaller devices.
Actually, this is
ROR $64-27, x1
EOR x1, h
I don't think that's correct. I double-checked that my two instructions do the same thing as the original (effectively: h = rol27(h^x1)) but mine seems simpler because it's doing the xor first so it doesn't need the ROR addressing mode on the EOR instruction. Your two instructions above seem to be doing something different.
I've found that code density in general can matter a lot on smaller devices.
What do you mean by code density here? It's two instructions either way, right?
You're right. Sorry, I thought you meant only the last instruction.
After doing more thorough benchmarking I found that ordering the EOR after the ROR with a rotated register was a small, but measurable amount faster for some input sizes on some of my test machines. I ended up leaving that in.
Benchmark results on Raspberry Pi 4B, Linux, Go 1.17.1:
name                 old speed      new speed      delta
Sum64/4B-4           180MB/s ± 0%   251MB/s ± 0%   +39.13%  (p=0.000 n=10+10)
Sum64/100B-4         994MB/s ± 0%   1135MB/s ± 0%  +14.25%  (p=0.000 n=10+9)
Sum64/4KB-4          1.92GB/s ± 0%  1.93GB/s ± 0%  +0.43%   (p=0.000 n=10+10)
Sum64/10MB-4         1.88GB/s ± 0%  1.88GB/s ± 0%  ~        (p=0.754 n=10+10)
Sum64String/4B-4     133MB/s ± 4%   228MB/s ± 0%   +71.37%  (p=0.000 n=10+9)
Sum64String/100B-4   949MB/s ± 0%   1103MB/s ± 0%  +16.17%  (p=0.000 n=10+10)
Sum64String/4KB-4    1.92GB/s ± 0%  1.93GB/s ± 0%  +0.40%   (p=0.000 n=9+8)
Sum64String/10MB-4   1.88GB/s ± 0%  1.88GB/s ± 0%  ~        (p=0.146 n=10+8)
DigestBytes/4B-4     61.9MB/s ± 0%  61.9MB/s ± 0%  ~        (p=0.158 n=10+9)
DigestBytes/100B-4   695MB/s ± 0%   719MB/s ± 0%   +3.37%   (p=0.000 n=10+10)
DigestBytes/4KB-4    1.89GB/s ± 0%  1.90GB/s ± 0%  +0.43%   (p=0.000 n=9+10)
DigestBytes/10MB-4   1.88GB/s ± 0%  1.89GB/s ± 0%  +0.92%   (p=0.000 n=10+9)
DigestString/4B-4    58.9MB/s ± 0%  58.5MB/s ± 1%  -0.60%   (p=0.032 n=8+10)
DigestString/100B-4  669MB/s ± 0%   696MB/s ± 1%   +4.05%   (p=0.000 n=10+10)
DigestString/4KB-4   1.89GB/s ± 0%  1.89GB/s ± 0%  +0.34%   (p=0.000 n=10+10)
DigestString/10MB-4  1.88GB/s ± 0%  1.89GB/s ± 0%  +0.90%   (p=0.000 n=10+10)
Thanks very much for this contribution @greatroar! I merged this and released v2.2.0.
* [zstd] sync xxhash with final accepted patch upstream
  Syncs with cespare/xxhash#51
* asmfmt
Fixes #45, but I haven't yet benchmarked this on actual ARM hardware. On amd64 with qemu, I get the following results after cherry-picking #50 (I have a version that includes a variant of #42 as well):
I've yet to try whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there's a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.
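To make the unaligned-access point concrete, this is roughly what the equivalent 8-byte load looks like on the pure-Go path (illustrative sketch only):

```
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := []byte{0, 1, 2, 3, 4, 5, 6, 7, 8}

	// The pure-Go path reads input 8 bytes at a time like this. On arm64 the
	// compiler can lower it to a single 64-bit load with no alignment check,
	// so it makes the same unaligned-access assumption as the handwritten loads.
	u := binary.LittleEndian.Uint64(buf[1:9]) // deliberately misaligned slice
	fmt.Printf("%#x\n", u)
}
```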