Assembler implementation for arm64 #51
Conversation
Awesome! It might take me a while to find the time to review this because I'm not too familiar with arm64 yet.
I don't have physical access to hardware, but I can test on graviton2 instances in EC2 later if that helps.
Force-pushed from 8145830 to f6daa84
@greatroar sorry for the delay. I forgot about this for a while. I started taking a look but it will take me a bit to get up to speed on arm assembly. I'm pretty swamped with work atm so it may be a bit before I finish reviewing. Feel free to ping me if you haven't heard back in a couple of weeks.
Rebased to fix the merge commit. I've also finally found time to run this on actual hardware, a Raspberry Pi 4B. Benchmark results:
I'll admit this is disappointing, but there is a speedup for small inputs. I find it hard to explain why the speedup stops at 4KiB. This CPU has a 32KiB L1 cache, so it shouldn't be running at L2 speed for that input size.
Hey @greatroar, sorry I still haven't had time to review the code. It's still on my radar. I did have a few minutes to spin up an EC2 instance to try the benchmarks on a beefier arm64 chip (I used a c6g.metal instance). The results are similar to your Raspberry Pi results but even more pronounced:
It's suspicious that the speeds are so nearly identical for larger block sizes. It seems like we must be hitting a hard bottleneck of some kind, or else the generated code on the main path must be almost the same as the handwritten code.
The generated code is very similar, the main difference being the use of LDP/STP to transfer 16-byte blocks in the handwritten code. That doesn't seem to buy much. The reason it's faster on smaller inputs may be that the handwritten code does everything in registers, while the compiled code spills some stuff on the stack, and it loads the primes faster.
😃
Please let me know if you need help testing this; I'm always happy to volunteer to make things run faster on arm64.
see cespare/xxhash#51

```
benchmark old ns/op new ns/op delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 14740529 14093294 -4.39%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 3003737 3008035 +0.14%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 70052218 70885931 +1.19%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 33424884 34714144 +3.86%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 9694846 9563735 -1.35%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 12322865 12782681 +3.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 7290319 7134523 -2.14%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 825251 816683 -1.04%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 462903 486823 +5.17%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 1333159120 1114784390 -16.38%
BenchmarkDecoder_DecoderSmall/html.zst-16 3117950 3095118 -0.73%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 272085 271508 -0.21%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 1411294 1407632 -0.26%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 370870 367499 -0.91%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 4721330 4718339 -0.06%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 3517766 3487756 -0.85%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 1186672 1180367 -0.53%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 1498383 1502922 +0.30%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 748113 742537 -0.75%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 99132 98206 -0.93%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 53805 53209 -1.11%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 4041818 4028347 -0.33%
BenchmarkDecoder_DecodeAll/html.zst-16 387794 383309 -1.16%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 34390 34296 -0.27%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 89510 88785 -0.81%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 23315 23128 -0.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 306437 325176 +6.12%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 226433 222589 -1.70%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 74612 74182 -0.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 95066 94304 -0.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 47498 46946 -1.16%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 6291 6237 -0.86%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 3498 3453 -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 295850 318916 +7.80%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 24340 24196 -0.59%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 2220 2199 -0.95%

benchmark old MB/s new MB/s speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 100.03 104.63 1.05x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 315.84 315.39 1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 55.03 54.38 0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 102.14 98.35 0.96x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 103.30 104.71 1.01x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 98.74 95.18 0.96x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 449.47 459.29 1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 992.67 1003.08 1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 2127.32 2022.80 0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 4.21 5.04 1.20x
BenchmarkDecoder_DecoderSmall/html.zst-16 262.74 264.67 1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 119.84 120.10 1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 130.60 130.94 1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 319.76 322.69 1.01x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 102.06 102.13 1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 121.31 122.36 1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 105.49 106.05 1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 101.50 101.20 1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 547.51 551.62 1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 1032.97 1042.70 1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 2287.77 2313.39 1.01x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 173.71 174.29 1.00x
BenchmarkDecoder_DecodeAll/html.zst-16 264.06 267.15 1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 118.52 118.85 1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 2059.22 2076.03 1.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 5086.37 5127.46 1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 1572.47 1481.85 0.94x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 1884.68 1917.22 1.02x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 1677.73 1687.46 1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 1599.82 1612.75 1.01x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 8623.46 8724.94 1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 16276.14 16418.01 1.01x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 35194.18 35648.95 1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 2373.12 2201.48 0.93x
BenchmarkDecoder_DecodeAllParallel/html.zst-16 4207.06 4232.02 1.01x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 1835.78 1853.29 1.01x

benchmark old allocs new allocs delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 1 3 +200.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 4 5 +25.00%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 52 50 -3.85%
BenchmarkDecoder_DecoderSmall/html.zst-16 1 1 +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 1 1 +0.00%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 0 0 +0.00%

benchmark old bytes new bytes delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16 89288 113942 +27.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16 10888 12311 +13.07%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16 981385 1492955 +52.13%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16 385552 367000 -4.81%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16 30752 30503 -0.81%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16 108837 56868 -47.75%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16 75923 97336 +28.20%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16 48 48 +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16 1964 1909 -2.80%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16 25456880 25432640 -0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16 48 48 +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16 51 48 -5.88%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16 5 0 -100.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16 2 0 -100.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16 26 0 -100.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16 10 0 -100.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16 1 0 -100.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16 0 0 +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16 234 231 -1.28%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16 38 38 +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16 1991 2287 +14.87%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16 1501 1298 -13.52%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16 132 130 -1.52%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16 214 220 +2.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16 259 258 -0.39%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16 9 9 +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16 6 6 +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16 2868 3358 +17.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-16 34 34 +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16 0 0 +0.00%
```
@lizthegrey the problem is more that testing has shown the improvement to be marginal.
We see a 7% improvement in time spent in xxhash from adopting this within the @klauspost zstd library (https://share.getcloudapp.com/GGuEkRnR). IMO this is worth merging.
@lizthegrey it would be helpful to know how the results you've screenshotted correspond to the microbenchmarks we've been discussing above. Do the zstd benchmarks run xxhash over mostly very short strings, for instance? I took another look at the benchmarks using Go 1.18beta1. (I also added a 16-byte comparison on main.) Here are the results, again using a c6g.metal EC2 instance:
We still see that the asm is significantly faster only for small strings (4 and 16 bytes) when using the one-shot hashing functions, and the advantage has narrowed a little due to compiler improvements for arm64 in Go 1.18. There is also a measurable slowdown from the asm in a couple of the benchmarks. For context, here is a comparison on amd64 (using a c5.metal instance) between purego and asm:
You can see that the assembly version offers a clear benefit over pure Go across the board, with the largest improvement going to the biggest inputs. I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash function itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure Go with each passing release, given the large arm64 improvements that keep landing in the compiler. That said, I would like to sit down and understand the implementation and see if there are ways to improve it, particularly the handling of full blocks. I'll try to make some time for that soon.
This is a valid concern. Let me add another data point. Graviton 2 and the Raspberry Pi 4 may not see significant improvement because they are most likely bottlenecking on the floating-point/SIMD pipelines. Other CPUs, such as Graviton 3, will see improvement from this change. I tested this code as written on a c7g.2xlarge with Go 1.17.
This instance type is currently in customer preview, so it's not widely available, but once it is, this code will definitely be worthwhile.
There aren't NEON instructions that support 64-bit-in, 64-bit-out multiply-adds, so using these scalar instructions as written is probably the best we can do for now. However, SVE does include instructions like this, so it's possible a newer implementation could see even better performance on CPUs with SVE support.
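For reference, the per-lane step that needs those full 64-bit multiplies looks like this in Go (a sketch of the standard xxHash64 round, not code from this PR):

```
package main

import (
	"fmt"
	"math/bits"
)

const (
	prime64_1 = 0x9E3779B185EBCA87
	prime64_2 = 0xC2B2AE3D27D4EB4F
)

// round is the standard xxHash64 per-lane step: a full 64-bit multiply-add
// followed by a rotate and another 64-bit multiply. NEON has no 64-bit-lane
// equivalent of these multiplies, which is why the main loop stays scalar.
func round(acc, input uint64) uint64 {
	acc += input * prime64_2
	acc = bits.RotateLeft64(acc, 31)
	return acc * prime64_1
}

func main() {
	var acc uint64 = 0x60EA27EEADC0B5D6 // v1's starting value for seed 0 (prime64_1 + prime64_2, wrapped)
	fmt.Printf("%#x\n", round(acc, 0x0123456789ABCDEF))
}
```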
@AWSjswinney very nice! That's helpful and does give a lot more motivation for this change. (Also, the absolute numbers from that chip look great.)
The challenge is going to be detecting whether the CPU supports SVE, as well as Go implementing support for the SVE opcodes. Until then I think we have to default to the compatible path and just go with what's supported by default in Go; see golang/go#44734. There's no GOARM64= yet, only GOARM= and GOAMD64=, so adding that might also be helpful for letting Go know which variant to use.
Some googling led me to this page, which says:
We could probably do that at startup. But a question is whether adding a (predictable) branch to every hash function call is too slow.
Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.
As far as I know those environment variables are only useful for telling the compiler which instructions it's allowed to use. A missing piece here is having build tags to select certain CPU features; that's tracked by golang/go#45454. So we'd need both new GOARM64 variants that allow selecting some kind of baseline feature level that includes SVE, and golang/go#45454 to make that available as a build tag. With all of that we'd be able to avoid runtime feature detection.
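Until such build tags exist, the startup-detection-plus-branch idea from above would look roughly like this (just a sketch: sum64SVE and sum64Scalar are made-up names, and golang.org/x/sys/cpu is only one way to read the feature flag):

```
// Sketch only: sum64SVE and sum64Scalar are hypothetical names, and
// golang.org/x/sys/cpu is just one way to get at the kernel-reported feature bits.
package xxhash

import "golang.org/x/sys/cpu"

// Resolved once at package init; each call then pays only a single,
// well-predicted branch.
var useSVE = cpu.ARM64.HasSVE

// Sum64 dispatches to the SVE routine when the CPU supports it.
func Sum64(b []byte) uint64 {
	if useSVE {
		return sum64SVE(b)
	}
	return sum64Scalar(b)
}

// Placeholder bodies so the sketch compiles; in practice these would be the
// assembly routines.
func sum64SVE(b []byte) uint64    { return 0 }
func sum64Scalar(b []byte) uint64 { return 0 }
```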
Careful. This is a protected instruction, and it requires the OS to intercept the call and handle it. Not all operating systems do that, so your program can crash on init. In cpuid the detection is implemented in a rather complex way. There may be simpler methods, but this is what I found - not being any sort of arm expert. On Linux, we first attempt it by hooking into the internal runtime CPU detection, which the runtime populates with data that isn't exposed publicly. AFAICT there is no other way to get this information:

If the CPUID flag isn't set by any of the above, we do not read MIDR_EL1, since it may crash. For OSX, I have not found any method for determining features, so just a base set is enabled.
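For illustration only (this is not the actual cpuid code): on Linux the kernel-reported HWCAPs can also be read from the auxiliary vector, which avoids touching MIDR_EL1 entirely. A minimal sketch, assuming AT_HWCAP (tag 16) and the arm64 HWCAP_SVE bit (1 << 22):

```
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	atHWCAP  = 16      // AT_HWCAP auxv tag
	hwcapSVE = 1 << 22 // HWCAP_SVE bit on arm64 Linux
)

func hasSVE() bool {
	buf, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		return false // can't tell; fall back to the baseline feature set
	}
	// auxv is a sequence of (tag, value) uint64 pairs, terminated by AT_NULL.
	for i := 0; i+16 <= len(buf); i += 16 {
		tag := binary.LittleEndian.Uint64(buf[i:])
		val := binary.LittleEndian.Uint64(buf[i+8:])
		if tag == atHWCAP {
			return val&hwcapSVE != 0
		}
	}
	return false
}

func main() {
	fmt.Println("SVE supported:", hasSVE())
}
```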
That has a bit of a cost in terms of readability, but it'll work in the short term. In any event: there's some benefit for some real-world use cases, no evidence of significant harm, and, as @AWSjswinney says, it's bottlenecking on floating-point math on all but the newest hardware while showing significant speedups on the newest hardware. Should we go ahead and merge this as it is?
Adding an SVE implementation would be speculative at this point, and it wouldn't be beneficial until CPUs with SVE are widely available. By that time, the Go assembler should support all of the necessary SVE instructions. I haven't looked at the amd64 implementation much, but I suspect something similar may be possible for that architecture with SSE or AVX. It's just an idea for the future. I definitely think it makes sense to merge as is for now.
Ping @cespare, any chance we can merge this so we can keep iterating on it in the future? The timing differences are basically noise when the bottleneck is the multiply units on older hardware, and there's a substantial benefit once that bottleneck is removed on the newest-generation hardware.
@lizthegrey this remains on my TODO list. I need to understand and review the code.
c7g (Graviton 3) is now generally available. Profiling shows that this assembly code runs at least 17% faster there than on Graviton 2. https://flamegraph.com/share/18a16815-db9e-11ec-b50b-a2e3bd482a0d
Any hope to get this reviewed and merged soon? |
Hey @greatroar, awesome work on this! I finally had a chance to sit down with this code (and an aarch64 manual) and wrap my head around it. I learned a lot! This code is really nice.
I have several questions for you about things which seem like they can be simplified. They mostly seem like things you probably did for performance reasons but in my tests they didn't make a difference so I want to understand your reasoning in order to decide whether to apply these simplifications.
I'm hoping you can answer my questions but after that I'm happy to merge this and make any of the changes I mentioned myself as a follow-up. (My WIP branch for that is arm64-merge.)
For the record here are the benchmark results I'm currently looking at, comparing the pure Go version against your arm64 asm version (plus my tweaks, which make approximately no difference).
A c6g instance (graviton2):
name old speed new speed delta
Sum64/4B-4 476MB/s ± 0% 676MB/s ± 0% +41.98% (p=0.001 n=7+7)
Sum64/16B-4 1.12GB/s ± 0% 1.61GB/s ± 0% +43.63% (p=0.001 n=7+7)
Sum64/100B-4 2.04GB/s ± 0% 2.07GB/s ± 0% +1.38% (p=0.001 n=7+7)
Sum64/4KB-4 3.28GB/s ± 0% 3.27GB/s ± 0% -0.09% (p=0.001 n=7+6)
Sum64/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.078 n=7+7)
Sum64String/4B-4 449MB/s ± 0% 491MB/s ± 0% +9.34% (p=0.001 n=7+6)
Sum64String/16B-4 1.09GB/s ± 0% 1.38GB/s ± 0% +26.67% (p=0.001 n=7+7)
Sum64String/100B-4 2.03GB/s ± 0% 2.08GB/s ± 0% +2.63% (p=0.001 n=7+7)
Sum64String/4KB-4 3.27GB/s ± 0% 3.27GB/s ± 0% -0.01% (p=0.001 n=6+7)
Sum64String/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.073 n=6+7)
DigestBytes/4B-4 210MB/s ± 0% 209MB/s ± 0% -0.36% (p=0.001 n=7+6)
DigestBytes/16B-4 673MB/s ± 0% 671MB/s ± 0% -0.41% (p=0.001 n=7+7)
DigestBytes/100B-4 1.84GB/s ± 0% 1.80GB/s ± 0% -2.53% (p=0.001 n=7+7)
DigestBytes/4KB-4 3.26GB/s ± 0% 3.26GB/s ± 0% -0.07% (p=0.001 n=7+7)
DigestBytes/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% ~ (p=0.805 n=7+7)
DigestString/4B-4 202MB/s ± 0% 201MB/s ± 0% -0.24% (p=0.001 n=7+7)
DigestString/16B-4 652MB/s ± 0% 649MB/s ± 0% -0.48% (p=0.001 n=7+7)
DigestString/100B-4 1.81GB/s ± 0% 1.71GB/s ± 0% -5.52% (p=0.001 n=7+7)
DigestString/4KB-4 3.26GB/s ± 0% 3.25GB/s ± 0% -0.31% (p=0.001 n=7+7)
DigestString/10MB-4 3.32GB/s ± 0% 3.32GB/s ± 0% +0.01% (p=0.036 n=7+6)
A c7g instance (graviton3):
name old speed new speed delta
Sum64/4B-4 876MB/s ± 0% 1154MB/s ± 0% +31.78% (p=0.001 n=7+7)
Sum64/16B-4 1.98GB/s ± 0% 3.79GB/s ± 0% +91.25% (p=0.001 n=7+7)
Sum64/100B-4 4.99GB/s ± 0% 7.99GB/s ± 0% +60.11% (p=0.001 n=7+6)
Sum64/4KB-4 11.0GB/s ± 0% 16.2GB/s ± 0% +46.58% (p=0.004 n=5+6)
Sum64/10MB-4 11.2GB/s ± 1% 17.0GB/s ± 3% +51.34% (p=0.001 n=7+7)
Sum64String/4B-4 844MB/s ± 0% 1075MB/s ± 0% +27.37% (p=0.001 n=6+7)
Sum64String/16B-4 1.94GB/s ± 0% 3.52GB/s ± 0% +81.44% (p=0.001 n=7+7)
Sum64String/100B-4 4.90GB/s ± 0% 7.47GB/s ± 1% +52.43% (p=0.001 n=7+7)
Sum64String/4KB-4 11.0GB/s ± 0% 16.1GB/s ± 0% +46.25% (p=0.001 n=6+7)
Sum64String/10MB-4 11.2GB/s ± 1% 17.2GB/s ± 1% +53.87% (p=0.001 n=7+7)
DigestBytes/4B-4 381MB/s ± 0% 376MB/s ± 0% -1.21% (p=0.001 n=7+7)
DigestBytes/16B-4 1.23GB/s ± 0% 1.24GB/s ± 0% +0.64% (p=0.002 n=6+6)
DigestBytes/100B-4 4.22GB/s ± 0% 4.30GB/s ± 0% +1.96% (p=0.003 n=5+7)
DigestBytes/4KB-4 10.1GB/s ± 0% 15.6GB/s ± 0% +54.83% (p=0.001 n=7+7)
DigestBytes/10MB-4 10.3GB/s ± 1% 16.6GB/s ± 1% +61.03% (p=0.001 n=7+6)
DigestString/4B-4 339MB/s ± 0% 336MB/s ± 0% -0.93% (p=0.008 n=5+5)
DigestString/16B-4 1.21GB/s ± 0% 1.23GB/s ± 0% +1.87% (p=0.001 n=7+6)
DigestString/100B-4 4.13GB/s ± 0% 4.16GB/s ± 0% +0.59% (p=0.001 n=7+7)
DigestString/4KB-4 10.1GB/s ± 0% 15.5GB/s ± 0% +54.32% (p=0.001 n=7+6)
DigestString/10MB-4 10.4GB/s ± 0% 16.5GB/s ± 1% +59.01% (p=0.001 n=6+7)
round0(x1)
ROR $64-27, h
EOR x1 @> 64-27, h, h
This is another place where it seems like you have a slightly convoluted sequence which I'm guessing is a performance optimization? To my mind, the obvious way to write these two lines would be
EOR x1, h
ROR $64-27, h
(and the same for many similar pairs of ROR/EOR below).
Same as before, my benchmarks showed essentially no difference with my simplified version. Were you seeing noticeable improvements from sequencing it this way?
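For reference, both orderings compute the same update; in Go terms (a throwaway sketch):

```
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	h := uint64(0x27D4EB2F165667C5)  // arbitrary accumulator value for the demo
	x1 := uint64(0x0123456789ABCDEF) // a lane that has already been through round0

	// Original order: ROR $64-27, h; EOR x1 @> 64-27, h, h
	a := bits.RotateLeft64(h, 27) ^ bits.RotateLeft64(x1, 27)

	// Proposed order: EOR x1, h; ROR $64-27, h
	b := bits.RotateLeft64(h^x1, 27)

	fmt.Println(a == b) // true: both are h = rol27(h ^ x1)
}
```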
Actually, this is
ROR $64-27, x1
EOR x1, h
except that it doesn't write to x1, but that doesn't matter here.
I don't recall benchmarking this specifically, but I've found that code density in general can matter a lot on smaller devices.
Actually, this is
ROR $64-27, x1
EOR x1, h
I don't think that's correct. I double-checked that my two instructions do the same thing as the original (effectively: h = rol27(h^x1)) but mine seems simpler because it's doing the xor first so it doesn't need the ROR addressing mode on the EOR instruction. Your two instructions above seem to be doing something different.
I've found that code density in general can matter a lot on smaller devices.
What do you mean by code density here? It's two instructions either way, right?
You're right. Sorry, I thought you meant only the last instruction.
After doing more thorough benchmarking I found that ordering the EOR after the ROR with a rotated register was a small, but measurable amount faster for some input sizes on some of my test machines. I ended up leaving that in.
Benchmark results on Raspberry Pi 4B, Linux, Go 1.17.1:
name                 old speed      new speed      delta
Sum64/4B-4           180MB/s ± 0%   251MB/s ± 0%   +39.13%  (p=0.000 n=10+10)
Sum64/100B-4         994MB/s ± 0%   1135MB/s ± 0%  +14.25%  (p=0.000 n=10+9)
Sum64/4KB-4          1.92GB/s ± 0%  1.93GB/s ± 0%  +0.43%   (p=0.000 n=10+10)
Sum64/10MB-4         1.88GB/s ± 0%  1.88GB/s ± 0%  ~        (p=0.754 n=10+10)
Sum64String/4B-4     133MB/s ± 4%   228MB/s ± 0%   +71.37%  (p=0.000 n=10+9)
Sum64String/100B-4   949MB/s ± 0%   1103MB/s ± 0%  +16.17%  (p=0.000 n=10+10)
Sum64String/4KB-4    1.92GB/s ± 0%  1.93GB/s ± 0%  +0.40%   (p=0.000 n=9+8)
Sum64String/10MB-4   1.88GB/s ± 0%  1.88GB/s ± 0%  ~        (p=0.146 n=10+8)
DigestBytes/4B-4     61.9MB/s ± 0%  61.9MB/s ± 0%  ~        (p=0.158 n=10+9)
DigestBytes/100B-4   695MB/s ± 0%   719MB/s ± 0%   +3.37%   (p=0.000 n=10+10)
DigestBytes/4KB-4    1.89GB/s ± 0%  1.90GB/s ± 0%  +0.43%   (p=0.000 n=9+10)
DigestBytes/10MB-4   1.88GB/s ± 0%  1.89GB/s ± 0%  +0.92%   (p=0.000 n=10+9)
DigestString/4B-4    58.9MB/s ± 0%  58.5MB/s ± 1%  -0.60%   (p=0.032 n=8+10)
DigestString/100B-4  669MB/s ± 0%   696MB/s ± 1%   +4.05%   (p=0.000 n=10+10)
DigestString/4KB-4   1.89GB/s ± 0%  1.89GB/s ± 0%  +0.34%   (p=0.000 n=10+10)
DigestString/10MB-4  1.88GB/s ± 0%  1.89GB/s ± 0%  +0.90%   (p=0.000 n=10+10)
Thanks very much for this contribution @greatroar! I merged this and released v2.2.0.
* [zstd] sync xxhash with final accepted patch upstream
  Syncs with cespare/xxhash#51
* asmfmt
Fixes #45, but I haven't yet benchmarked this on actual ARM hardware. On amd64 with qemu, I get the following results after cherry-picking #50 (I have a version that includes a variant of #42 as well):
I've yet to try whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there's a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.
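To make the unaligned-access point concrete, this is roughly what the equivalent 8-byte load looks like on the pure-Go path (illustrative sketch only):

```
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := []byte{0, 1, 2, 3, 4, 5, 6, 7, 8}

	// The pure-Go path reads input 8 bytes at a time like this. On arm64 the
	// compiler can lower it to a single 64-bit load with no alignment check,
	// so it makes the same unaligned-access assumption as the handwritten loads.
	u := binary.LittleEndian.Uint64(buf[1:9]) // deliberately misaligned slice
	fmt.Printf("%#x\n", u)
}
```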