zstd: add arm64 xxhash assembly #464
Conversation
@lizthegrey Thanks a bunch!
Is this a net benefit or speed-up? More of the test cases show a speed-up than a slowdown, but it's close, and the effect-size distribution is asymmetric: there are more large slowdowns than large speed-ups (say >2%). I see two 4% slowdowns, one 5%, one 6%, and an 8%. For speed-ups, I see one 16%, a 4%, and two 2%; most speed-ups are ≤1%. The xxHash project didn't accept this code because it didn't appear to offer a net speed-up; the results were too mixed. What's the benefit for compress? With the results this mixed, I'd weigh the different test cases by contextual relevance and impact on the use case, in this case the zstd implementation. Is that what you already did? More broadly, it looks like further performance improvement will require vectorization and other newer instructions. That's where assembly shines most, especially since the Go compiler makes little use of autovectorization and of newer instructions in general. I don't know whether Armv8.x or Armv9 has bit-manipulation instructions, but those are a good example of newer instructions that pay off and aren't SIMD. (They're in Haswell and later, and in the AMD counterparts.)
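For reference, a minimal Go benchmark sketch of the kind of per-size measurement being discussed, using the public github.com/cespare/xxhash/v2 module as a stand-in for the internal copy in this repo (the module choice and the input sizes are illustrative assumptions):

```go
package xxhashbench

import (
	"fmt"
	"math/rand"
	"testing"

	"github.com/cespare/xxhash/v2"
)

// BenchmarkSum64 measures xxhash.Sum64 at a few input sizes so the
// results can be weighted by how often each size appears in a real
// workload, rather than treating every test case as equally important.
func BenchmarkSum64(b *testing.B) {
	for _, size := range []int{64, 4 << 10, 128 << 10, 1 << 20} {
		buf := make([]byte, size)
		rand.Read(buf) // pseudo-random filler; content doesn't affect hashing throughput
		b.Run(fmt.Sprintf("%dB", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				_ = xxhash.Sum64(buf)
			}
		})
	}
}
```

Running this with `go test -bench Sum64 -count 10` on both the assembly and pure-Go builds and comparing with benchstat would make the per-size effect sizes explicit.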
@JoeUX What do you propose?
In our empirical testing and profiling on a fairly substantial workload, it was a net speed-up. But I'm always happy to put in the effort to iterate further and vectorize, now that we have a version we believe does the same thing the original Go code does.
I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change. |
Was that testing published somewhere else? The results published here show more sizable slowdowns than speed-ups.
No, sadly, these are the results of profiling our production Kafka Sarama -> Klauspost Compress workload, rather than something that can handily be distilled into a separate benchmark.
I don't know who "we" are. The speedup is minor, but seems to be an overall gain. If "we" are concerned, you can use the build tags to exclude the assembly. Unless you contribute your own benchmarks that prove the opposite, I consider this matter closed. |
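For anyone who does want to exclude the assembly, here is a hedged sketch of what the build-tag gating typically looks like, assuming this package follows the same constraint convention as the upstream xxhash code (the exact tag names are an assumption; check the constraints on the *_arm64.s files and their companion .go files in this repo):

```go
//go:build appengine || !gc || purego || noasm

// Illustrative sketch of how build tags typically gate the assembly in
// xxhash-style packages (the tag names are assumptions, not verified
// against this repo).
//
// A pure-Go file carries a constraint like the one above, while the
// assembly-backed file carries the inverse:
//
//	//go:build !appengine && gc && !purego && !noasm
//
// so building with `go build -tags purego ./...` (or -tags noasm)
// drops the assembly and compiles the portable implementation instead.

package xxhash
```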
I've solicited some help from the AWS Graviton team with further optimising this ASM. Even if there is a small regression on some practical workloads, my hope is that getting some mileage/confidence on the current ASM and then improving it will let us make the changes incrementally, rather than in one big bang that tries to convert and optimise at the same time. Our data does show a modest improvement, but hey, maybe we're not a representative workload :)
See cespare/xxhash#51 (comment). The issue is that whether you're using ASM or native Go, on Neoverse N1 and Cortex A72 (and older) you'll see bottlenecking on the floating-point units, but once that bottleneck is removed (Graviton3), the ASM is much faster. That definitely explains why benchmarking results are inconsistent on earlier hardware.
see cespare/xxhash#51