blake2b: Add aarch64 NEON acceleration #1415

Closed
wants to merge 1 commit into from

r1mikey commented Oct 2, 2024

Import NEON acceleration from blake2b-ref. Copyrights reflect those in the upstream files, specifically:
https://github.com/BLAKE2/BLAKE2/blob/master/neon/blake2b-neon.c

I've restricted this to aarch64 as I do not have an environment in which to test armv7-a support.

Successfully built and run on both aarch64 (Nvidia Grace) and x86_64. Tested on both with make verify.

Author

r1mikey commented Oct 2, 2024

Let's hold on merging this while I do some benchmarking as well.

Owner

jedisct1 commented Oct 2, 2024

Something like this would be required after including <arm_neon.h>:

#    ifdef __clang__
#        pragma clang attribute push(__attribute__((target("neon"))), apply_to = function)
#    elif defined(__GNUC__)
#        pragma GCC target("+simd")
#    endif
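
A minimal sketch of how that guard might sit in a source file, assuming the file is also compiled on non-Arm targets. `add_u64x2` is a hypothetical helper used only for illustration and is not code from this PR; the scalar fallback is likewise an assumption. With clang, the `push` needs a matching `pop` once the NEON functions have been defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#    include <arm_neon.h>
#    ifdef __clang__
#        pragma clang attribute push(__attribute__((target("neon"))), apply_to = function)
#    elif defined(__GNUC__)
#        pragma GCC target("+simd")
#    endif
#endif

/* Hypothetical helper (not from the PR): out[i] = a[i] + b[i] for two
 * 64-bit lanes, with the usual unsigned wrap-around on overflow. */
void
add_u64x2(uint64_t out[2], const uint64_t a[2], const uint64_t b[2])
{
    assert(out != NULL && a != NULL && b != NULL);
#if defined(__aarch64__)
    /* NEON path: load both lanes, add them, store the result. */
    vst1q_u64(out, vaddq_u64(vld1q_u64(a), vld1q_u64(b)));
#else
    /* Scalar fallback so the file still builds on other targets. */
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
#endif
}

#if defined(__aarch64__) && defined(__clang__)
/* Close the attribute scope opened by the push above. */
#    pragma clang attribute pop
#endif
```

The GCC pragma has no push/pop pair here; it stays in effect for the rest of the translation unit.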

Author

r1mikey commented Oct 3, 2024

Based on my benchmarks, this is not worth pursuing for 64-bit Arm systems. The NEON implementation performs at around 66% of the C implementation. There are a few reasons for this:

  • Instruction density is 89% of the C implementation (good)
  • Instructions per cycle is at 46% of the C implementation (very bad)
  • 54.6% of slots are backend bound (vs 21.3% for the C implementation - also very bad)
  • The benchmark completes in 10,718.84 msec compared to 5,542.82 msec for the C implementation
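
As a rough cross-check of the figures above, the sketch below assumes that "instruction density" means the NEON build retires 89% as many instructions as the C build, and that cycles are approximately instructions divided by IPC. Both function names are hypothetical. Under those assumptions the predicted slowdown is 0.89 / 0.46 ≈ 1.93, which is close to the measured wall-clock ratio of 10,718.84 / 5,542.82 ≈ 1.93.

```c
#include <assert.h>

/* Predicted cycle count of NEON relative to C, assuming
 * cycles = instructions / IPC and taking both inputs as NEON/C ratios. */
double
relative_cycles(double instr_ratio, double ipc_ratio)
{
    assert(ipc_ratio > 0.0);
    return instr_ratio / ipc_ratio;
}

/* Measured wall-clock time of NEON relative to C. */
double
relative_time(double neon_msec, double c_msec)
{
    assert(c_msec > 0.0);
    return neon_msec / c_msec;
}
```

Calling `relative_cycles(0.89, 0.46)` and `relative_time(10718.84, 5542.82)` gives two values that agree to within about 0.1%, so the lower IPC alone appears to account for the longer run time.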

r1mikey closed this Oct 3, 2024

Owner

jedisct1 commented Oct 3, 2024

Thanks for having taken the time to run these benchmarks!
