NEON backend for aarch64 #457

rubdos · 2022-12-08T12:14:03Z

As promised in #449! This is joint work with @Tarinn. In fact, @Tarinn did most of the work :-)

Fixes #147 for Aarch64, and you'll see an ARMv7 PR coming in after this.

I'm working on benchmarks across several devices now (including an unpublished very hacky armv7 version), will update here when they become available. We're seeing speedups of 20-30% in relevant benchmarks. This is the same code as in zkcrypto#19.

I'll be cleaning up the commits in the next few minutes (trailing white space fixes, mostly), and I'll run a comparison benchmark for 343be3a, because that was only introduced for future ARMv7 support, and has not been tested under load.

TODO:

Cleanup whitespace and commits
Test shuffle! call vs vqtbx1q_u8 performance
Add Aarch64 cross build to CI
Figure out how-the-💥 we get rid of the ARMv8 TBL/TBX issue
Add NEON documentation to README
Explain that ARMv7 is not yet supported

rubdos · 2022-12-08T14:13:51Z

The shuffle! macro use in 26c494f seems to win 1.5% to 3% over the use of the vqtbx1q_u8 intrinsic on a Raspberry Pi 4. Seems like the LLVM compiler people know what they're doing.

I'm testing another point where we can use the shuffle! macro instead of some manual lane swapping, will report back.

rubdos · 2022-12-08T15:00:03Z

2d343e9 does not have an influence on performance, but the code is quite a bit cleaner, so let's leave it in.

tarcieri · 2022-12-08T17:42:05Z

.github/workflows/rust.yml

+  test-simd:
+    name: Test simd backend (nightly, aarch64)
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v3
+    - uses: dtolnay/rust-toolchain@nightly
+      with:
+        toolchain: aarch64-unknown-linux-gnu
+    - run: cargo test --no-default-features --target aarch64-unknown-linux-gnu --features "std simd_backend"


How does this work without cross?

Rust has a built-in cross-compiler. I've never really used cross to cross-compile myself; always had to roll my own due to circumstances.

Probably, the underlaying Ubuntu machine needs to install a gcc toolchain to actually work, but I hoped that the CI would give me some feedback already.

A cross-compiler works for builds, but how does it run the tests on a non-native architecture?

cross uses QEMU

Oh okay, I did not spot that. The new version is only with cargo build, not with test, based on the new build-simd CI snippet.

I'd suggest using cross for this as it should be fine for testing NEON in CI.

Here are some examples (and it's probably fine to reuse the @RustCrypto cross-install workflow too)

https://github.com/RustCrypto/block-ciphers/blob/master/.github/workflows/aes.yml#L201-L249

Seems like that's done in release/4.0 now, so I'll just drop my own commit.

I'd still like to point out that when I submit the armv7hl NEON work, it will work on nightly only for a while, so you might want to add armv7-nightly to the test matrix.

tarcieri · 2022-12-08T19:13:38Z

@rubdos we're working off the releases/4.0 branch for development, so you'll definitely want to rebase this

jrose-signal · 2022-12-12T18:08:04Z

Don't forget to make sure it's still constant-time! I don't actually know what that entails; it is not something I've looked into before. But a good amount of effort was put into the initial version of this crate using constant-time operations, and new backends should adhere to that too.

rubdos · 2022-12-13T14:34:56Z

@jrose-signal Indeed, that's pretty important. As far as @Tarinn and I know, this code should be constant time. It is a direct port of the AVX2 work, and there should be no time-dependent calls in there. The most sketchy call that I could see is the shuffle!, but that does not take secret-dependent indexes.

I'm not sure how this is generally done. Should we ask some independent reviewer to have a look at the code?

pinkforest · 2022-12-13T14:35:04Z

fyi - we're continuing all development via in main -

Could this be re-based to main which is basically from release/4.0 - Thanks!

Also would we have any benchmarks between u64, neon and fiat ?
e.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

For constant time, e.g. subtle CsOption etc.

rubdos · 2022-12-13T14:41:39Z

fyi - we're continuing all development via in main -

Could this be re-based to main which is basically from release/4.0 - Thanks!

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Do we have any benchmarks between u64, neon and fiat ?
e.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

I'm working on this across a bunch of devices (including armv7). I'd like these results to appear in a small publication, but I'll certainly dump a summary in the README when I have them. I should have something by the end of the week.

pinkforest · 2022-12-13T14:44:41Z

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Yeah this is going to stay in main

Development / Maintenance / Release workflow #466

I'm working on this across a bunch of devices

Awesome! 🥳 would love to see a link to write-up in This-Week-in-Rust TWiR if you can share there too:
https://github.com/rust-lang/this-week-in-rust/

rubdos · 2022-12-13T14:47:31Z

Awesome! partying_face would love to see a link to write-up in This-Week-in-Rust TWiR if you can share there too: https://github.com/rust-lang/this-week-in-rust/

After publication, I'll make sure to make a huge amount of noise about it. That should be pretty short notice, I have some very strict deadlines about that now. Just to clarify: the deadline is there because this work is complete; there was no pressure on completion of this work that would have affected the quality.

pinkforest · 2022-12-13T17:27:35Z

src/backend/vector/neon/field.rs

+impl FieldElement2625x4 {
+
+    pub fn split(&self) -> [FieldElement51; 4] {
+        let mut out = [FieldElement51::zero(); 4];


Suggested change

let mut out = [FieldElement51::zero(); 4];

let mut out = [FieldElement51::ZERO; 4];

This was changed to ZERO

Use inherent constants for ZERO, ONE, and MINUS_ONE #470

pinkforest · 2022-12-13T17:30:26Z

src/backend/vector/neon/edwards.rs

+
+        macro_rules! print_var {
+            ($x:ident) => {
+                println!("{} = {:?}", stringify!($x), $x.to_bytes());


Suggested change

println!("{} = {:?}", stringify!($x), $x.to_bytes());

println!("{} = {:?}", stringify!($x), $x.as_bytes());

These were renamed to as_bytes()

pinkforest · 2022-12-13T17:36:26Z

src/backend/vector/neon/edwards.rs

+    use super::*;
+
+    fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
+        use backend::serial::u64::field::FieldElement51;


Suggested change

use backend::serial::u64::field::FieldElement51;

use crate::backend::serial::u64::field::FieldElement51;

pinkforest · 2022-12-13T18:02:08Z

Instead of flooding some changes here is some .patches:

src/backend/vector/neon/field.rs

diff --git a/src/backend/vector/neon/field.rs b/src/backend/vector/neon/field.rs
index 5cf57ee..85630cd 100644
--- a/src/backend/vector/neon/field.rs
+++ b/src/backend/vector/neon/field.rs
@@ -21,7 +21,7 @@
 //! arm instructions.
 
 use core::ops::{Add, Mul, Neg};
-use packed_simd::{u32x4, u32x2, i32x4, u8x16, u64x4, u64x2, IntoBits};
+use packed_simd::{u32x4, u32x2, i32x4, u64x4, u64x2, IntoBits};
 
 use crate::backend::vector::neon::constants::{P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO};
 use crate::backend::serial::u64::field::FieldElement51;
@@ -141,7 +141,7 @@ impl ConditionallySelectable for FieldElement2625x4 {
 impl FieldElement2625x4 {
 
     pub fn split(&self) -> [FieldElement51; 4] {
-        let mut out = [FieldElement51::zero(); 4];
+        let mut out = [FieldElement51::ZERO; 4];
         for i in 0..5 {
             let a_2i   = self.0[i].0.extract(0) as u64;
             let b_2i   = self.0[i].0.extract(1) as u64;
@@ -367,7 +367,9 @@ impl FieldElement2625x4 {
     #[inline]
     fn reduce64(mut z: [(u64x2, u64x2); 10]) -> FieldElement2625x4 {
 
+        #[allow(non_snake_case)]
         let LOW_25_BITS: u64x2 = u64x2::splat((1 << 25) - 1);
+        #[allow(non_snake_case)]
         let LOW_26_BITS: u64x2 = u64x2::splat((1 << 26) - 1);
 
         let carry = |z: &mut [(u64x2, u64x2); 10], i: usize| {
@@ -419,6 +421,7 @@ impl FieldElement2625x4 {
         ])
     }
 
+    #[allow(non_snake_case)]
     pub fn square_and_negate_D(&self) -> FieldElement2625x4 {
         #[inline(always)]
         fn m(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> u64x4 {
@@ -463,16 +466,16 @@ impl FieldElement2625x4 {
         let x8_19  = m_lo(v19, x8);
         let x9_19  = m_lo(v19, x9);
 
-        let mut z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
-        let mut z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
-        let mut z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
-        let mut z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
-        let mut z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
-        let mut z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
-        let mut z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
-        let mut z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
-        let mut z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
-        let mut z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
+        let z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
+        let z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
+        let z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
+        let z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
+        let z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
+        let z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
+        let z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
+        let z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
+        let z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
+        let z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
 
 
         let low__p37 = u64x4::splat(0x3ffffed << 37);
@@ -675,7 +678,7 @@ mod test {
 
     #[test]
     fn scale_by_curve_constants() {
-        let mut x = FieldElement2625x4::splat(&FieldElement51::one());
+        let mut x = FieldElement2625x4::splat(&FieldElement51::ONE);
 
         x = x * (121666, 121666, 2*121666, 2*121665);

src/backend/vector/neon/edwards.rs

diff --git a/src/backend/vector/neon/edwards.rs b/src/backend/vector/neon/edwards.rs
index 8073c3a..357d4e3 100644
--- a/src/backend/vector/neon/edwards.rs
+++ b/src/backend/vector/neon/edwards.rs
@@ -321,14 +321,14 @@ mod test {
     use super::*;
 
     fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
-        use backend::serial::u64::field::FieldElement51;
+        use crate::backend::serial::u64::field::FieldElement51;
 
         let (X1, Y1, Z1, T1) = (P.X, P.Y, P.Z, P.T);
         let (X2, Y2, Z2, T2) = (Q.X, Q.Y, Q.Z, Q.T);
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -411,8 +411,8 @@ mod test {
 
     #[test]
     fn vector_addition_vs_serial_addition_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing id +- id");
         let P = edwards::EdwardsPoint::identity();
@@ -440,7 +440,7 @@ mod test {
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -498,8 +498,8 @@ mod test {
 
     #[test]
     fn vector_doubling_vs_serial_doubling_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing [2]id");
         let P = edwards::EdwardsPoint::identity();
@@ -516,8 +516,8 @@ mod test {
 
     #[test]
     fn basepoint_odd_lookup_table_verify() {
-        use constants;
-        use backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
+        use crate::constants;
+        use crate::backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
 
         let basepoint_odd_table = NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
         println!("Testing basepoint table");

clippy is still whining but I'll send a .patch later - we are using the 2021 edition

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly build --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly clippy --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly bench --features rand_core

rubdos · 2022-12-14T09:07:10Z

I've done the rebase together with your patches, and squashed the NEON work into the first commit. I had not tested the rebase on 4.0, sorry about that. I'll do that now.

rubdos · 2022-12-14T09:12:19Z

Working on the rustfmt attributes now.

pinkforest · 2022-12-14T09:46:30Z

started a benchmark repo here -

I ran some benchmarks earlier (will re-run now) on Mac M1 Max Aarch64 comparing between backends FWIW

I'm going to re-organise it to cover wider range benchmarks between various impl's not just dalek's

rubdos · 2022-12-14T10:32:51Z

I'll see whether I'll have all the data in that format. I currently only retain the raw outputs from Criterion for further automatic processing for our paper.

rubdos · 2024-02-26T10:13:59Z

This took a bit longer than anticipated. We got a publication out, so academia-wise, we're all happy now. In between writing that paper and getting it out somewhere, there was quite some time, but I'm trying to sum up the current remaining issues with this branch in this post.

Results are especially spectacular on ARMv7/32-bit ARM CPUs (~50% speedups). Results are still great on ARMv8/aarch64 (~20%) until ARMv8.2-ish: when run in A64 mode, we get slow-downs relative to base. This is because ARM decided to nerf the TBL/TBX instructions relative to previous versions. The throughput remained the same, but the total latency of the instruction is now a function of the input size.

(shamelessly stolen from https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf)

We'll require a bit of in-depth looking into the assembly to figure out a way around this.
The TBL/TBX instruction comes from various vqtbx1q_u8 calls in FieldElement2625x4::blend. @Tarinn made a note that we can probably use a combination of vset/vget to get blend faster, but we don't currently have the hardware to reproduce this.

We sent some executables to @direc85 over in Finland to get the benchmark run on an Xperia 10 III, which is how we noted the issue. That's not really a tight debuggable loop. I should soon™ have an Xperia 10 IV to close the loop a bit.

Co-authored-by: pinkforest <[email protected]> Co-authored-by: Robrecht Blancquaert <[email protected]>

…5x4::shuffle Co-authored-by: Robrecht Blacquaert <[email protected]>

rubdos · 2024-02-28T12:19:14Z

(that's a disfunctional rebase, we still have to reintegrate in the new backend-selection)

pinkforest · 2024-02-28T12:31:00Z

Awesome -

For CI there is ARM already for cross - .github/workflows/cross.yml#L19

We also have simd/x86_64-AVX related tests in .github/worksflows/curve25519-dalek.yml#L118

But I think we could split backend tests into a separate file - qemu should support neon at this level - backends.yml for the sake of clarity and discoverability.

I'm re-working the build script so I could include a placeholder separately there so it doesn't cause conflict.

curve25519: Remove build-dep:platforms in favor of CARGO_CFG_TARGET_POINTER_WIDTH #636

Also nightly CI is currently failing - I've fixed it here:

Fix new nightly redundant import lint warns #638

jrose-signal · 2024-03-22T19:09:26Z

Today I idly tested this on my M1 Mac and the current state of the branch is a regression from the u64 backend. Changing the implementation of blend to use core::simd::Mask::select (with constant masks) got it back to about par with the u64 backend. I wasn't being super scientific with this (I was doing other things while the benchmark ran, for example, though nothing CPU-intensive), but it was a big enough gap to be indicative. There's probably something better than Mask, but just confirming that the TBL slowdown is serious. 😭

rubdos · 2024-03-23T07:35:03Z

There's probably something better than Mask, but just confirming that the TBL slowdown is serious. 😭

Yep, it's quite devastating indeed... If we've got news on that front, we'll post our findings as soon as we have them. Thanks for testing it out! ❤️

rubdos · 2024-05-22T14:54:41Z

@Tarinn made some progress by now on the aarch64 problem, but we're not there yet.

We thought we would have some speedup by working around llvm/llvm-project/issues/58323 (some vget instrinsics don't get correctly lowered to DUP/MOV instructions), but manually throwing in the assembly didn't give a measurable improvement on our Odroid N2+.

The manual injection of assembly for `vget_high_u32`/`vget_low_u32`

diff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..316963d 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
 use core::arch::aarch64::{self, vuzp1_u32};
 use core::ops::{Add, Mul, Neg};
 
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
 use crate::backend::serial::u64::field::FieldElement51;
 use crate::backend::vector::neon::constants::{
     P_TIMES_16_HI, P_TIMES_16_LO, P_TIMpatchdiff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..cccde8c 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
 use core::arch::aarch64::{self, vuzp1_u32};
 use core::ops::{Add, Mul, Neg};
 
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
 use crate::backend::serial::u64::field::FieldElement51;
 use crate::backend::vector::neon::constants::{
     P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO,
 };
 
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_high_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[1]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_low_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[0]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_high_u32;
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_low_u32;
+
 macro_rules! shuffle {
     ($vec0:expr, $vec1:expr, $index:expr) => {
         unsafe {
             core::mem::transmute::<[u32; 4], u32x4>(
                 *core::simd::simd_swizzle!(
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)), 
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)), 
-                    $index).as_array())
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)),
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)),
+                    $index
+                )
+                .as_array(),
+            )
         }
     };
 }
@@ -53,8 +90,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
     let b0: u32x2;
     let b1: u32x2;
     unsafe {
-        use core::arch::aarch64::vget_high_u32;
-        use core::arch::aarch64::vget_low_u32;
         a0 = vget_low_u32(src.0.into()).into();
         a1 = vget_low_u32(src.1.into()).into();
         b0 = vget_high_u32(src.0.into()).into();
@@ -72,7 +107,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
 fn repack_pair(x: (u32x4, u32x4), y: (u32x4, u32x4)) -> (u32x4, u32x4) {
     unsafe {
         use core::arch::aarch64::vcombine_u32;
-        use core::arch::aarch64::vget_low_u32;
         use core::arch::aarch64::vgetq_lane_u32;
         use core::arch::aarch64::vset_lane_u32;
 
@@ -223,7 +257,6 @@ impl FieldElement2625x4 {
         self.shuffle(Shuffle::BACD)
     }
 
-
     // Can probably be sped up using multiple vset/vget instead of table
     #[inline]
     pub fn blend(&self, other: FieldElement2625x4, control: Lanes) -> FieldElement2625x4 {
@@ -318,8 +351,6 @@ impl FieldElement2625x4 {
         let rotated_carryout = |v: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 use core::arch::aarch64::vqshlq_u32;
 
                 let c: (u32x4, u32x4) = (
@@ -327,16 +358,8 @@ impl FieldElement2625x4 {
                     vqshlq_u32(v.1.into(), shifts.1.into()).into(),
                 );
                 (
-                    vcombine_u32(
-                        vget_high_u32(c.0.into()),
-                        vget_low_u32(c.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_high_u32(c.1.into()),
-                        vget_low_u32(c.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_high_u32(c.0.into()), vget_low_u32(c.0.into())).into(),
+                    vcombine_u32(vget_high_u32(c.1.into()), vget_low_u32(c.1.into())).into(),
                 )
             }
         };
@@ -344,19 +367,9 @@ impl FieldElement2625x4 {
         let combine = |v_lo: (u32x4, u32x4), v_hi: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 (
-                    vcombine_u32(
-                        vget_low_u32(v_lo.0.into()),
-                        vget_high_u32(v_hi.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_low_u32(v_lo.1.into()),
-                        vget_high_u32(v_hi.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_low_u32(v_lo.0.into()), vget_high_u32(v_hi.0.into())).into(),
+                    vcombine_u32(vget_low_u32(v_lo.1.into()), vget_high_u32(v_hi.1.into())).into(),
                 )
             }
         };
@@ -386,7 +399,6 @@ impl FieldElement2625x4 {
         #[rustfmt::skip] // Retain formatting of return tuple
         let c9_19: (u32x4, u32x4) = unsafe {
             use core::arch::aarch64::vcombine_u32;
-            use core::arch::aarch64::vget_low_u32;
             use core::arch::aarch64::vmulq_n_u32;
 
             let c9_19_spread: (u32x4, u32x4) = (
@@ -475,8 +487,6 @@ impl FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -530,8 +540,6 @@ impl FieldElement2625x4 {
 
         let negate_D = |x_01: u64x4, p_01: u64x4| -> (u64x2, u64x2) {
             unsafe {
-                use core::arch::aarch64::vget_low_u32;
-                use core::arch::aarch64::vget_high_u32;
                 use core::arch::aarch64::vcombine_u32;
 
                 let x = x_01.0;
@@ -640,8 +648,6 @@ impl<'a, 'b> Mul<&'b FieldElement2625x4> for &'a FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -836,5 +842,3 @@ mod test {
         assert_eq!(x3, splits[3]);
     }
 }
-
-

Tarinn · 2024-08-21T11:18:45Z

Current state of this pull request gives speed-up on the previously problematic CPUs. It is less than the speed-up on older architectures, but still a minimum of ~5% and more often at least 10 to 12% on benchmarks that use the SIMD NEON back-end, in comparison to the serial back-end.
The latest commits to this branch have mainly focused on refactoring to use internal arm types (such as uint64x2x2_t) and trying to avoid the problematic llvm instructions.
I believe this makes the current branch usable, but more testing on different devices would be welcome. Maybe a patch of the vget_high intrinsic in llvm will result in even more speed-up, so that is something I will look into next.

Results of the benchmarks: target_a55.tar.gz

(merge conflicts should be resolved shortly)

rubdos force-pushed the dalek-neon branch from 343be3a to 26c494f Compare December 8, 2022 12:31

rubdos force-pushed the dalek-neon branch from 8f06b3c to 2d343e9 Compare December 8, 2022 14:15

tarcieri reviewed Dec 8, 2022

View reviewed changes

rubdos force-pushed the dalek-neon branch from 2d343e9 to 68c8173 Compare December 9, 2022 10:22

rubdos changed the base branch from main to release/4.0 December 9, 2022 10:24

rubdos force-pushed the dalek-neon branch from 68c8173 to 04125df Compare December 12, 2022 16:03

rubdos marked this pull request as ready for review December 12, 2022 16:04

pinkforest reviewed Dec 13, 2022

View reviewed changes

rubdos force-pushed the dalek-neon branch from 04125df to cf4da2c Compare December 14, 2022 08:51

rubdos changed the base branch from release/4.0 to main December 14, 2022 08:51

rubdos force-pushed the dalek-neon branch from cf4da2c to 408a00c Compare December 14, 2022 09:06

This was referenced Jun 1, 2023

CPU auto-detection feature / cfg gate #529

Closed

Runtime backend autodetection #523

Merged

tarcieri mentioned this pull request Jun 12, 2023

Simplifying backend selection #414

Closed

tarcieri mentioned this pull request Feb 6, 2024

Fix nightly build #619

Merged

Robrecht Blancquaert and others added 5 commits February 28, 2024 13:18

Added ARM neon backend support

cdffbbd

Co-authored-by: pinkforest <[email protected]> Co-authored-by: Robrecht Blancquaert <[email protected]>

Use packed_simd::shuffle instead of vqtbx1q_u8

13b52a0

Use shuffle! macro instead of manual lane swapping in FieldElement262…

14e05d4

…5x4::shuffle Co-authored-by: Robrecht Blacquaert <[email protected]>

Rustfmt: retain some manual formatting

58a853d

rustfmt on neon constants

c49e465

rubdos force-pushed the dalek-neon branch from 792ed9d to c49e465 Compare February 28, 2024 12:18

implemented packed_simd for neon

38807d0

Tarinn force-pushed the dalek-neon branch from cba1784 to 38807d0 Compare March 1, 2024 16:52

fixed minor mistake

d2b7b31

Small improvements to neon backend

1c070b3

Tarinn added 4 commits June 3, 2024 16:24

repack_pair optimisation

f61eb5a

fixed small bug in repack_pair

6cffc06

changed to use internal arm types instead of tuples

b7d2f53

further refactoring to internal arm types

d524791

rubdos added 2 commits August 21, 2024 13:46

Merge remote-tracking branch 'origin/main' into dalek-neon

5c15111

Clippy

2aa38c9

rubdos force-pushed the dalek-neon branch from 701e9b5 to 4718b18 Compare August 21, 2024 12:07

Rustfmt

98d09ae

rubdos force-pushed the dalek-neon branch from 4718b18 to 98d09ae Compare August 21, 2024 12:12

rubdos mentioned this pull request Aug 22, 2024

Dalek NEON v7 #691

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEON backend for aarch64 #457

NEON backend for aarch64 #457

rubdos commented Dec 8, 2022 •

edited

Loading

rubdos commented Dec 8, 2022 •

edited

Loading

rubdos commented Dec 8, 2022

tarcieri Dec 8, 2022

rubdos Dec 8, 2022

tarcieri Dec 8, 2022 •

edited

Loading

rubdos Dec 9, 2022

tarcieri Dec 9, 2022

rubdos Dec 12, 2022

tarcieri commented Dec 8, 2022

jrose-signal commented Dec 12, 2022

rubdos commented Dec 13, 2022

pinkforest commented Dec 13, 2022 •

edited

Loading

rubdos commented Dec 13, 2022

pinkforest commented Dec 13, 2022

rubdos commented Dec 13, 2022

pinkforest Dec 13, 2022

pinkforest Dec 13, 2022 •

edited

Loading

pinkforest Dec 13, 2022

pinkforest commented Dec 13, 2022

rubdos commented Dec 14, 2022

rubdos commented Dec 14, 2022

pinkforest commented Dec 14, 2022 •

edited

Loading

rubdos commented Dec 14, 2022

rubdos commented Feb 26, 2024

rubdos commented Feb 28, 2024

pinkforest commented Feb 28, 2024 •

edited

Loading

jrose-signal commented Mar 22, 2024

rubdos commented Mar 23, 2024

rubdos commented May 22, 2024 •

edited

Loading

Tarinn commented Aug 21, 2024

	let mut out = [FieldElement51::zero(); 4];
	let mut out = [FieldElement51::ZERO; 4];

	println!("{} = {:?}", stringify!($x), $x.to_bytes());
	println!("{} = {:?}", stringify!($x), $x.as_bytes());

	use backend::serial::u64::field::FieldElement51;
	use crate::backend::serial::u64::field::FieldElement51;

NEON backend for aarch64 #457

Are you sure you want to change the base?

NEON backend for aarch64 #457

Conversation

rubdos commented Dec 8, 2022 • edited Loading

rubdos commented Dec 8, 2022 • edited Loading

rubdos commented Dec 8, 2022

tarcieri Dec 8, 2022

Choose a reason for hiding this comment

rubdos Dec 8, 2022

Choose a reason for hiding this comment

tarcieri Dec 8, 2022 • edited Loading

Choose a reason for hiding this comment

rubdos Dec 9, 2022

Choose a reason for hiding this comment

tarcieri Dec 9, 2022

Choose a reason for hiding this comment

rubdos Dec 12, 2022

Choose a reason for hiding this comment

tarcieri commented Dec 8, 2022

jrose-signal commented Dec 12, 2022

rubdos commented Dec 13, 2022

pinkforest commented Dec 13, 2022 • edited Loading

rubdos commented Dec 13, 2022

pinkforest commented Dec 13, 2022

rubdos commented Dec 13, 2022

pinkforest Dec 13, 2022

Choose a reason for hiding this comment

pinkforest Dec 13, 2022 • edited Loading

Choose a reason for hiding this comment

pinkforest Dec 13, 2022

Choose a reason for hiding this comment

pinkforest commented Dec 13, 2022

rubdos commented Dec 14, 2022

rubdos commented Dec 14, 2022

pinkforest commented Dec 14, 2022 • edited Loading

rubdos commented Dec 14, 2022

rubdos commented Feb 26, 2024

rubdos commented Feb 28, 2024

pinkforest commented Feb 28, 2024 • edited Loading

jrose-signal commented Mar 22, 2024

rubdos commented Mar 23, 2024

rubdos commented May 22, 2024 • edited Loading

Tarinn commented Aug 21, 2024

rubdos commented Dec 8, 2022 •

edited

Loading

rubdos commented Dec 8, 2022 •

edited

Loading

tarcieri Dec 8, 2022 •

edited

Loading

pinkforest commented Dec 13, 2022 •

edited

Loading

pinkforest Dec 13, 2022 •

edited

Loading

pinkforest commented Dec 14, 2022 •

edited

Loading

pinkforest commented Feb 28, 2024 •

edited

Loading

rubdos commented May 22, 2024 •

edited

Loading