Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM NEON support #32

Merged
merged 9 commits into from
Sep 4, 2019
Merged

ARM NEON support #32

merged 9 commits into from
Sep 4, 2019

Conversation

Licenser
Copy link
Member

No description provided.

@vielmetti
Copy link

@Licenser -

When you have something that builds, please let me and @WorksOnArm know - would love to provide test cycles and diverse hardware to check out performance on.

@Licenser
Copy link
Member Author

Thank you so much :D that's awesome!

@Licenser
Copy link
Member Author

Nooo! it seems like we're going to be blocked on missing intrinsics :(

https://doc.rust-lang.org/core/arch/aarch64/index.html / rust-lang/stdarch#148

@vielmetti
Copy link

@Licenser Do you have an inventory yet of intrinsics that you need / intrinsics that are missing? Reading the linked issue, sounds like there's slow progress.

@Licenser
Copy link
Member Author

Licenser commented Jul 31, 2019

Ah yes I made a list and then posted it to the wrong ticket ... silly me ...

Those are the intrinsics I found in @lemire's arm64 implementation

  • vaddq_s8
  • vandq_u8
  • vceqq_s8
  • vceqq_u8
  • vcgtq_s8
  • vcleq_u8
  • vdupq_n_s8
  • vdupq_n_u8
  • vextq_s8
  • vget_lane_u64 (broken implementation have not figured it out so far but v.X works)
  • vgetq_lane_u16 (broken implementation have not figured it out so far but v.X works)
  • vgetq_lane_u32 (broken implementation have not figured it out so far but v.X works)
  • vgetq_lane_u64 (broken implementation have not figured it out so far but v.X works)
  • vld1q_s8
  • vld1q_u8
  • vmovq_n_u8
  • vmull_p64
  • vorrq_s8
  • vorrq_u8
  • vpaddq_u8
  • vqmovn_u64
  • vqsubq_u8
  • vqtbl1q_s8
  • vqtbl1q_u8
  • vreinterpret_u64_u32
  • vreinterpretq_s8_u8
  • vreinterpretq_u16_u8
  • vreinterpretq_u32_u8
  • vreinterpretq_u64_u8
  • vreinterpretq_u8_s8
  • vshrq_n_u8
  • vst1q_u8

(there is a full list of missing instructions on the rust ticket - those are the required ones for porting simdjson.rs)

@sunnygleason
Copy link
Member

@Licenser alas! Would you be open to the possibility of a PR that uses assembly macros in the meantime? Maybe it won't be that far off from the intrinsic version...

@Licenser
Copy link
Member Author

Licenser commented Aug 1, 2019

Absolutely, I also gave you contributor permission so no or required;) I might take a look on the weekend to see what is required to get the intrinsics at least into nighly

@sunnygleason
Copy link
Member

@Licenser that's awesome - I'll take a pass at defining some intrinsics in 'src/neon/intrinsics.rs', and we can compare notes as you work with nightly!

@Licenser
Copy link
Member Author

Licenser commented Aug 1, 2019

I started working on a pull request: rust-lang/stdarch#792

@lemire
Copy link

lemire commented Aug 1, 2019

We just published simdjson 0.2.0 with NEON support...

@Licenser
Copy link
Member Author

Licenser commented Aug 1, 2019

Huzza!

Licenser and others added 2 commits August 16, 2019 18:20
* feat: neon support
* feat: temp stub replacements for neon intrinsics (pending rust-lang/stdarch#792)
* fix: drone CI rustup nightly
* feat: fix guards, use rust stdlib for bit count operations
* fix: remove double semicolon
* feat: fancy generic generator functions, thanks @Licenser
@Licenser
Copy link
Member Author

OMG OMG OMG! this is great! :D

@sunnygleason
Copy link
Member

@Licenser are you thinking we might be able to merge this today and then have a subsequent PR to delete the intrinsics once everything's available in nightly? Thank you again for all your help. PS the new UTF8 tests look great!

@Licenser
Copy link
Member Author

I'd rather not, I could see that in resulting in some headache downstream if the intrinsics make it in and that'd be very, very, very hacky for a crate.

That said brave people ca already use it as a git dependency by pointing to the git branch.

@sunnygleason
Copy link
Member

Ah, that makes sense. What is taking you so long?!! ;)

@Licenser
Copy link
Member Author

Licenser commented Aug 16, 2019

@sunnygleason
Copy link
Member

sunnygleason commented Aug 17, 2019

Maybe I found something? Let me know what you think... https://godbolt.org/z/36hnUE

#[cfg(target_arch = "aarch64")]
#[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))]
#[rustc_args_required_const(1)]
pub unsafe fn vget_lane_u8(a: uint8x8_t, n: u32) -> u8 {
    if n < 0 || n > 7 {
        unreachable_unchecked();
    };
    match n {
        0 => a.0,
        1 => a.1,
        2 => a.2,
        3 => a.3,
        4 => a.4,
        5 => a.5,
        6 => a.6,
        7 => a.7,
        _ => unreachable_unchecked()
    }
}

image

(Also clang: https://clang.godbolt.org/z/TpqJIp)

@sunnygleason
Copy link
Member

@Licenser I think your vld1q is all set, since the intrinsic turns into ldr anyway?

image

image

@Licenser
Copy link
Member Author

Oh that's a very good catch! then the ld1 commands are indeed done :D

for vget_lane_u8 yes that one is possiuble, the other ones are that cause me headache :(

@Licenser
Copy link
Member Author

This leaves only those two functions:

// uint64_t vget_lane_u64 (uint64x1_t v, const int lane)
arm_vget_lane!(vget_lane_u64, uint64x1_t, u64, 0);

#[simd_test(enable = "neon")]
unsafe fn test_vget_lane_u64() {
    let v = i64x1::new(1);
    let lane = 0;
    let r = vget_lane_u64(transmute(v), lane);
    assert_eq!(r, 1);
}


// uint32_t vgetq_lane_u32 (uint32x4_t v, const int lane)
arm_vget_lane!(vgetq_lane_u32, uint32x4_t, u32, 3);

#[simd_test(enable = "neon")]
unsafe fn test_vgetq_lane_u32() {
    let v = i32x4::new(1, 2, 3, 4);
    let lane = 1;
    let r = vgetq_lane_u32(transmute(v), lane);
    assert_eq!(r, 2);
}

@Licenser
Copy link
Member Author

vget_lane_u64 is done, I took your approach Sunny and verified that gcc also turns it into a fmov instead of a vmov commented and published it.

@sunnygleason
Copy link
Member

@Licenser that's awesome work... very nice!

I think some of the "ldr" confusion is because the operands are
in memory versus a register, and optimization gets in the way
for constants. I made this example with system time to prevent
optimization and show the "st1" usage: https://godbolt.org/z/SK6A7j

Does this look good? Let me know what you think!

All the best, -Sunny

Licenser and others added 2 commits September 3, 2019 16:12
* Use simd-lite
* Update badge
* Update badge
* Get rid of transmutes
* Use NeonInit trait
* vqsubq_u8 fix
* vqsubq_u8 fix pt. 2
* use reexprted values from simd-lite
@sunnygleason sunnygleason marked this pull request as ready for review September 4, 2019 14:53
@sunnygleason sunnygleason merged commit 1954f9b into master Sep 4, 2019
@Licenser Licenser deleted the arm branch October 7, 2019 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants