Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Neon support (pretty much working) #35

Merged
merged 34 commits into from
Aug 16, 2019

Conversation

sunnygleason
Copy link
Member

Hello hello!

I have been pulling some of your Neon intrinsics and porting the
simdjson neon code. Maybe it's useful! I'll keep improving it...
Comments welcome anytime!

All the best,

-Sunny

Cargo.toml Outdated Show resolved Hide resolved
src/neon/utf8check.rs Outdated Show resolved Hide resolved
src/value/generator.rs Outdated Show resolved Hide resolved
src/numberparse.rs Outdated Show resolved Hide resolved
@Licenser
Copy link
Member

Licenser commented Aug 8, 2019

That’s a great start!

@Licenser
Copy link
Member

Licenser commented Aug 9, 2019

We should have all the intrinsics (somewhat working) now! :D some are still brokenish however

src/neon/deser.rs Outdated Show resolved Hide resolved
src/neon/intrinsics.rs Outdated Show resolved Hide resolved
@sunnygleason
Copy link
Member Author

@Licenser I've uploaded some more work -- looking forward to your feedback! The intrinsics are mostly self-contained in intrinsics.rs; if you see any corrections/substitutions I can make, that would be awesome. I've mostly been just trying to get things to compile. I feel like we're getting pretty close once the intrinsics are valid!

@Licenser
Copy link
Member

Licenser commented Aug 9, 2019

Did a quick pass over your comments :) it's a bit late here so I'll take a closer look tomorrow.

@sunnygleason
Copy link
Member Author

@Licenser I think we're getting closer. My main issue now is how to get around these errors:

`error: `extern` block uses type `neon::intrinsics::int8x16_t` which is not FFI-safe: this struct has unspecified layout

Overall, seems like we're getting closer...

@sunnygleason
Copy link
Member Author

@Licenser ok, have it mostly building. Now I'm hitting issues in the compiler/link step I believe:

LLVM ERROR: Do not know how to scalarize the result of this operator!

src/neon/intrinsics.rs Outdated Show resolved Hide resolved
@Licenser
Copy link
Member

So on a aarch64 cpu this compiles!

@sunnygleason
Copy link
Member Author

@Licenser I think I'm at a point where I could use some help. I've cross-checked the implementation with the C++ version -- things seem mostly ok, but still only 21 tests pass. It would be great to have another set of eyes check this out! Thank you so much for your help.

@Licenser
Copy link
Member

Sure! There is a real chance that there is a bug in the intrinsics, who knows if I messed something up there :P it's not merged yet and for good reasons.

So I did a bit of the good old println debugging (or dbg! :P). It looks like things go bananas in flatten_bits.

With the input [1] (count2 test) the input and output is the following:

Before the call:

[src/neon/stage1.rs:521] &structurals = 7
[src/neon/stage1.rs:522] &idx = 64
[src/neon/stage1.rs:523] &structural_indexes = [
    0,
]

structurals is 7 (correct) 4 + 2 + 1 (first three bits, the first 3 characters are structurals, that's good!)

index: 64 since we had input length = 3 and we add 64 (we process that at a time) to the input (this might be where it geos wrong? but it's the same on avx and sse, so it should be right)

After the call:

[src/neon/stage1.rs:525] &structurals = 7
[src/neon/stage1.rs:526] &idx = 64
[src/neon/stage1.rs:527] &structural_indexes = [
    0,
    64,
    2,
    1,
]

this is after the call, indexes are 0, 64, 2 and 1. I suspect that is where we go oops, it should be 0, 1, 2, 3. I'll take a peak at flatten_bits and the intrinsics in it.

@sunnygleason
Copy link
Member Author

@Licenser that makes sense, I'm seeing the same thing - bad character at index 64. I have augmented my code with println debugging so I will be able to get back to it later this evening -- in the meantime, thank you so much for your help! :)

@Licenser
Copy link
Member

Not everything is working but the following diff makes quite a difference:

diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs
index a1e15a1..2af15e6 100644
--- a/src/neon/stage1.rs
+++ b/src/neon/stage1.rs
@@ -332,7 +332,7 @@ unsafe fn flatten_bits(base: &mut Vec<u32>, idx: u32, mut bits: u64) {
         let v3: i32 = static_cast_i32!(trailingzeroes(bits));
         bits &= bits.wrapping_sub(1);

-        let v: int32x4_t = int32x4_t::new(v3, v2, v1, v0);
+        let v: int32x4_t = int32x4_t::new(v0, v1, v2, v3);
         let v: int32x4_t = vaddq_s32(idx_64_v, v);

         std::ptr::write(base.as_mut_ptr().add(l) as *mut int32x4_t, v);

@Licenser
Copy link
Member

here is another one:

diff --git a/src/numberparse.rs b/src/numberparse.rs
index 425d16a..b44d320 100644
--- a/src/numberparse.rs
+++ b/src/numberparse.rs
@@ -133,10 +133,10 @@ pub enum Number {
 fn parse_eight_digits_unrolled(chars: &[u8]) -> u32 {
     let val: u64 = unsafe { *(chars.as_ptr() as *const u64) };
     //    memcpy(&val, chars, sizeof(u64));
-    let val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
-    let val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
+    let val = (val & 0x0F0F0F0F0F0F0F0F).wrapping_mul(2561) >> 8;
+    let val = (val & 0x00FF00FF00FF00FF).wrapping_mul(6553601) >> 16;

-    return ((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32) as u32;
+    return ((val & 0x0000FFFF0000FFFF).wrapping_mul(42949672960001) >> 32) as u32;
 }

 impl<'de> Deserializer<'de> {

With those two we're up to 66 passing tests.

That said we shouldn't generally fall back to the slow parse_eight_digits_unrolled, it should be architecture specific I think.

@Licenser
Copy link
Member

The remaining failing tests are:

    tests::crazy_string
    tests::map2
    tests::odd_array2
    tests::odd_array3
    tests::odd_array4
    tests::prop_json
    tests::prop_json_encode_decode
    tests::unicode
    tests::utf8

I gut feeling says it's unicode/utf8 related as all ood_* cases also include silly UTF8 strings.

@sunnygleason
Copy link
Member Author

That's awesome. The utf8 stuff has a lot of weirdness with 'u8' vs 'i8' vs 's8' and reinterpretation, so I'm not totally surprised. It could be reinterpretation issue, or me slightly modifying intrinsic(s). Pretty awesome to be at 66 passing tests, nice work!

@sunnygleason
Copy link
Member Author

(Also, I can confirm the 66 passing tests in my ARM environment)

@Licenser
Copy link
Member

All credit to you on the work I just swapped a * with a wrapping_mul (rust trivia: that would have worked in a release build AFAIK overflow checks are only enabled in dev / test builds) and changed the order of 4 variables :P I got the easy part!

@sunnygleason sunnygleason force-pushed the neon-support branch 2 times, most recently from f544f03 to faea015 Compare August 15, 2019 18:14
@sunnygleason
Copy link
Member Author

sunnygleason commented Aug 16, 2019

OK, this should be better!

@sunnygleason sunnygleason changed the title RFC: Neon support (work in progress/broken) RFC: Neon support (pretty much working) Aug 16, 2019
src/neon/stage1.rs Show resolved Hide resolved
#[cfg(target_feature = "avx2")]
const AVX2_PRESENT : bool = true;
#[cfg(not(target_feature = "avx2"))]
const AVX2_PRESENT : bool = false;
#[cfg(target_feature = "sse4.2")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those are not needed, we don't need to do runtime checks. you can simply put the whole thing in a block and guard the block with a #[cfg] directive :) that'll prevent us from having to do it over and over again.

The compiler might (should) optimize a if false away but having the #[cfg] guard around the code that isn't executed makes it clearer when reading the code that it's something that happens at compile not at runtime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to use macros; the trait modeling is slightly beyond my experience level since we're going outside the current module, advice welcome!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a diff at the end of this but I'll walk through the theory first:

So rust is quite powerful when it comes to generics and the dependant compilation of whole module trees.

We'll use that to our advantage to make the implementation zero overhead (and I don't mean nearly zero overhead but just have it slot in where it is requred at no cost at runtime not even a nop'ed out funciton call).

1st: We call the function the same in all implementations. This allows us to use a dependant import the same way we use it for the stage1 and other modules.

2nd: We pull the implementation out of the trait and make it generic over the writer. Generic here doesn't mean dynamic dispatch. Rust will compile the function for each generic variant that is used. In our case we make it generic over the writer (called W in the implementation).

3rd: NEON case (we ca reuse the neon_movemask from stage1 - I think this eventually should go into a file of shared functions but I wanted to keep the diff small.

4th: follow the same step for sse and avx.

result: No macros, no runtime overhead. Yay!

The whole rust generics, especially when mixed with dependant compilation and multiple architectures, take a some time to wrap your head around :D it's one of those things that just require you to do run in a wall 15000 times until it clicks!

diff --git a/src/neon/generator.rs b/src/neon/generator.rs
index 7e0c8b6..7c9eaad 100644
--- a/src/neon/generator.rs
+++ b/src/neon/generator.rs
@@ -1,32 +1,10 @@
+use crate::value::generator::ESCAPED;
+use std::io;
+use crate::neon::intrinsics::*;
+use crate::neon::stage1::neon_movemask;

-#[macro_export]
-macro_rules! process_32_bytes {
-    () => {
         #[inline(always)]
-        unsafe fn process_32_bytes(&mut self, _string: &mut &[u8], _len: &mut usize, _idx: &mut usize) -> io::Result < () > {
-            Ok(())
-        }
-    };
-}
-
-#[macro_export]
-macro_rules! process_16_bytes {
-    () => {
-        #[inline(always)]
-        unsafe fn process_16_bytes(&mut self, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> {
-            #[cfg_attr(not(feature = "no-inline"), inline(always))]
-            unsafe fn __neon_movemask(input: uint8x16_t) -> u16 {
-                let bit_mask = uint8x16_t::new(
-                    0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
-                    0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80
-                );
-                let minput: uint8x16_t = vandq_u8(input, bit_mask);
-                let tmp: uint8x16_t = vpaddq_u8(minput, minput);
-                let tmp = vpaddq_u8(tmp, tmp);
-                let tmp = vpaddq_u8(tmp, tmp);
-
-                vgetq_lane_u16(vreinterpretq_u16_u8(tmp), 0)
-            }
+        pub unsafe fn write_str_simd<W>(writer: &mut W, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> where W: std::io::Write {

             // The case where we have a 16+ byte block
             // we repeate the same logic as above but with
@@ -48,15 +26,15 @@ macro_rules! process_16_bytes {
                 // now.
                 let is_unchanged = vxorrq_u8(data, in_quote_range);
                 let in_range = vceqq_u8(is_unchanged, zero);
-                let quote_bits = __neon_movemask(vorrq_u8(bs_or_quote, in_range));
+                let quote_bits = neon_movemask(vorrq_u8(bs_or_quote, in_range));
                 if quote_bits != 0 {
                     let quote_dist = quote_bits.trailing_zeros() as usize;
-                    stry!(self.get_writer().write_all(&string[0..*idx + quote_dist]));
+                    stry!(writer.write_all(&string[0..*idx + quote_dist]));
                     let ch = string[*idx + quote_dist];
                     match ESCAPED[ch as usize] {
-                        b'u' => stry!(write!(self.get_writer(), "\\u{:04x}", ch)),
+                        b'u' => stry!(write!(writer, "\\u{:04x}", ch)),

-                        escape => stry!(self.write(&[b'\\', escape])),
+                        escape => stry!(writer.write_all(&[b'\\', escape])),
                     };
                     *string = &string[*idx + quote_dist + 1..];
                     *idx = 0;
@@ -65,9 +43,7 @@ macro_rules! process_16_bytes {
                     *idx += 16;
                 }
             }
-            stry!(self.get_writer().write_all(&string[0..*idx]));
+            stry!(writer.write_all(&string[0..*idx]));
             *string = &string[*idx..];
             Ok(())
         }
-    };
-}
\ No newline at end of file
diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs
index 0f00857..45322ce 100644
--- a/src/neon/stage1.rs
+++ b/src/neon/stage1.rs
@@ -18,7 +18,7 @@ macro_rules! bit_mask {
 }

 #[cfg_attr(not(feature = "no-inline"), inline(always))]
-unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
+pub(crate) unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
     let minput: uint8x16_t = vandq_u8(input, bit_mask!());
     let tmp: uint8x16_t = vpaddq_u8(minput, minput);
     let tmp = vpaddq_u8(tmp, tmp);
@@ -28,7 +28,7 @@ unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
 }

 #[cfg_attr(not(feature = "no-inline"), inline(always))]
-unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 {
+pub unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 {
     let bit_mask = bit_mask!();

     let t0 = vandq_u8(p0, bit_mask);
diff --git a/src/value.rs b/src/value.rs
index 0624587..c5d5aa8 100644
--- a/src/value.rs
+++ b/src/value.rs
@@ -11,7 +11,7 @@
 /// we do not require prior knowledge sbout string comtent to to take advantage
 /// of it.
 pub mod borrowed;
-mod generator;
+pub(crate) mod generator;
 pub mod owned;

 pub use self::borrowed::{to_value as to_borrowed_value, Value as BorrowedValue};
diff --git a/src/value/generator.rs b/src/value/generator.rs
index b053344..007acc1 100644
--- a/src/value/generator.rs
+++ b/src/value/generator.rs
@@ -12,19 +12,19 @@ use std::ptr;

 #[cfg(target_arch = "x86")]
 use std::arch::x86::*;
-#[cfg(any(target_feature = "sse4.2", target_feature = "avx2"))]
+#[cfg(target_arch = "x86_64")]
 use std::arch::x86_64::*;

 use crate::*;

 #[cfg(target_feature = "avx2")]
-pub use crate::avx2::generator::*;
+use crate::avx2::generator::*;

 #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), not(target_feature = "avx2")))]
-pub use crate::sse42::generator::*;
+use crate::sse42::generator::*;

 #[cfg(target_feature = "neon")]
-pub use crate::neon::generator::*;
+use crate::neon::generator::*;

 const QU: u8 = b'"';
 const BS: u8 = b'\\';
@@ -37,7 +37,7 @@ const UU: u8 = b'u';
 const __: u8 = 0;

 // Look up table for characters that need escaping in a product string
-static ESCAPED: [u8; 256] = [
+pub(crate) static ESCAPED: [u8; 256] = [
     // 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
     UU, UU, UU, UU, UU, UU, UU, UU, BB, TT, NN, UU, FF, RR, UU, UU, // 0
     UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, // 1
@@ -113,8 +113,7 @@ pub trait BaseGenerator {
             // quote characters that gives us a bitmask of 0x1f for that
             // region, only quote (`"`) and backslash (`\`) are not in
             // this range.
-            stry!(self.process_32_bytes(&mut string, &mut len, &mut idx));
-            stry!(self.process_32_bytes(&mut string, &mut len, &mut idx));
+            stry!(write_str_simd(self.get_writer(), &mut string, &mut len, &mut idx));
         }
         // Legacy code to handle the remainder of the code
         for (index, ch) in string.iter().enumerate() {
@@ -127,12 +126,6 @@ pub trait BaseGenerator {
         self.write_char(b'"')
     }

-    // 32-byte generation implementation
-    process_32_bytes!();
-
-    // 16-byte generation implementation
-    process_16_bytes!();
-
     #[inline(always)]
     fn write_float(&mut self, num: f64) -> io::Result<()> {
         let mut buffer = ryu::Buffer::new();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is amazing 🚀 ... integrating!

src/value/generator.rs Outdated Show resolved Hide resolved
src/value/generator.rs Outdated Show resolved Hide resolved
src/value/generator.rs Outdated Show resolved Hide resolved
src/neon/stage1.rs Show resolved Hide resolved
@sunnygleason
Copy link
Member Author

@Licenser ok this is weird, arm64 passes for me in EC2 but not on drone. Are you able to repro the test failure?

src/neon/stage1.rs Outdated Show resolved Hide resolved
@sunnygleason
Copy link
Member Author

@Licenser success!

@Licenser
Copy link
Member

We are getting there :D

@Licenser
Copy link
Member

Now I just need to get the last 5 intrinsics working :/ I split the MR in the hope that we can get 90% merged while I try to wrangle the remaining ones.

src/value/generator.rs Outdated Show resolved Hide resolved
src/value/generator.rs Outdated Show resolved Hide resolved
src/sse42/generator.rs Outdated Show resolved Hide resolved
src/neon/generator.rs Outdated Show resolved Hide resolved
src/avx2/generator.rs Outdated Show resolved Hide resolved
@sunnygleason
Copy link
Member Author

@Licenser ok, hopefully it works! :)

Copy link
Member

@Licenser Licenser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incredible work! :D

@sunnygleason sunnygleason merged commit 9f7b8a9 into simd-lite:arm Aug 16, 2019
sunnygleason pushed a commit that referenced this pull request Sep 4, 2019
* Put something in the readme so we can have a PR
* Add drone file
* update build status
* unguard for sse4.2 to allow rust to polyfill on older platforms
* Add more simd tests
* RFC: Neon support (pretty much working) (#35)
* feat: neon support
* feat: temp stub replacements for neon intrinsics (pending rust-lang/stdarch#792)
* fix: drone CI rustup nightly
* feat: fix guards, use rust stdlib for bit count operations
* fix: remove double semicolon
* feat: fancy generic generator functions, thanks @Licenser
* Update extq intrinsics
* Use simd-lite (#39)
* Use simd-lite
* Update badge
* Update badge
* Get rid of transmutes
* Use NeonInit trait
* vqsubq_u8 fix
* vqsubq_u8 fix pt. 2
* use reexprted values from simd-lite
* add simd-lite real version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants