RFC: Neon support (pretty much working) #35

sunnygleason · 2019-08-08T02:54:23Z

Hello hello!

I have been pulling some of your Neon intrinsics and porting the
simdjson neon code. Maybe it's useful! I'll keep improving it...
Comments welcome anytime!

All the best,

-Sunny

Cargo.toml

src/neon/intrinsics.rs

src/neon/simd.rs

src/neon/intrinsics.rs

src/neon/simd_llvm.rs

src/neon/utf8check.rs

src/portability.rs

src/value/generator.rs

src/numberparse.rs

Licenser · 2019-08-08T05:45:33Z

That’s a great start!

Licenser · 2019-08-09T22:24:32Z

We should have all the intrinsics (somewhat working) now! :D some are still brokenish however

src/neon/deser.rs

src/neon/intrinsics.rs

src/value/generator.rs

sunnygleason · 2019-08-09T23:05:21Z

@Licenser I've uploaded some more work -- looking forward to your feedback! The intrinsics are mostly self-contained in intrinsics.rs; if you see any corrections/substitutions I can make, that would be awesome. I've mostly been just trying to get things to compile. I feel like we're getting pretty close once the intrinsics are valid!

Licenser · 2019-08-09T23:15:54Z

Did a quick pass over your comments :) it's a bit late here so I'll take a closer look tomorrow.

sunnygleason · 2019-08-10T12:12:02Z

@Licenser I think we're getting closer. My main issue now is how to get around these errors:

`error: `extern` block uses type `neon::intrinsics::int8x16_t` which is not FFI-safe: this struct has unspecified layout

Overall, seems like we're getting closer...

sunnygleason · 2019-08-10T13:40:49Z

@Licenser ok, have it mostly building. Now I'm hitting issues in the compiler/link step I believe:

LLVM ERROR: Do not know how to scalarize the result of this operator!

src/neon/intrinsics.rs

Licenser · 2019-08-10T19:40:26Z

So on a aarch64 cpu this compiles!

sunnygleason · 2019-08-13T04:32:04Z

@Licenser I think I'm at a point where I could use some help. I've cross-checked the implementation with the C++ version -- things seem mostly ok, but still only 21 tests pass. It would be great to have another set of eyes check this out! Thank you so much for your help.

Licenser · 2019-08-13T13:54:07Z

Sure! There is a real chance that there is a bug in the intrinsics, who knows if I messed something up there :P it's not merged yet and for good reasons.

So I did a bit of the good old println debugging (or dbg! :P). It looks like things go bananas in flatten_bits.

With the input [1] (count2 test) the input and output is the following:

Before the call:

[src/neon/stage1.rs:521] &structurals = 7
[src/neon/stage1.rs:522] &idx = 64
[src/neon/stage1.rs:523] &structural_indexes = [
    0,
]

structurals is 7 (correct) 4 + 2 + 1 (first three bits, the first 3 characters are structurals, that's good!)

index: 64 since we had input length = 3 and we add 64 (we process that at a time) to the input (this might be where it geos wrong? but it's the same on avx and sse, so it should be right)

After the call:

[src/neon/stage1.rs:525] &structurals = 7
[src/neon/stage1.rs:526] &idx = 64
[src/neon/stage1.rs:527] &structural_indexes = [
    0,
    64,
    2,
    1,
]

this is after the call, indexes are 0, 64, 2 and 1. I suspect that is where we go oops, it should be 0, 1, 2, 3. I'll take a peak at flatten_bits and the intrinsics in it.

sunnygleason · 2019-08-13T14:56:55Z

@Licenser that makes sense, I'm seeing the same thing - bad character at index 64. I have augmented my code with println debugging so I will be able to get back to it later this evening -- in the meantime, thank you so much for your help! :)

Licenser · 2019-08-13T14:57:44Z

Not everything is working but the following diff makes quite a difference:

diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs
index a1e15a1..2af15e6 100644
--- a/src/neon/stage1.rs
+++ b/src/neon/stage1.rs
@@ -332,7 +332,7 @@ unsafe fn flatten_bits(base: &mut Vec<u32>, idx: u32, mut bits: u64) {
         let v3: i32 = static_cast_i32!(trailingzeroes(bits));
         bits &= bits.wrapping_sub(1);

-        let v: int32x4_t = int32x4_t::new(v3, v2, v1, v0);
+        let v: int32x4_t = int32x4_t::new(v0, v1, v2, v3);
         let v: int32x4_t = vaddq_s32(idx_64_v, v);

         std::ptr::write(base.as_mut_ptr().add(l) as *mut int32x4_t, v);

Licenser · 2019-08-13T15:08:21Z

here is another one:

diff --git a/src/numberparse.rs b/src/numberparse.rs
index 425d16a..b44d320 100644
--- a/src/numberparse.rs
+++ b/src/numberparse.rs
@@ -133,10 +133,10 @@ pub enum Number {
 fn parse_eight_digits_unrolled(chars: &[u8]) -> u32 {
     let val: u64 = unsafe { *(chars.as_ptr() as *const u64) };
     //    memcpy(&val, chars, sizeof(u64));
-    let val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
-    let val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
+    let val = (val & 0x0F0F0F0F0F0F0F0F).wrapping_mul(2561) >> 8;
+    let val = (val & 0x00FF00FF00FF00FF).wrapping_mul(6553601) >> 16;

-    return ((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32) as u32;
+    return ((val & 0x0000FFFF0000FFFF).wrapping_mul(42949672960001) >> 32) as u32;
 }

 impl<'de> Deserializer<'de> {

With those two we're up to 66 passing tests.

That said we shouldn't generally fall back to the slow parse_eight_digits_unrolled, it should be architecture specific I think.

Licenser · 2019-08-13T15:12:41Z

The remaining failing tests are:

    tests::crazy_string
    tests::map2
    tests::odd_array2
    tests::odd_array3
    tests::odd_array4
    tests::prop_json
    tests::prop_json_encode_decode
    tests::unicode
    tests::utf8

I gut feeling says it's unicode/utf8 related as all ood_* cases also include silly UTF8 strings.

sunnygleason · 2019-08-13T15:15:15Z

That's awesome. The utf8 stuff has a lot of weirdness with 'u8' vs 'i8' vs 's8' and reinterpretation, so I'm not totally surprised. It could be reinterpretation issue, or me slightly modifying intrinsic(s). Pretty awesome to be at 66 passing tests, nice work!

sunnygleason · 2019-08-13T15:16:23Z

(Also, I can confirm the 66 passing tests in my ARM environment)

Licenser · 2019-08-13T15:29:50Z

All credit to you on the work I just swapped a * with a wrapping_mul (rust trivia: that would have worked in a release build AFAIK overflow checks are only enabled in dev / test builds) and changed the order of 4 variables :P I got the easy part!

sunnygleason · 2019-08-16T02:40:59Z

OK, this should be better!

let me know what the strategy should be re: nightly & merging
arch refactor preview here : feat: architecture refactoring for greater similarity sunnygleason/simdjson-rs#1
I'll stay tuned for any other comments/questions/concerns, thanks again!

…x86/x86_64 platforms

src/neon/stage1.rs

Licenser · 2019-08-16T15:19:20Z

src/value/generator.rs

 #[cfg(target_feature = "avx2")]
 const AVX2_PRESENT : bool = true;
 #[cfg(not(target_feature = "avx2"))]
 const AVX2_PRESENT : bool = false;
+#[cfg(target_feature = "sse4.2")]


those are not needed, we don't need to do runtime checks. you can simply put the whole thing in a block and guard the block with a #[cfg] directive :) that'll prevent us from having to do it over and over again.

The compiler might (should) optimize a if false away but having the #[cfg] guard around the code that isn't executed makes it clearer when reading the code that it's something that happens at compile not at runtime.

updated to use macros; the trait modeling is slightly beyond my experience level since we're going outside the current module, advice welcome!

I'll add a diff at the end of this but I'll walk through the theory first:

So rust is quite powerful when it comes to generics and the dependant compilation of whole module trees.

We'll use that to our advantage to make the implementation zero overhead (and I don't mean nearly zero overhead but just have it slot in where it is requred at no cost at runtime not even a nop'ed out funciton call).

1st: We call the function the same in all implementations. This allows us to use a dependant import the same way we use it for the stage1 and other modules.

2nd: We pull the implementation out of the trait and make it generic over the writer. Generic here doesn't mean dynamic dispatch. Rust will compile the function for each generic variant that is used. In our case we make it generic over the writer (called W in the implementation).

3rd: NEON case (we ca reuse the neon_movemask from stage1 - I think this eventually should go into a file of shared functions but I wanted to keep the diff small.

4th: follow the same step for sse and avx.

result: No macros, no runtime overhead. Yay!

The whole rust generics, especially when mixed with dependant compilation and multiple architectures, take a some time to wrap your head around :D it's one of those things that just require you to do run in a wall 15000 times until it clicks!

diff --git a/src/neon/generator.rs b/src/neon/generator.rs index 7e0c8b6..7c9eaad 100644 --- a/src/neon/generator.rs +++ b/src/neon/generator.rs @@ -1,32 +1,10 @@ +use crate::value::generator::ESCAPED; +use std::io; +use crate::neon::intrinsics::*; +use crate::neon::stage1::neon_movemask; -#[macro_export] -macro_rules! process_32_bytes { - () => { #[inline(always)] - unsafe fn process_32_bytes(&mut self, _string: &mut &[u8], _len: &mut usize, _idx: &mut usize) -> io::Result < () > { - Ok(()) - } - }; -} - -#[macro_export] -macro_rules! process_16_bytes { - () => { - #[inline(always)] - unsafe fn process_16_bytes(&mut self, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> { - #[cfg_attr(not(feature = "no-inline"), inline(always))] - unsafe fn __neon_movemask(input: uint8x16_t) -> u16 { - let bit_mask = uint8x16_t::new( - 0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, - 0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 - ); - let minput: uint8x16_t = vandq_u8(input, bit_mask); - let tmp: uint8x16_t = vpaddq_u8(minput, minput); - let tmp = vpaddq_u8(tmp, tmp); - let tmp = vpaddq_u8(tmp, tmp); - - vgetq_lane_u16(vreinterpretq_u16_u8(tmp), 0) - } + pub unsafe fn write_str_simd<W>(writer: &mut W, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> where W: std::io::Write { // The case where we have a 16+ byte block // we repeate the same logic as above but with @@ -48,15 +26,15 @@ macro_rules! process_16_bytes { // now. let is_unchanged = vxorrq_u8(data, in_quote_range); let in_range = vceqq_u8(is_unchanged, zero); - let quote_bits = __neon_movemask(vorrq_u8(bs_or_quote, in_range)); + let quote_bits = neon_movemask(vorrq_u8(bs_or_quote, in_range)); if quote_bits != 0 { let quote_dist = quote_bits.trailing_zeros() as usize; - stry!(self.get_writer().write_all(&string[0..*idx + quote_dist])); + stry!(writer.write_all(&string[0..*idx + quote_dist])); let ch = string[*idx + quote_dist]; match ESCAPED[ch as usize] { - b'u' => stry!(write!(self.get_writer(), "\\u{:04x}", ch)), + b'u' => stry!(write!(writer, "\\u{:04x}", ch)), - escape => stry!(self.write(&[b'\\', escape])), + escape => stry!(writer.write_all(&[b'\\', escape])), }; *string = &string[*idx + quote_dist + 1..]; *idx = 0; @@ -65,9 +43,7 @@ macro_rules! process_16_bytes { *idx += 16; } } - stry!(self.get_writer().write_all(&string[0..*idx])); + stry!(writer.write_all(&string[0..*idx])); *string = &string[*idx..]; Ok(()) } - }; -} \ No newline at end of file diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs index 0f00857..45322ce 100644 --- a/src/neon/stage1.rs +++ b/src/neon/stage1.rs @@ -18,7 +18,7 @@ macro_rules! bit_mask { } #[cfg_attr(not(feature = "no-inline"), inline(always))] -unsafe fn neon_movemask(input: uint8x16_t) -> u16 { +pub(crate) unsafe fn neon_movemask(input: uint8x16_t) -> u16 { let minput: uint8x16_t = vandq_u8(input, bit_mask!()); let tmp: uint8x16_t = vpaddq_u8(minput, minput); let tmp = vpaddq_u8(tmp, tmp); @@ -28,7 +28,7 @@ unsafe fn neon_movemask(input: uint8x16_t) -> u16 { } #[cfg_attr(not(feature = "no-inline"), inline(always))] -unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 { +pub unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 { let bit_mask = bit_mask!(); let t0 = vandq_u8(p0, bit_mask); diff --git a/src/value.rs b/src/value.rs index 0624587..c5d5aa8 100644 --- a/src/value.rs +++ b/src/value.rs @@ -11,7 +11,7 @@ /// we do not require prior knowledge sbout string comtent to to take advantage /// of it. pub mod borrowed; -mod generator; +pub(crate) mod generator; pub mod owned; pub use self::borrowed::{to_value as to_borrowed_value, Value as BorrowedValue}; diff --git a/src/value/generator.rs b/src/value/generator.rs index b053344..007acc1 100644 --- a/src/value/generator.rs +++ b/src/value/generator.rs @@ -12,19 +12,19 @@ use std::ptr; #[cfg(target_arch = "x86")] use std::arch::x86::*; -#[cfg(any(target_feature = "sse4.2", target_feature = "avx2"))] +#[cfg(target_arch = "x86_64")] use std::arch::x86_64::*; use crate::*; #[cfg(target_feature = "avx2")] -pub use crate::avx2::generator::*; +use crate::avx2::generator::*; #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), not(target_feature = "avx2")))] -pub use crate::sse42::generator::*; +use crate::sse42::generator::*; #[cfg(target_feature = "neon")] -pub use crate::neon::generator::*; +use crate::neon::generator::*; const QU: u8 = b'"'; const BS: u8 = b'\\'; @@ -37,7 +37,7 @@ const UU: u8 = b'u'; const __: u8 = 0; // Look up table for characters that need escaping in a product string -static ESCAPED: [u8; 256] = [ +pub(crate) static ESCAPED: [u8; 256] = [ // 0 1 2 3 4 5 6 7 8 9 A B C D E F UU, UU, UU, UU, UU, UU, UU, UU, BB, TT, NN, UU, FF, RR, UU, UU, // 0 UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, // 1 @@ -113,8 +113,7 @@ pub trait BaseGenerator { // quote characters that gives us a bitmask of 0x1f for that // region, only quote (`"`) and backslash (`\`) are not in // this range. - stry!(self.process_32_bytes(&mut string, &mut len, &mut idx)); - stry!(self.process_32_bytes(&mut string, &mut len, &mut idx)); + stry!(write_str_simd(self.get_writer(), &mut string, &mut len, &mut idx)); } // Legacy code to handle the remainder of the code for (index, ch) in string.iter().enumerate() { @@ -127,12 +126,6 @@ pub trait BaseGenerator { self.write_char(b'"') } - // 32-byte generation implementation - process_32_bytes!(); - - // 16-byte generation implementation - process_16_bytes!(); - #[inline(always)] fn write_float(&mut self, num: f64) -> io::Result<()> { let mut buffer = ryu::Buffer::new();

that is amazing 🚀 ... integrating!

src/value/generator.rs

src/neon/stage1.rs

…, need some advice there)

sunnygleason · 2019-08-16T17:44:48Z

@Licenser ok this is weird, arm64 passes for me in EC2 but not on drone. Are you able to repro the test failure?

src/neon/stage1.rs

sunnygleason · 2019-08-16T19:33:09Z

@Licenser success!

Licenser · 2019-08-16T19:45:35Z

We are getting there :D

Licenser · 2019-08-16T19:46:15Z

Now I just need to get the last 5 intrinsics working :/ I split the MR in the hope that we can get 90% merged while I try to wrangle the remaining ones.

src/value/generator.rs

src/sse42/generator.rs

src/neon/generator.rs

src/avx2/generator.rs

sunnygleason · 2019-08-16T20:13:34Z

@Licenser ok, hopefully it works! :)

Licenser

Incredible work! :D

@Licenser

* Put something in the readme so we can have a PR * Add drone file * update build status * unguard for sse4.2 to allow rust to polyfill on older platforms * Add more simd tests * RFC: Neon support (pretty much working) (#35) * feat: neon support * feat: temp stub replacements for neon intrinsics (pending rust-lang/stdarch#792) * fix: drone CI rustup nightly * feat: fix guards, use rust stdlib for bit count operations * fix: remove double semicolon * feat: fancy generic generator functions, thanks @Licenser * Update extq intrinsics * Use simd-lite (#39) * Use simd-lite * Update badge * Update badge * Get rid of transmutes * Use NeonInit trait * vqsubq_u8 fix * vqsubq_u8 fix pt. 2 * use reexprted values from simd-lite * add simd-lite real version