-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Neon support (pretty much working) #35
Conversation
That’s a great start! |
We should have all the intrinsics (somewhat working) now! :D some are still brokenish however |
@Licenser I've uploaded some more work -- looking forward to your feedback! The intrinsics are mostly self-contained in |
Did a quick pass over your comments :) it's a bit late here so I'll take a closer look tomorrow. |
@Licenser I think we're getting closer. My main issue now is how to get around these errors:
Overall, seems like we're getting closer... |
@Licenser ok, have it mostly building. Now I'm hitting issues in the compiler/link step I believe:
|
So on a aarch64 cpu this compiles! |
@Licenser I think I'm at a point where I could use some help. I've cross-checked the implementation with the C++ version -- things seem mostly ok, but still only 21 tests pass. It would be great to have another set of eyes check this out! Thank you so much for your help. |
Sure! There is a real chance that there is a bug in the intrinsics, who knows if I messed something up there :P it's not merged yet and for good reasons. So I did a bit of the good old println debugging (or With the input Before the call:
structurals is 7 (correct) 4 + 2 + 1 (first three bits, the first 3 characters are structurals, that's good!) index: 64 since we had input length = 3 and we add 64 (we process that at a time) to the input (this might be where it geos wrong? but it's the same on avx and sse, so it should be right) After the call:
this is after the call, indexes are 0, 64, 2 and 1. I suspect that is where we go oops, it should be 0, 1, 2, 3. I'll take a peak at |
@Licenser that makes sense, I'm seeing the same thing - bad character at index 64. I have augmented my code with println debugging so I will be able to get back to it later this evening -- in the meantime, thank you so much for your help! :) |
Not everything is working but the following diff makes quite a difference: diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs
index a1e15a1..2af15e6 100644
--- a/src/neon/stage1.rs
+++ b/src/neon/stage1.rs
@@ -332,7 +332,7 @@ unsafe fn flatten_bits(base: &mut Vec<u32>, idx: u32, mut bits: u64) {
let v3: i32 = static_cast_i32!(trailingzeroes(bits));
bits &= bits.wrapping_sub(1);
- let v: int32x4_t = int32x4_t::new(v3, v2, v1, v0);
+ let v: int32x4_t = int32x4_t::new(v0, v1, v2, v3);
let v: int32x4_t = vaddq_s32(idx_64_v, v);
std::ptr::write(base.as_mut_ptr().add(l) as *mut int32x4_t, v); |
here is another one: diff --git a/src/numberparse.rs b/src/numberparse.rs
index 425d16a..b44d320 100644
--- a/src/numberparse.rs
+++ b/src/numberparse.rs
@@ -133,10 +133,10 @@ pub enum Number {
fn parse_eight_digits_unrolled(chars: &[u8]) -> u32 {
let val: u64 = unsafe { *(chars.as_ptr() as *const u64) };
// memcpy(&val, chars, sizeof(u64));
- let val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
- let val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
+ let val = (val & 0x0F0F0F0F0F0F0F0F).wrapping_mul(2561) >> 8;
+ let val = (val & 0x00FF00FF00FF00FF).wrapping_mul(6553601) >> 16;
- return ((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32) as u32;
+ return ((val & 0x0000FFFF0000FFFF).wrapping_mul(42949672960001) >> 32) as u32;
}
impl<'de> Deserializer<'de> { With those two we're up to 66 passing tests. That said we shouldn't generally fall back to the slow parse_eight_digits_unrolled, it should be architecture specific I think. |
The remaining failing tests are:
I gut feeling says it's unicode/utf8 related as all ood_* cases also include silly UTF8 strings. |
That's awesome. The utf8 stuff has a lot of weirdness with 'u8' vs 'i8' vs 's8' and reinterpretation, so I'm not totally surprised. It could be reinterpretation issue, or me slightly modifying intrinsic(s). Pretty awesome to be at 66 passing tests, nice work! |
(Also, I can confirm the 66 passing tests in my ARM environment) |
All credit to you on the work I just swapped a |
f544f03
to
faea015
Compare
OK, this should be better!
|
src/value/generator.rs
Outdated
#[cfg(target_feature = "avx2")] | ||
const AVX2_PRESENT : bool = true; | ||
#[cfg(not(target_feature = "avx2"))] | ||
const AVX2_PRESENT : bool = false; | ||
#[cfg(target_feature = "sse4.2")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those are not needed, we don't need to do runtime checks. you can simply put the whole thing in a block and guard the block with a #[cfg]
directive :) that'll prevent us from having to do it over and over again.
The compiler might (should) optimize a if false away but having the #[cfg]
guard around the code that isn't executed makes it clearer when reading the code that it's something that happens at compile not at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use macros; the trait modeling is slightly beyond my experience level since we're going outside the current module, advice welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a diff at the end of this but I'll walk through the theory first:
So rust is quite powerful when it comes to generics and the dependant compilation of whole module trees.
We'll use that to our advantage to make the implementation zero overhead (and I don't mean nearly zero overhead but just have it slot in where it is requred at no cost at runtime not even a nop'ed out funciton call).
1st: We call the function the same in all implementations. This allows us to use a dependant import the same way we use it for the stage1 and other modules.
2nd: We pull the implementation out of the trait and make it generic over the writer. Generic here doesn't mean dynamic dispatch. Rust will compile the function for each generic variant that is used. In our case we make it generic over the writer (called W
in the implementation).
3rd: NEON case (we ca reuse the neon_movemask
from stage1 - I think this eventually should go into a file of shared functions but I wanted to keep the diff small.
4th: follow the same step for sse and avx.
result: No macros, no runtime overhead. Yay!
The whole rust generics, especially when mixed with dependant compilation and multiple architectures, take a some time to wrap your head around :D it's one of those things that just require you to do run in a wall 15000 times until it clicks!
diff --git a/src/neon/generator.rs b/src/neon/generator.rs
index 7e0c8b6..7c9eaad 100644
--- a/src/neon/generator.rs
+++ b/src/neon/generator.rs
@@ -1,32 +1,10 @@
+use crate::value::generator::ESCAPED;
+use std::io;
+use crate::neon::intrinsics::*;
+use crate::neon::stage1::neon_movemask;
-#[macro_export]
-macro_rules! process_32_bytes {
- () => {
#[inline(always)]
- unsafe fn process_32_bytes(&mut self, _string: &mut &[u8], _len: &mut usize, _idx: &mut usize) -> io::Result < () > {
- Ok(())
- }
- };
-}
-
-#[macro_export]
-macro_rules! process_16_bytes {
- () => {
- #[inline(always)]
- unsafe fn process_16_bytes(&mut self, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> {
- #[cfg_attr(not(feature = "no-inline"), inline(always))]
- unsafe fn __neon_movemask(input: uint8x16_t) -> u16 {
- let bit_mask = uint8x16_t::new(
- 0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
- 0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80
- );
- let minput: uint8x16_t = vandq_u8(input, bit_mask);
- let tmp: uint8x16_t = vpaddq_u8(minput, minput);
- let tmp = vpaddq_u8(tmp, tmp);
- let tmp = vpaddq_u8(tmp, tmp);
-
- vgetq_lane_u16(vreinterpretq_u16_u8(tmp), 0)
- }
+ pub unsafe fn write_str_simd<W>(writer: &mut W, string: &mut &[u8], len: &mut usize, idx: &mut usize) -> io::Result<()> where W: std::io::Write {
// The case where we have a 16+ byte block
// we repeate the same logic as above but with
@@ -48,15 +26,15 @@ macro_rules! process_16_bytes {
// now.
let is_unchanged = vxorrq_u8(data, in_quote_range);
let in_range = vceqq_u8(is_unchanged, zero);
- let quote_bits = __neon_movemask(vorrq_u8(bs_or_quote, in_range));
+ let quote_bits = neon_movemask(vorrq_u8(bs_or_quote, in_range));
if quote_bits != 0 {
let quote_dist = quote_bits.trailing_zeros() as usize;
- stry!(self.get_writer().write_all(&string[0..*idx + quote_dist]));
+ stry!(writer.write_all(&string[0..*idx + quote_dist]));
let ch = string[*idx + quote_dist];
match ESCAPED[ch as usize] {
- b'u' => stry!(write!(self.get_writer(), "\\u{:04x}", ch)),
+ b'u' => stry!(write!(writer, "\\u{:04x}", ch)),
- escape => stry!(self.write(&[b'\\', escape])),
+ escape => stry!(writer.write_all(&[b'\\', escape])),
};
*string = &string[*idx + quote_dist + 1..];
*idx = 0;
@@ -65,9 +43,7 @@ macro_rules! process_16_bytes {
*idx += 16;
}
}
- stry!(self.get_writer().write_all(&string[0..*idx]));
+ stry!(writer.write_all(&string[0..*idx]));
*string = &string[*idx..];
Ok(())
}
- };
-}
\ No newline at end of file
diff --git a/src/neon/stage1.rs b/src/neon/stage1.rs
index 0f00857..45322ce 100644
--- a/src/neon/stage1.rs
+++ b/src/neon/stage1.rs
@@ -18,7 +18,7 @@ macro_rules! bit_mask {
}
#[cfg_attr(not(feature = "no-inline"), inline(always))]
-unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
+pub(crate) unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
let minput: uint8x16_t = vandq_u8(input, bit_mask!());
let tmp: uint8x16_t = vpaddq_u8(minput, minput);
let tmp = vpaddq_u8(tmp, tmp);
@@ -28,7 +28,7 @@ unsafe fn neon_movemask(input: uint8x16_t) -> u16 {
}
#[cfg_attr(not(feature = "no-inline"), inline(always))]
-unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 {
+pub unsafe fn neon_movemask_bulk(p0: uint8x16_t, p1: uint8x16_t, p2: uint8x16_t, p3: uint8x16_t) -> u64 {
let bit_mask = bit_mask!();
let t0 = vandq_u8(p0, bit_mask);
diff --git a/src/value.rs b/src/value.rs
index 0624587..c5d5aa8 100644
--- a/src/value.rs
+++ b/src/value.rs
@@ -11,7 +11,7 @@
/// we do not require prior knowledge sbout string comtent to to take advantage
/// of it.
pub mod borrowed;
-mod generator;
+pub(crate) mod generator;
pub mod owned;
pub use self::borrowed::{to_value as to_borrowed_value, Value as BorrowedValue};
diff --git a/src/value/generator.rs b/src/value/generator.rs
index b053344..007acc1 100644
--- a/src/value/generator.rs
+++ b/src/value/generator.rs
@@ -12,19 +12,19 @@ use std::ptr;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
-#[cfg(any(target_feature = "sse4.2", target_feature = "avx2"))]
+#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
use crate::*;
#[cfg(target_feature = "avx2")]
-pub use crate::avx2::generator::*;
+use crate::avx2::generator::*;
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), not(target_feature = "avx2")))]
-pub use crate::sse42::generator::*;
+use crate::sse42::generator::*;
#[cfg(target_feature = "neon")]
-pub use crate::neon::generator::*;
+use crate::neon::generator::*;
const QU: u8 = b'"';
const BS: u8 = b'\\';
@@ -37,7 +37,7 @@ const UU: u8 = b'u';
const __: u8 = 0;
// Look up table for characters that need escaping in a product string
-static ESCAPED: [u8; 256] = [
+pub(crate) static ESCAPED: [u8; 256] = [
// 0 1 2 3 4 5 6 7 8 9 A B C D E F
UU, UU, UU, UU, UU, UU, UU, UU, BB, TT, NN, UU, FF, RR, UU, UU, // 0
UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, UU, // 1
@@ -113,8 +113,7 @@ pub trait BaseGenerator {
// quote characters that gives us a bitmask of 0x1f for that
// region, only quote (`"`) and backslash (`\`) are not in
// this range.
- stry!(self.process_32_bytes(&mut string, &mut len, &mut idx));
- stry!(self.process_32_bytes(&mut string, &mut len, &mut idx));
+ stry!(write_str_simd(self.get_writer(), &mut string, &mut len, &mut idx));
}
// Legacy code to handle the remainder of the code
for (index, ch) in string.iter().enumerate() {
@@ -127,12 +126,6 @@ pub trait BaseGenerator {
self.write_char(b'"')
}
- // 32-byte generation implementation
- process_32_bytes!();
-
- // 16-byte generation implementation
- process_16_bytes!();
-
#[inline(always)]
fn write_float(&mut self, num: f64) -> io::Result<()> {
let mut buffer = ryu::Buffer::new();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is amazing 🚀 ... integrating!
…, need some advice there)
@Licenser ok this is weird, arm64 passes for me in EC2 but not on drone. Are you able to repro the test failure? |
@Licenser success! |
We are getting there :D |
Now I just need to get the last 5 intrinsics working :/ I split the MR in the hope that we can get 90% merged while I try to wrangle the remaining ones. |
@Licenser ok, hopefully it works! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incredible work! :D
* Put something in the readme so we can have a PR * Add drone file * update build status * unguard for sse4.2 to allow rust to polyfill on older platforms * Add more simd tests * RFC: Neon support (pretty much working) (#35) * feat: neon support * feat: temp stub replacements for neon intrinsics (pending rust-lang/stdarch#792) * fix: drone CI rustup nightly * feat: fix guards, use rust stdlib for bit count operations * fix: remove double semicolon * feat: fancy generic generator functions, thanks @Licenser * Update extq intrinsics * Use simd-lite (#39) * Use simd-lite * Update badge * Update badge * Get rid of transmutes * Use NeonInit trait * vqsubq_u8 fix * vqsubq_u8 fix pt. 2 * use reexprted values from simd-lite * add simd-lite real version
Hello hello!
I have been pulling some of your Neon intrinsics and porting the
simdjson neon code. Maybe it's useful! I'll keep improving it...
Comments welcome anytime!
All the best,
-Sunny