-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split out arrow-buffer crate (#2594) #2693
Conversation
|
||
#[inline] | ||
unsafe fn null_pointer<T: NativeType>() -> NonNull<T> { | ||
NonNull::new_unchecked(ALIGNMENT as *mut T) | ||
unsafe fn null_pointer() -> NonNull<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically a breaking change, it appears to relate to early experiments by @jorgecarleitao r.e. arrow2. It isn't used anywhere, and so I think can just go. The removal is so that arrow-buffer doesn't depend on DataType
@@ -38,15 +38,15 @@ fn create_buffer(size: usize) -> Buffer { | |||
} | |||
|
|||
fn bench_buffer_and(left: &Buffer, right: &Buffer) { | |||
criterion::black_box((left & right).unwrap()); | |||
criterion::black_box(buffer_bin_and(left, 0, right, 0, left.len() * 8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the other breaking change, I opted to remove the BitAnd
, etc... implementations for Buffer
for two reasons:
- They were very easy to accidentally forget to apply the offset
- They need an error enumeration, and defining an error type in arrow-buffer solely for this case seemed overkill
Some(right_bitmap) => { | ||
// NOTE: right values and bitmaps are combined and stay at bit offset right.offset() | ||
(right.values() & &right_bitmap.bits).ok().map(|b| b.not()) | ||
let and = buffer_bin_and( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is not only simpler now, but avoids performing work on values past the offset
mod immutable; | ||
pub use immutable::*; | ||
mod mutable; | ||
pub use mutable::*; | ||
mod ops; | ||
mod scalar; | ||
pub use scalar::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if it makes sense to just collapse the /buffer/
folder to top level given arrow_buffer
crate name gives that level of indirection already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as the crate still contains other public modules, like alloc, it would be confusing to lift the contents of buffer to the top-level
arrow-buffer/src/alloc/mod.rs
Outdated
unsafe { | ||
if size == 0 { | ||
null_pointer() | ||
} else { | ||
let size = size * size_of::<T>(); | ||
let size = size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let size = size; |
} | ||
|
||
/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values. | ||
/// This is more performant than using [allocate_aligned_zeroed] when all bytes will have | ||
/// an unknown or non-zero value and is semantically similar to `malloc`. | ||
pub fn allocate_aligned<T: NativeType>(size: usize) -> NonNull<T> { | ||
pub fn allocate_aligned(size: usize) -> NonNull<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the caller needs to compute the allocation size by the type T they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, although nothing was actually using this API to allocate anything other than bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it was also related to a time when there were fewer things that were ArrowNativeType (I can't remember but that may have changed over time)
unsafe fn null_pointer<T: NativeType>() -> NonNull<T> { | ||
NonNull::new_unchecked(ALIGNMENT as *mut T) | ||
unsafe fn null_pointer() -> NonNull<u8> { | ||
NonNull::new_unchecked(ALIGNMENT as *mut u8) | ||
} | ||
|
||
/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like previously the doc is not totally correct as size
will be multiplied by size of T.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But actually I don't see any allocate_aligned usage with types other than u8 so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I believe Jorge opted to fork off arrow2 before getting further with integrating this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. It is very exciting to see this happening
I suggest we should run some benchmarks if we haven't already done so to ensure this doesn't cause performance regressions (due to different cross crate inlining rules or something else silly)
I have also added a note to #2665 that we will have to update the release process as we add some new crates
} | ||
|
||
/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values. | ||
/// This is more performant than using [allocate_aligned_zeroed] when all bytes will have | ||
/// an unknown or non-zero value and is semantically similar to `malloc`. | ||
pub fn allocate_aligned<T: NativeType>(size: usize) -> NonNull<T> { | ||
pub fn allocate_aligned(size: usize) -> NonNull<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it was also related to a time when there were fewer things that were ArrowNativeType (I can't remember but that may have changed over time)
third_offset_in_bits: usize, | ||
fourth: &Buffer, | ||
fourth_offset_in_bits: usize, | ||
pub fn bitwise_quaternary_op_helper<F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should profile with this change? It seems like it should not be any different given that buffers
and offsets
are sized (and thus the compiler can check and elide the bounds checks)
} | ||
Ok(Bitmap::from(buffer_bin_and( | ||
&self.bits, | ||
0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this does the same things as &
(uses offset zero) but I wonder if that is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same comment applies to the other operators as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BitMap currently doesn't have an offset, this does mean that practically this implementation is likely being used incorrectly in places, but I opted to just preserve the existing behaviour. I agree it is a bit of a footgun #1802
BTW I plan to test this change out against DataFusion |
Seems to work well 👍 diff --git a/Cargo.toml b/Cargo.toml
index b5a7989d9..713c4844b 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -37,6 +37,6 @@ lto = true
# TEMP
[patch.crates-io]
-arrow = { git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92"}
-parquet = { git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92" }
-arrow-flight = { git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92" }
+arrow = { git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878"}
+parquet = { git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878" }
+arrow-flight = { git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878" } |
Sigh... Time to play LLVM vectorisation bingo again... |
Maybe it is time to sprinkle some |
If only it were that simple 😢 Edit: Plan to come back to this tomorrow morning, LLVM is not playing ball |
FWIW, I'm very in favor of this splitting up into smaller crates. We use a small abstraction layer over arrow buffers or plain vectors and this could reduce the compilation times for that usecase significantly. Regarding benchmarks, do we actually run these with LTO? My current understanding is that without LTO the |
At least in theory LTO is only needed for cross-crate inlining of non-inline annotated functions, and even then generic functions should be inlined anyway as a consequence of monomorphisation. The simple fix is just to move collect_bool into the comparison kernels, but I'm curious to understand what is going on as it may have implications down the line |
@@ -395,15 +395,15 @@ impl MutableBuffer { | |||
/// as it eliminates the conditional `Iterator::next` | |||
#[inline] | |||
pub fn collect_bool<F: FnMut(usize) -> bool>(len: usize, mut f: F) -> Self { | |||
let mut buffer = Self::new(bit_util::ceil(len, 8)); | |||
let mut buffer = Self::new(bit_util::ceil(len, 64) * 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it faster than it was before
eq Float32 time: [9.2538 µs 9.2627 µs 9.2720 µs]
change: [-50.278% -50.193% -50.080%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
eq scalar Float32 time: [7.2924 µs 7.2962 µs 7.3004 µs]
change: [-7.5896% -7.5335% -7.4762%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
neq Float32 time: [9.1950 µs 9.1992 µs 9.2043 µs]
change: [-50.651% -50.607% -50.563%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) high mild
9 (9.00%) high severe
neq scalar Float32 time: [7.2907 µs 7.2949 µs 7.2997 µs]
change: [-7.3374% -7.2597% -7.1685%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
lt Float32 time: [9.2100 µs 9.2180 µs 9.2269 µs]
change: [-50.470% -50.359% -50.234%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
lt scalar Float32 time: [7.2604 µs 7.2638 µs 7.2684 µs]
change: [-8.4085% -8.2278% -7.9735%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
lt_eq Float32 time: [9.2350 µs 9.2410 µs 9.2472 µs]
change: [-50.679% -50.603% -50.530%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe
lt_eq scalar Float32 time: [7.2869 µs 7.2906 µs 7.2946 µs]
change: [-7.8178% -7.7122% -7.6141%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
gt Float32 time: [9.1731 µs 9.1783 µs 9.1844 µs]
change: [-50.586% -50.501% -50.367%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
gt scalar Float32 time: [7.2970 µs 7.2996 µs 7.3026 µs]
change: [-7.6238% -7.4831% -7.2758%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
gt_eq Float32 time: [9.1925 µs 9.1962 µs 9.2004 µs]
change: [-50.274% -50.223% -50.176%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe
And also makes it less reliant on the inlining whims of LLVM. This PR does still represent an ~10% performance hit versus this change applied to master, but I can live with that. We are talking 1 microsecond 😅
Docs failure is related to rust-lang/rust#101844, it runs fine on an older nightly |
Benchmark runs are scheduled for baseline = 7594db6 and contender = fb01656. fb01656 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Part of #2594.
Rationale for this change
Begins the process of gradually exploding this crate, to help with compile times, codesize, etc...
What changes are included in this PR?
Splits the arrow buffer types into their own crate. This is mostly mechanical but there are some breaking changes
Are there any user-facing changes?
Yes, hopefully none that are controversial