-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace RawPtrBox with ScalarBuffer, reduce unsafe
usage (#1811)
#1825
Conversation
Fix value_offsets for empty variable length arrays (apache#1824)
self.value_data.as_ptr().offset(start.to_isize().unwrap()), | ||
(end - start).to_usize().unwrap(), | ||
) | ||
let start = self.value_offsets.get_unchecked(i).to_isize().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this to_isize()
dance, really I just want ArrowNativeType
to provide AsPrimitive
so that we can do an unchecked cast. Will file a follow up PR
@@ -940,7 +938,7 @@ mod tests { | |||
} | |||
|
|||
#[test] | |||
#[should_panic(expected = "index out of bounds: the len is 10 but the index is 11")] | |||
#[should_panic(expected = "index out of bounds: the len is 9 but the index is 10")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index changes because we check the length of the array, not the end offset first
Codecov Report
@@ Coverage Diff @@
## master #1825 +/- ##
==========================================
- Coverage 83.45% 83.44% -0.01%
==========================================
Files 200 199 -1
Lines 56697 56685 -12
==========================================
- Hits 47315 47301 -14
- Misses 9382 9384 +2
Continue to review full report at Codecov.
|
unsafe
usage (#1811)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great @tustvold -- thank you.
The type cast encapsulation of ScalarBuffer
seems just 👨🍳 👌 and I find this code much easier to follow than the previous calculations involving RawPtrBox
I think it is worth running some benchmarks here too to make sure this doesn't accidentally make performance worse. I wouldn't expect any change, but I do think it would be worth double checking.
@@ -64,67 +64,40 @@ impl<OffsetSize: OffsetSizeTrait> GenericBinaryArray<OffsetSize> { | |||
|
|||
/// Returns a clone of the value data buffer | |||
pub fn value_data(&self) -> Buffer { | |||
self.data.buffers()[1].clone() | |||
self.value_data.clone() | |||
} | |||
|
|||
/// Returns the offset values in the offsets buffer | |||
#[inline] | |||
pub fn value_offsets(&self) -> &[OffsetSize] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
@@ -39,8 +39,8 @@ use crate::{buffer::MutableBuffer, datatypes::DataType}; | |||
/// binary data. | |||
pub struct GenericBinaryArray<OffsetSize: OffsetSizeTrait> { | |||
data: ArrayData, | |||
value_offsets: RawPtrBox<OffsetSize>, | |||
value_data: RawPtrBox<u8>, | |||
value_offsets: ScalarBuffer<OffsetSize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if calling ScalarBuffer
something like TypedBuffer
would make its purpose slightly clearer 🤔
let start = self.value_offsets.get_unchecked(i).to_isize().unwrap(); | ||
let end = self.value_offsets.get_unchecked(i + 1).to_isize().unwrap(); | ||
|
||
// buffer bounds/offset is ensured by the value_offset invariants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and to be clear, this method is unsafe
so getting unchecked values seems like it would be ok...
@@ -573,6 +566,18 @@ mod tests { | |||
.expect("All null array has valid array data"); | |||
} | |||
|
|||
#[test] | |||
fn test_string_array_empty_offsets() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -1,67 +0,0 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 👍
Benchmarks show it making a fairly non-trivial difference to the following kernels:
I need to look a bit more into what is going on here |
Looks good, nice improvement. Regarding the benchmarks, maybe the |
Sadly adding |
I may revisit this later, but it isn't a priority right now |
@@ -100,17 +92,15 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> { | |||
/// caller must ensure that the passed in offset is less than the array len() | |||
#[inline] | |||
pub unsafe fn value_unchecked(&self, i: usize) -> T::Native { | |||
let offset = i + self.offset(); | |||
*self.raw_values.as_ptr().add(offset) | |||
*self.raw_values.get_unchecked(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value_unchecked
seems more costly than before? Previously it simply takes value from the ptr (with an offset).
But get_unchecked
on ScalarBuffer
will create a slice first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slice creation is unchecked, so they should compile down to the same thing. Whether they do though...
Fix value_offsets for empty variable length arrays (#1824)
Which issue does this PR close?
Closes #1811
Closes #1824
Rationale for this change
The previous logic was more unsafe, and in fact implementing this PR found a soundness bug in the form of #1824, so already got some ROI 🎉
What changes are included in this PR?
Removes RawPtrBox and changes to using ScalarBuffer
Are there any user-facing changes?
Yes,
value_offsets
will now always return an empty slice for zero-length lists