-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encapsulate View
logic for GenericByteViewArray
#5619
Conversation
View
logic in StringViewArray
View
logic in StringViewArray
View
logic for GenericByteViewArray
/// | ||
/// ```text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to View
if !flushed.is_empty() { | ||
assert!(self.completed.len() < u32::MAX as usize); | ||
self.completed.push(flushed.into()); | ||
let view: u128 = match OwnedView::from(v) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the construction is now much clearer and encapsulated -- the bit manipulation is handled by OwnedView
and OffsetViewBuilder
arrow-data/src/byte_view.rs
Outdated
} | ||
impl<'a> From<&'a u128> for View<'a> { | ||
#[inline(always)] | ||
fn from(v: &'a u128) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the key API for a View
-- dispatching to the correct variant
} | ||
} | ||
|
||
/// A view for data where the variable length data has 12 or fewer bytes. See |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this PR is merged, I will make a follow on PR to remove the remaining uses of ByteView and deprecate this struct
@@ -178,13 +178,17 @@ fn build_extend_view(array: &ArrayData, buffer_offset: u32) -> Extend { | |||
mutable | |||
.buffer1 | |||
.extend(views[start..start + len].iter().map(|v| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is another example of how the magic 12 number gets extracted out
/// # Notes | ||
/// Equality is based on the bitwise value of the view, not the data it logically points to | ||
#[derive(Debug, Copy, Clone, PartialEq)] | ||
pub enum View<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need into_owned
that return OwnedView
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, will add it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fe510ef
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clear and neat
Thanks for the review @ariesdevil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128
is copy.
However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all. This makes me wonder if just a simple ByteView(u128)
would suffice to encapsulate the short-string logic, or something...
I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we an empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅
Edit: I created #5735 which might provide a simpler way to achieve the same end
/// # Notes | ||
/// Equality is based on the bitwise value of the view, not the data it logically points to | ||
#[derive(PartialEq)] | ||
pub enum OwnedView { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this type adding, it feels like it isn't entirely necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its usecase is to create a view from &str
/ &[u8]
and copy the relevant prefix bytes and remember which variant (inline or offset) the new view was during creation
/// # Notes | ||
/// Equality is based on the bitwise value of the view, not the data it logically points to | ||
#[derive(Debug, Copy, Clone, PartialEq)] | ||
pub enum View<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be ByteView
to avoid confusion with the list view types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already another struct named ByteView
in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView
pub fn as_u128(self) -> &'a u128 { | ||
self.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn as_u128(self) -> &'a u128 { | |
self.0 | |
pub fn as_u128(self) -> u128 { | |
*self.0 |
/// Equality is based on the bitwise value of the view, not the data it | ||
/// logically points to | ||
#[derive(Copy, Clone, PartialEq)] | ||
pub struct InlineView<'a>(&'a u128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we even need this borrow, u128
is copy and removing the indirection might help LLVM not be stupid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128 is copy.
That was to encapsulate copying the inlined bytes and retain the information about which type it was (without having to check the length again)
However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all.
I don't understand this assertion. In my mind the whole point of the View abstraction is to encapsulate the check for length as the two different variants need different handling. For special cases like prefix comparsion we could certainly add specialized functions like View::compare_prefix
I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we an empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅
I think the kernels and usage patterns we have are sufficient. There are several other uses of ByteView that I didn't port over in this PR to keep it smaller (such as
arrow-rs/arrow-data/src/equal/byte_view.rs
Lines 20 to 70 in 4045fb5
pub(super) fn byte_view_equal( | |
lhs: &ArrayData, | |
rhs: &ArrayData, | |
lhs_start: usize, | |
rhs_start: usize, | |
len: usize, | |
) -> bool { | |
let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len]; | |
let lhs_buffers = &lhs.buffers()[1..]; | |
let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len]; | |
let rhs_buffers = &rhs.buffers()[1..]; | |
for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() { | |
// Only checking one null mask here because by the time the control flow reaches | |
// this point, the equality of the two masks would have already been verified. | |
if lhs.is_null(idx) { | |
continue; | |
} | |
let l_len_prefix = *l as u64; | |
let r_len_prefix = *r as u64; | |
// short-circuit, check length and prefix | |
if l_len_prefix != r_len_prefix { | |
return false; | |
} | |
let len = l_len_prefix as u32; | |
// for inline storage, only need check view | |
if len <= 12 { | |
if l != r { | |
return false; | |
} | |
continue; | |
} | |
// check buffers | |
let l_view = ByteView::from(*l); | |
let r_view = ByteView::from(*r); | |
let l_buffer = &lhs_buffers[l_view.buffer_index as usize]; | |
let r_buffer = &rhs_buffers[r_view.buffer_index as usize]; | |
// prefixes are already known to be equal; skip checking them | |
let len = len as usize - 4; | |
let l_offset = l_view.offset as usize + 4; | |
let r_offset = r_view.offset as usize + 4; | |
if l_buffer[l_offset..l_offset + len] != r_buffer[r_offset..r_offset + len] { | |
return false; | |
} | |
} | |
true |
Edit: I created #5735 which might provide a simpler way to achieve the same end
I don't understand how this solves the same problem but I probably don't fully understand it
/// # Notes | ||
/// Equality is based on the bitwise value of the view, not the data it logically points to | ||
#[derive(Debug, Copy, Clone, PartialEq)] | ||
pub enum View<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already another struct named ByteView
in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView
/// # Notes | ||
/// Equality is based on the bitwise value of the view, not the data it logically points to | ||
#[derive(PartialEq)] | ||
pub enum OwnedView { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its usecase is to create a view from &str
/ &[u8]
and copy the relevant prefix bytes and remember which variant (inline or offset) the new view was during creation
In order to move this PR forward, I plan to do the following:
|
I updated the description of this PR to make the rationale clearer. While encoding the number 12 isn't a huge deal in my mind, the core rationale for this PR does is to reduce the cognative overhead of working with While some people have the necessary time to invest understanding the lowest level representation, I think it is a barrier to contribution (as well as using the library). I have seen evidence of this challenge in a few examples such as #5557 to build up the While there might be other explanations, I think an abstraction like this will make ByteViewArrays much easier to work with |
Marking as a draft pending #5736 |
Draft PR as I would love to know if anyone has any comments about this idea / approachNote this looks like a large PR but it is mostly comments and tests
Which issue does this PR close?
Part of #5374
Rationale for this change
While reviewing #5557 from @ariesdevil -- it seemed like the
12
byte view length was getting hard coded many places and the same code is repeated over and over and makes the code harder to work with.While encoding the number 12 isn't a huge deal, what is a larger deal is the cognative overhead of working with ByteViewArrays while having to remember in your head the layouts of what the
u128
represents. Adding an abstraction makes this easier in my opinionWhat changes are included in this PR?
Encapsulate view manipulation in typesafe wrappers over
u128
View
enum that captures which view is usedOwnedView
andOwnedViewBuilder
for creating ViewsInlineView
andOffsetView
View
Are there any user-facing changes?
Several new types, but all existing types still pass