-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ArrayData from Array (#3880) #4061
Remove ArrayData from Array (#3880) #4061
Conversation
@@ -1107,94 +1107,29 @@ fn get_buffer_element_width(spec: &BufferSpec) -> usize { | |||
} | |||
} | |||
|
|||
/// Returns byte width for binary value_offset buffer spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic had to be reworked to not rely on a non-zero ArrayData::offset to trigger the re-encoding logic
assert_eq!(3, sliced_array.null_count()); | ||
|
||
for i in 0..sliced_array.len() { | ||
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) { | ||
if bit_util::get_bit(&null_bits, 1 + i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is necessary because the array no longer "keeps" the offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above I think maybe it would make sense to at least deprecate Array::offset
as it is always 0 now (and as part of the deprecation message explain why)
@@ -184,19 +164,15 @@ pub trait Array: std::fmt::Debug + Send + Sync { | |||
/// # Example: | |||
/// | |||
/// ``` | |||
/// use arrow_array::{Array, Int32Array}; | |||
/// use arrow_array::{Array, BooleanArray}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switch to BooleanArray which still "preserves" the offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should deprecate this API for the reasons explained abve
#[should_panic( | ||
expected = "FixedSizeListArray child array length should be a multiple of 3" | ||
)] | ||
#[should_panic(expected = "assertion failed: (offset + length) <= self.len()")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is a sanity check, and implies an invalid ArrayData, I'm not sure the complexity of having a nicer error message is worth it
if *len > 0 { | ||
// check that child data is multiple of length | ||
assert_eq!( | ||
values.len() % *len as usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is redundant with the slice below
assert_eq!( | ||
data.buffers().len(), | ||
0, | ||
"FixedSizeListArray data should not contain a buffer for value offsets" | ||
); | ||
assert_eq!( | ||
data.child_data().len(), | ||
1, | ||
"FixedSizeListArray should contain a single child array (values array)" | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are redundant as they would imply an invalid ArrayData
@@ -566,9 +615,9 @@ mod tests { | |||
fixed_size_binary_array.value(1) | |||
); | |||
assert_eq!(2, fixed_size_binary_array.len()); | |||
assert_eq!(5, fixed_size_binary_array.value_offset(0)); | |||
assert_eq!(0, fixed_size_binary_array.value_offset(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slice pushdown results in this changing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very impressive @tustvold -- some said it couldn't be done but seeing your string of incremental PRs over the last few months to remove ArrayData with the strongly typed abstractions has been inspirational
I know it seemingly took longer than it might have without the constraints of existing users, I think this change will be that much more valuable over the long term.
So thank you again
self.values.is_empty() | ||
} | ||
|
||
/// Returns a zero-copy slice of this array with the indicated offset and length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Self { | ||
data_type: self.data_type.clone(), | ||
keys: self.keys.slice(offset, length), | ||
values: self.values.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the values aren't sliced because the keys may still refer to any logical value in the dictionary
fn nulls(&self) -> Option<&NullBuffer> { | ||
self.dictionary.nulls() | ||
} | ||
|
||
fn get_buffer_memory_size(&self) -> usize { | ||
self.dictionary.get_buffer_memory_size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this function and the one below it also account for the self.values
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.values
will get counted by self.dictionary
, self.values
is just a downcast borrow of the inner array
} | ||
|
||
/// Returns the length for an element. | ||
/// | ||
/// All elements have the same length as the array is a fixed size. | ||
#[inline] | ||
pub fn value_length(&self) -> i32 { | ||
self.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bug, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed length
to value_length
to avoid confusion with len
self.len == 0 | ||
} | ||
|
||
fn offset(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the meaning of offset
in this new world where the offset is stored on the underlying typed array? Maybe we should remove it from the Array trait 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's certainly doesn't have clear semantics anymore. As this PR is already quite large I'll remove it in a follow on PR
assert_eq!(3, sliced_array.null_count()); | ||
|
||
for i in 0..sliced_array.len() { | ||
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) { | ||
if bit_util::get_bit(&null_bits, 1 + i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above I think maybe it would make sense to at least deprecate Array::offset
as it is always 0 now (and as part of the deprecation message explain why)
fn get_buffer_memory_size(&self) -> usize { | ||
self.data.get_buffer_memory_size() | ||
let mut size = self.entries.get_buffer_memory_size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the other fields like keys
, values
, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keys
and values
are stored within entries
, they're the children of the StructArray
. Perhaps it would be worth making entries
StructArray 🤔
@@ -184,19 +164,15 @@ pub trait Array: std::fmt::Debug + Send + Sync { | |||
/// # Example: | |||
/// | |||
/// ``` | |||
/// use arrow_array::{Array, Int32Array}; | |||
/// use arrow_array::{Array, BooleanArray}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should deprecate this API for the reasons explained abve
// TODO: Slice buffers directly (#3880) | ||
self.data.slice(offset, length).into() | ||
let (offsets, fields) = match self.offsets.as_ref() { | ||
Some(offsets) => (Some(offsets.slice(offset, length)), self.fields.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused by this at first -- some comments here (and/or in the self.offsets
I think would help
Specifically I think if self.offsets
is SOme(..)
that means DenseUnion
and the offsets point to the correct children so slicing the offsets but not the actual children makes sense
if self.offsets
is None, then it is a sparse union, so all the children line up and we need to slice them individually.
I had to re-read https://arrow.apache.org/docs/format/Columnar.html#union-layout to make sure
_ => unreachable!(), | ||
} | ||
} | ||
let offsets = match start_offset.as_usize() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add some comments about what this is doing (namely that it is "rebasing" the offsets to all be zero based)? That would be easy to overlook and was somewhat encoded in the method names previously
As part of the completion of this project, @tustvold could you write up some description of the relationship between Buffer / PrimitveBuffer / ArrayData, etc? Perhaps we can take inspiration (or copy/modify) the guide from: If you think it makes sense I will file a ticket |
I think this is definitely valuable, and will mesh nicely with #3879 |
Filed #4071 |
Which issue does this PR close?
Closes #3880
Rationale for this change
See ticket
What changes are included in this PR?
Are there any user-facing changes?
Yes, this inherently changes the way that slicing behaves
It also adds a statically typed BooleanArray::slice that somehow was missed in #3930