Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ArrayData from Array (#3880) #4061

Merged
merged 4 commits into from
Apr 12, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #3880

Rationale for this change

See ticket

What changes are included in this PR?

Are there any user-facing changes?

Yes, this inherently changes the way that slicing behaves

It also adds a statically typed BooleanArray::slice that somehow was missed in #3930

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 12, 2023
@@ -1107,94 +1107,29 @@ fn get_buffer_element_width(spec: &BufferSpec) -> usize {
}
}

/// Returns byte width for binary value_offset buffer spec.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic had to be reworked to not rely on a non-zero ArrayData::offset to trigger the re-encoding logic

assert_eq!(3, sliced_array.null_count());

for i in 0..sliced_array.len() {
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) {
if bit_util::get_bit(&null_bits, 1 + i) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is necessary because the array no longer "keeps" the offset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)

@@ -184,19 +164,15 @@ pub trait Array: std::fmt::Debug + Send + Sync {
/// # Example:
///
/// ```
/// use arrow_array::{Array, Int32Array};
/// use arrow_array::{Array, BooleanArray};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to BooleanArray which still "preserves" the offset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should deprecate this API for the reasons explained abve

#[should_panic(
expected = "FixedSizeListArray child array length should be a multiple of 3"
)]
#[should_panic(expected = "assertion failed: (offset + length) <= self.len()")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a sanity check, and implies an invalid ArrayData, I'm not sure the complexity of having a nicer error message is worth it

if *len > 0 {
// check that child data is multiple of length
assert_eq!(
values.len() % *len as usize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is redundant with the slice below

Comment on lines -166 to -175
assert_eq!(
data.buffers().len(),
0,
"FixedSizeListArray data should not contain a buffer for value offsets"
);
assert_eq!(
data.child_data().len(),
1,
"FixedSizeListArray should contain a single child array (values array)"
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are redundant as they would imply an invalid ArrayData

@@ -566,9 +615,9 @@ mod tests {
fixed_size_binary_array.value(1)
);
assert_eq!(2, fixed_size_binary_array.len());
assert_eq!(5, fixed_size_binary_array.value_offset(0));
assert_eq!(0, fixed_size_binary_array.value_offset(0));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice pushdown results in this changing

@tustvold tustvold added the api-change Changes to the arrow API label Apr 12, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive @tustvold -- some said it couldn't be done but seeing your string of incremental PRs over the last few months to remove ArrayData with the strongly typed abstractions has been inspirational

I know it seemingly took longer than it might have without the constraints of existing users, I think this change will be that much more valuable over the long term.

So thank you again

self.values.is_empty()
}

/// Returns a zero-copy slice of this array with the indicated offset and length.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Self {
data_type: self.data_type.clone(),
keys: self.keys.slice(offset, length),
values: self.values.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the values aren't sliced because the keys may still refer to any logical value in the dictionary

fn nulls(&self) -> Option<&NullBuffer> {
self.dictionary.nulls()
}

fn get_buffer_memory_size(&self) -> usize {
self.dictionary.get_buffer_memory_size()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this function and the one below it also account for the self.values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.values will get counted by self.dictionary, self.values is just a downcast borrow of the inner array

}

/// Returns the length for an element.
///
/// All elements have the same length as the array is a fixed size.
#[inline]
pub fn value_length(&self) -> i32 {
self.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed length to value_length to avoid confusion with len

self.len == 0
}

fn offset(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the meaning of offset in this new world where the offset is stored on the underlying typed array? Maybe we should remove it from the Array trait 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's certainly doesn't have clear semantics anymore. As this PR is already quite large I'll remove it in a follow on PR

assert_eq!(3, sliced_array.null_count());

for i in 0..sliced_array.len() {
if bit_util::get_bit(&null_bits, sliced_array.offset() + i) {
if bit_util::get_bit(&null_bits, 1 + i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I think maybe it would make sense to at least deprecate Array::offset as it is always 0 now (and as part of the deprecation message explain why)

fn get_buffer_memory_size(&self) -> usize {
self.data.get_buffer_memory_size()
let mut size = self.entries.get_buffer_memory_size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the other fields like keys, values, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keys and values are stored within entries, they're the children of the StructArray. Perhaps it would be worth making entries StructArray 🤔

@@ -184,19 +164,15 @@ pub trait Array: std::fmt::Debug + Send + Sync {
/// # Example:
///
/// ```
/// use arrow_array::{Array, Int32Array};
/// use arrow_array::{Array, BooleanArray};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should deprecate this API for the reasons explained abve

// TODO: Slice buffers directly (#3880)
self.data.slice(offset, length).into()
let (offsets, fields) = match self.offsets.as_ref() {
Some(offsets) => (Some(offsets.slice(offset, length)), self.fields.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by this at first -- some comments here (and/or in the self.offsets I think would help

Specifically I think if self.offsets is SOme(..) that means DenseUnion and the offsets point to the correct children so slicing the offsets but not the actual children makes sense

if self.offsets is None, then it is a sparse union, so all the children line up and we need to slice them individually.

I had to re-read https://arrow.apache.org/docs/format/Columnar.html#union-layout to make sure

_ => unreachable!(),
}
}
let offsets = match start_offset.as_usize() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add some comments about what this is doing (namely that it is "rebasing" the offsets to all be zero based)? That would be easy to overlook and was somewhat encoded in the method names previously

@alamb
Copy link
Contributor

alamb commented Apr 12, 2023

As part of the completion of this project, @tustvold could you write up some description of the relationship between Buffer / PrimitveBuffer / ArrayData, etc?

Perhaps we can take inspiration (or copy/modify) the guide from:
https://jorgecarleitao.github.io/arrow2/main/guide/

If you think it makes sense I will file a ticket

@tustvold
Copy link
Contributor Author

If you think it makes sense I will file a ticket

I think this is definitely valuable, and will mesh nicely with #3879

@alamb
Copy link
Contributor

alamb commented Apr 12, 2023

If you think it makes sense I will file a ticket

Filed #4071

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

First-Class Array Abstractions
2 participants