Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ArrayDataLayout, port validation (#1799) #3818

Merged
merged 6 commits into from
Mar 9, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 8, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

Switching to the ArrayData enumeration requires a substantial amount of glue logic to convert between the struct and enum layouts. This PR therefore adds

  • An ArrayDataLayout that can be used to get a struct view of the enum variants, for the purposes of validation, etc..
  • Methods to create the enum variants from ArrayDataBuilder
  • Methods to slice array data variants

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 8, 2023
@tustvold tustvold marked this pull request as ready for review March 8, 2023 20:06
&self.0
}

/// Returns the inner [`Buffer`]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the inner [`Buffer`]
/// Returns the inner [`ScalarBuffer`]

arrow-buffer/src/buffer/offset.rs Show resolved Hide resolved
///
/// Panics if
/// - `nulls` and `values` are different lengths
/// - `data_type` is not compatible with `T`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is T here? Maybe?

Suggested change
/// - `data_type` is not compatible with `T`
/// - `data_type` is not compatible with `PhysicalType::Boolean`

use arrow_buffer::buffer::{BooleanBuffer, NullBuffer};
use arrow_schema::DataType;

#[derive(Debug, Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -116,22 +170,67 @@ impl<E: RunEnd> RunArrayData<E> {
/// # Safety
///
/// - `data_type` must be valid for this layout
/// - `run_ends` must contain monotonically increasing, positive values `<= child.len()`
/// - `run_ends` must contain monotonically increasing, positive values `<= len`
/// - `run_ends.len() == child.len()`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear what child refers to. The variable child in below function is built using scalar buffer and its length will not be equal to run_ends.len().
Also, there are two children. So it'll be clear to mention which child is being referred in the comment.

@alamb alamb changed the title Add ArrayDataLayout (#1799) Add ArrayDataLayout, port validation (#1799) Mar 9, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- I went through this PR carefully.

I think this PR is related to #1799. I think it is important to understand the context for this PR. Specifically, as I understand it the rationale for this work is to provide an easy and zero cost conversion between arrow-rs and arrow2 arrays (see jorgecarleitao/arrow2#1429 for more details) by unifying the lowest levels using the ArrayData enumeration

Am I right in concluding that extracting ArrayDataLayout allow the data validation logic to be used in the new structure?

While this does create non trivial tech-debt (as now there are multiple ways to get at the lowest level arrow data), I think with some more documentation and providing some good examples we'll be in good shape to migrate them slowly over.

I am not sure how important it is to leave this PR open for longer to ensure anyone who might have an opinion has a chance to comment -- When you are ready to switch to the new ArrayDataEnumeration that certainly should have some comment time.

cc @ritchie46 and @sundy-li

@@ -139,4 +139,9 @@ impl BooleanBuffer {
pub fn inner(&self) -> &Buffer {
&self.buffer
}

/// Returns the inner [`Buffer`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the inner [`Buffer`]
/// Returns the inner [`Buffer`], consuming self

&self.0
}

/// Returns the inner [`ScalarBuffer`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the inner [`ScalarBuffer`]
/// Returns the inner [`ScalarBuffer`], consuming self

arrow-buffer/src/buffer/scalar.rs Show resolved Hide resolved
&self.buffer
}

/// Returns the inner [`Buffer`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the inner [`Buffer`]
/// Returns the inner [`Buffer`], consuming self

buffers: Buffers::two(self.offsets.inner().inner(), &self.values),
child_data: &[],
}
}
}

/// ArrayData for [fixed-size arrays](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout) of bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right link?

/// The children of this RunArrayData
/// 1: the run ends
/// 2: the values
children: Box<[ArrayData; 2]>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is there a reason to store the run ends in children[0] as well as in run_ends? Aka why the redundancy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we can return a slice of children in ArrayDataLayout, this type is a little funky because there is an impedence mismatch between the arrow representation and the arrow-rs representation

arrow-data/src/data/struct.rs Show resolved Hide resolved
@@ -33,13 +33,17 @@ use crate::equal;
mod buffers;
pub use buffers::*;

#[allow(unused)] // Private until ready (#1176)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if #1799 would be a better link?

#1176 was an old discussion ticket

arrow-data/src/data/mod.rs Show resolved Hide resolved
arrow-data/src/data/mod.rs Show resolved Hide resolved
@tustvold
Copy link
Contributor Author

tustvold commented Mar 9, 2023

Am I right in concluding that extracting ArrayDataLayout allow the data validation logic to be used in the new structure?

Yup, it is the glue logic to avoid needing to change everything at once. Eventually it will be removed. It will never be made public.

I am not sure how important it is to leave this PR open for longer to ensure anyone who might have an opinion has a chance to comment

As this PR is almost entirely crate-private I intend to get this in, and we can continue to iterate on it

@tustvold tustvold merged commit 495682a into apache:master Mar 9, 2023
@ursabot
Copy link

ursabot commented Mar 9, 2023

Benchmark runs are scheduled for baseline = 2a3fd96 and contender = 495682a. 495682a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants