Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrayData Enumeration for Primitive, Binary and UTF8 #3749

Merged
merged 12 commits into from
Feb 27, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Feb 22, 2023

Which issue does this PR close?

Part of #1799
Relates to #1176
Closes #1802

Rationale for this change

This starts the process of fleshing out what the elements of the ArrayData enumeration will consist of, with implementations for primitive and byte arrays. I wanted to get something up for review to get feedback on the general direction, and to also avoid a single massive PR.

What changes are included in this PR?

This PR adds enumerations ArrayDataPrimitive and ArrayDataBytes, that consist of the enumerable variants of the generic PrimitiveArrayData and BytesArrayData.

These provide for an idiomatic way to enumerate the possible array types, with the view that this would eventually replace the current trait object approach.

Additionally downcast / downcast_ref / From are provided to facilitate going from the enumeration to/from the generic form. This is critical to allowing types such as PrimitiveArray to be wrappers around PrimitiveArrayData whilst being constructable from an ArrayData enumeration containing ArrayDataPrimitive.

I anticipate functionality gradually being moved onto these ArrayData abstractions, to allow code-sharing with arrow2, and to reduce monomorphisation overheads for PrimitiveArray, but I've purposefully kept them simple for now

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 22, 2023
@tustvold tustvold changed the title Array data struct ArrayData Enumeration for Primitive, Binary and UTF8 Feb 22, 2023

/// A slice-able [`Buffer`] containing bit-packed booleans
#[derive(Debug, Clone)]
pub struct BooleanBuffer {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This closes #1802 as discussed in #3700 this needs to be a new type to avoid breaking existing workloads

mod private {
use super::*;

pub trait BytesSealed {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This two-level formulation is somewhat arcane, but allows BytesArrayData to be typed solely on actual physical types, without needing a logical ByteArrayType.


/// Describes the physical representation of a given [`DataType`]
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
pub enum PhysicalType {
Copy link
Contributor Author

@tustvold tustvold Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors what the eventual ArrayData enumeration will look like, with each of these variants corresponding to an ArrayData variant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only used for validating physical type when constructing typed ArrayData like PrimitiveArrayData.new does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I anticipate it being used in other safe array data constructors, but yes this is the only current use-case


/// Cast [`BytesArrayData`] to [`ArrayDataBytesOffset`]
fn upcast<B: Bytes + ?Sized>(
v: BytesArrayData<Self, B>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What difference between ArrayDataBytes and BytesArrayData? Seems ArrayDataBytes is the data of bytes types (binary and str). But BytesArrayData?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it after reading the enums and struct BytesArrayData below. Maybe a descriptive comment at the top helpful.

@viirya viirya mentioned this pull request Feb 27, 2023
Comment on lines +45 to +47
pub trait Primitive: private::PrimitiveSealed + ArrowNativeType {
const VARIANT: PrimitiveType;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this trait and the const need document.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add some high-level documentation of the ArrayData "pattern" when I switch ArrayData over (and make these public)

Comment on lines 149 to 154
assert!(
matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT),
"Illegal physical type for PrimitiveArrayData, expected {:?} got {:?}",
T::VARIANT,
physical
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert!(
matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT),
"Illegal physical type for PrimitiveArrayData, expected {:?} got {:?}",
T::VARIANT,
physical
);
assert!(
matches!(physical, PhysicalType::Primitive(p) if p == T::VARIANT),
"Illegal physical type for PrimitiveArrayData of datatype {:?}, expected {:?} got {:?}",
data_type,
T::VARIANT,
physical
);

Comment on lines 40 to 43
pub enum OffsetType {
Small,
Large,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, as this is for physical type, should it be something like Int32, Int64?

Comment on lines 123 to 125
DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary),
DataType::FixedSizeBinary(_) => Self::FixedSizeBinary,
DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep variable length binary close.

Suggested change
DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary),
DataType::FixedSizeBinary(_) => Self::FixedSizeBinary,
DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary),
DataType::Binary => Self::Bytes(OffsetType::Small, BytesType::Binary),
DataType::LargeBinary => Self::Bytes(OffsetType::Large, BytesType::Binary),
DataType::FixedSizeBinary(_) => Self::FixedSizeBinary,

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start.

@tustvold tustvold merged commit dae7a71 into apache:master Feb 27, 2023
@ursabot
Copy link

ursabot commented Feb 27, 2023

Benchmark runs are scheduled for baseline = 96791ea and contender = dae7a71. dae7a71 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sliceable BitMap
3 participants