Strongly Typed ArrayData #1799

tustvold · 2022-06-06T11:55:19Z

TLDR

Make the ArrayData layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error-prone.

This is also the enabling feature that will support easy and zero cost interoperability between arrow-rs and arrow2 -- see jorgecarleitao/arrow2#1429

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently ArrayData is defined as follows.

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,

    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}

This is simple, but has a couple of caveats:

It isn't clear what is present for specific layout types
There is no clear path to storing BooleanArray as BitMap vs Buffer, which would allow removing offset
Vec allocations for one or two elements (the C++ implementation inlines these)
There is potential for accidentally interpreting a buffer incorrectly

Describe the solution you'd like

Introduce a new ArrayDataLayout enumeration:

pub enum ArrayDataLayout {
  Boolean { values: Bitmap },
  Primitive { values: Buffer },
  Offsets { offsets: Buffer, values: Buffer },
  Dictionary { keys: Buffer, values: ArrayData },
  List { offsets: Buffer, elements: ArrayData },
  Struct { children: Vec<ArrayData> },
  Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout
}

We could then progressively deprecate the methods that explicitly refer to buffers by index, etc...
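
As a rough illustration of the benefit, the following sketch uses stand-in `Buffer` and `Bitmap` types (not the real arrow-rs ones) and a trimmed-down version of the enumeration without child data. The `byte_len` helper is hypothetical; it only shows that code can reach each buffer by name through a match, rather than by position in a `Vec<Buffer>`, and without knowing the logical type.

```rust
// Stand-in types for illustration only; these are not the arrow-rs definitions.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

#[derive(Debug, Clone, PartialEq)]
pub struct Bitmap(pub Vec<u8>);

// A trimmed-down version of the proposed enumeration (nested variants omitted).
#[derive(Debug, Clone, PartialEq)]
pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
}

// Hypothetical helper: total bytes held by a layout. Each buffer is reached
// by field name, so there is no index to get wrong.
pub fn byte_len(layout: &ArrayDataLayout) -> usize {
    match layout {
        ArrayDataLayout::Boolean { values } => values.0.len(),
        ArrayDataLayout::Primitive { values } => values.0.len(),
        ArrayDataLayout::Offsets { offsets, values } => offsets.0.len() + values.0.len(),
    }
}
```

Because the match is exhaustive, adding a new layout variant would force every such helper to be updated at compile time.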

Describe alternatives you've considered

We could not do this

Additional context

This could be seen as an evolution of @HaoYang670 's proposal in #1640

It also relates to @jhorstmann 's proposal on #1499 (comment)

It could also be seen as an interpretation of the arrow2 physical vs logical type separation.


alamb · 2022-06-06T12:08:30Z

This sounds like a great way to evolve

HaoYang670 · 2022-06-06T12:19:33Z

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout
}

It seems that ArrayData::data_type is redundant, because ArrayData::layout can also tell you the type of the array?

tustvold · 2022-06-06T12:45:21Z

In arrow-rs currently, as bitwise operations are related to Buffer but not BitMap, I guess Buffer is a better type for BooleanArray.

I specifically want to change this, as we can support zero-copy slicing of BitMap, but not Buffer (as you can't slice at the bit-level) - #1802

Curiously, why not directly declare each type of Array with Buffers, for example:

A couple of reasons:

There are operations that don't and shouldn't care about the logical type, e.g. IPC, FFI, nullif, etc... ArrayData provides this
Buffer is not strongly typed and so is not a drop-in replacement for RawPtrBox (which I also have a separate plan to tweak)
It reduces code churn; this could theoretically not even be a breaking change

Basically I see Array and ArrayData filling different roles:

Array, ArrayBuilder, etc... are user-facing and should prioritise providing a strongly-typed, idiomatic, safe API for users
ArrayData is the low level API, with a focus on interoperability with other arrow systems

It seems that ArrayData::data_type is redundant, because ArrayData::layout can also tell you the type of the array?

You still need the DataType to roundtrip the actual type, e.g. int32 vs uint32, the Field for nested types, etc...
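
To make the int32 vs uint32 point concrete, here is a small sketch with two hypothetical helper functions. It shows the same four bytes decoding to different values depending on the logical type, which is why a physical layout alone would not be enough to round-trip the type.

```rust
// Hypothetical helpers: the same four little-endian bytes read under two
// different logical types. The physical layout is identical in both cases.
pub fn as_int32(bytes: [u8; 4]) -> i32 {
    i32::from_le_bytes(bytes)
}

pub fn as_uint32(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}
```

With the bytes `[0xFF; 4]`, `as_int32` yields -1 while `as_uint32` yields 4294967295.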

HaoYang670 · 2022-06-06T13:14:54Z

The reason I prefer removing ArrayData::data_type is that it introduces the possibility of inconsistency between ArrayData::data_type and ArrayData::layout. This could also increase the workload of ArrayData::validate (lots of pattern matching ...).

You still need the DataType to roundtrip the actual type, e.g. int32 vs uint32, the Field for nested types, etc...

My first thought was that we could embed the data type in ArrayDataLayout. For example:

pub enum ArrayDataLayout {
  ...
  Primitive { data_type: PrimitiveType, values: Buffer },
  Binary { is_large: bool, values: Buffer, ... },
  ...
}

pub enum PrimitiveType {
    Int32,
    Int64,
    ...
}

But this cannot support nested types well.

My second thought is that we could refactor DataType like this:

enum DataType {
    Primitive(PrimitiveType),
    List(ListType),
    ...
}

enum PrimitiveType {
    Int32,
    Int64,
    ...
}

enum ListType {
    List(Box<Field>),
    FixedSizeList(Box<Field>, i32),
    LargeList(Box<Field>),
}

I guess this could decrease the workload of ArrayData::validate.

tustvold · 2022-06-06T13:25:59Z

My second thought is that we could refactor DataType like this:

I think this would break the desired separation of logical vs physical types. We don't want callers to have to match on all the possible DataType variants in order to interpret the buffers. An IPC writer shouldn't care if it is a primitive array of floats, or int32, just that it is a buffer of n bytes.

introduces the possibility of the inconsistency between ArrayData::data_type and ArrayData::layout
I guess this could decrease the workload of ArrayData::validate.

I don't think this is something to optimise for; we can easily add a cheap sanity check on construction. Tbh I don't think it would be all that bad: a match on data_type would just change to a match on (data_type, layout). I would be extremely hesitant to make fundamental changes to DataType.
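
A minimal sketch of such a construction-time check, assuming heavily reduced stand-in enums (the real DataType has many more variants, and `LayoutKind` and `check_consistent` are hypothetical names used only here):

```rust
// Stand-in enums for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Boolean,
    Int32,
    UInt32,
    Utf8,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LayoutKind {
    Boolean,
    Primitive,
    Offsets,
}

// Hypothetical sanity check: match on the pair instead of on data_type alone.
pub fn check_consistent(data_type: DataType, layout: LayoutKind) -> Result<(), String> {
    match (data_type, layout) {
        (DataType::Boolean, LayoutKind::Boolean) => Ok(()),
        (DataType::Int32 | DataType::UInt32, LayoutKind::Primitive) => Ok(()),
        (DataType::Utf8, LayoutKind::Offsets) => Ok(()),
        (dt, l) => Err(format!("{dt:?} cannot use the {l:?} layout")),
    }
}
```

The check is a single match with no buffer access, so its cost is independent of the array length.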

jhorstmann · 2022-06-07T08:21:37Z

I like it! One additional idea would be to change the layout of Boolean to use a Bitmap instead of Buffer:

pub enum ArrayDataLayout {
  Boolean { values: Bitmap },
  ...
}

If the Bitmap itself then stores a bit offset, we could remove the offset from ArrayData. Slicing of Bitmap would then always be zero-copy and slicing an array would push down the slicing into the validity buffer and layout.

This could in theory lead to a situation where you have different bit offsets for validity and data in a BooleanArray. Probably not a problem for any compute kernels, but could require copying in the FFI layer.
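
The idea of a Bitmap that carries its own bit offset can be sketched as follows. This is a stand-in type written for illustration, not the arrow-rs Bitmap; the field names and the `new`, `get`, and `slice` methods are assumptions of the sketch.

```rust
use std::sync::Arc;

// Stand-in type for illustration; not the arrow-rs Bitmap.
#[derive(Debug, Clone)]
pub struct Bitmap {
    bytes: Arc<[u8]>, // shared storage, never copied by `slice`
    offset: usize,    // offset in bits from the start of `bytes`
    len: usize,       // length in bits
}

impl Bitmap {
    pub fn new(bytes: Vec<u8>, len: usize) -> Self {
        assert!(len <= bytes.len() * 8, "length exceeds buffer");
        Self { bytes: bytes.into(), offset: 0, len }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    // Reads bit `i` using Arrow's least-significant-bit-first numbering.
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        let bit = self.offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }

    // Zero-copy slice: only the bit offset and length change.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self { bytes: Arc::clone(&self.bytes), offset: self.offset + offset, len }
    }
}
```

Slices compose by adding offsets, so a slice can start at any bit position. That is also how the validity and value bitmaps of a BooleanArray could end up with different bit offsets, as noted above.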

tustvold · 2022-06-07T08:23:31Z

That's actually a typo, I meant to do exactly that and store Booleans as Bitmap 😅 will update

tustvold · 2023-03-16T17:46:28Z

Having worked through an implementation of this, the additional branching on operations that used to be free, e.g. looking up the datatype or null buffer, causes some quite serious performance regressions. Whilst it is possible to eliminate these, it turns into a game of performance whack-a-mole. Whilst I still think the design as articulated here has some compelling advantages, pragmatically I don't have the time or the inclination to work through every implementation fixing such regressions.

Taking a step back, the enumerations are not strictly necessary to achieve the goal of a type-safe, low-level data abstraction that can form a common basis between arrow and arrow2, as articulated here.

Instead the modified plan is as follows:

ArrayData remains as is without modification
The ArrayData enumerations and associated trait plumbing are removed
The strongly typed ArrayData structures, (e.g. PrimitiveArrayData, BytesArrayData), are made public
Provide From conversions between ArrayData and *ArrayData, etc...
Provide From conversions between *Array and *ArrayData
Implement From conversions between ArrayData and *Array via *ArrayData

We can then slowly reduce the usage of ArrayData within the codebase, in favor of *Array and *ArrayData. This achieves the stated goals, without requiring everything to move at the same time.
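
The conversion pattern in this plan might look roughly like the sketch below. All types are stand-ins written for illustration, and `Int32ArrayData` is a hypothetical, non-generic stand-in for something like PrimitiveArrayData. The plan above speaks of From conversions; this sketch assumes the untyped-to-typed direction is fallible and uses TryFrom for it.

```rust
use std::convert::TryFrom;

// Stand-in types for illustration only; not the real arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
}

// Untyped representation: buffers are addressed by position.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrayData {
    pub data_type: DataType,
    pub len: usize,
    pub buffers: Vec<Vec<u8>>,
}

// Hypothetical strongly typed counterpart: exactly one values buffer.
#[derive(Debug, Clone, PartialEq)]
pub struct Int32ArrayData {
    pub values: Vec<i32>,
}

impl From<Int32ArrayData> for ArrayData {
    fn from(typed: Int32ArrayData) -> Self {
        let bytes: Vec<u8> = typed.values.iter().flat_map(|v| v.to_le_bytes()).collect();
        ArrayData { data_type: DataType::Int32, len: typed.values.len(), buffers: vec![bytes] }
    }
}

impl TryFrom<ArrayData> for Int32ArrayData {
    type Error = String;

    // Validation happens once here; typed code can then skip it.
    fn try_from(data: ArrayData) -> Result<Self, Self::Error> {
        if data.data_type != DataType::Int32 || data.buffers.len() != 1 {
            return Err(format!("expected Int32 with one buffer, got {:?}", data.data_type));
        }
        let bytes = &data.buffers[0];
        if bytes.len() != data.len * 4 {
            return Err("buffer length does not match len".to_string());
        }
        let values = bytes
            .chunks_exact(4)
            .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok(Int32ArrayData { values })
    }
}
```

Code that still needs the untyped form (IPC, FFI) can keep using ArrayData, while other code converts once at the boundary and works with the typed structure afterwards.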

Edit: On further reflection there may be an even simpler option

tustvold · 2023-03-17T12:52:44Z

Closing in favour of #3880 and #3879

tustvold added the question, enhancement, and api-change labels Jun 6, 2022

tustvold mentioned this issue Jun 6, 2022

Sliceable BitMap #1802

Closed

This was referenced Jun 7, 2022

Replace RawPtrBox with Safe Abstraction #1811

Closed

Fix Decimal and List ArrayData Validation (#1813) (#1814) #1816

Merged

This was referenced Jul 15, 2022

Support Sliced Complex Types In IPC Writer #2080

Closed

Handle offset consistently for StructArray (#1750) #2085

Closed

Inconsistent Slicing of ArrayData for StructArray #1750

Closed

tustvold mentioned this issue Feb 15, 2023

Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176

Closed

tustvold self-assigned this Feb 22, 2023

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 1, 2023

Return Buffers from ArrayData::buffers instead of slice (apache#1799)

7d7e266

tustvold mentioned this issue Mar 1, 2023

Return Buffers from ArrayData::buffers instead of slice (#1799) #3783

Merged

tustvold added a commit that referenced this issue Mar 2, 2023

Return Buffers from ArrayData::buffers instead of slice (#1799) (#3783)

231ae9b

* Return Buffers from ArrayData::buffers instead of slice (#1799) * Clippy

tustvold mentioned this issue Mar 7, 2023

Restrict DictionaryArray to ArrowDictionaryKeyType #3813

Merged

alamb mentioned this issue Mar 7, 2023

[Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository jorgecarleitao/arrow2#1429

Open

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 7, 2023

Add RunEndBuffer (apache#1799)

5347a44

tustvold mentioned this issue Mar 7, 2023

Add RunEndBuffer (#1799) #3817

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023

Add ArrayDataLayout (apache#1799)

e741c29

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023

Add ArrayDataLayout (apache#1799)

f4baea8

tustvold added a commit that referenced this issue Mar 8, 2023

Add RunEndBuffer (#1799) (#3817)

36f2db3

* Add RunEndBuffer (#1799) * Fix test * Revert rename * Format * Clippy * Remove unnecessary check * Fix * Tweak docs * Add docs

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023

Add ArrayDataLayout (apache#1799)

578d8a4

alamb mentioned this issue Mar 9, 2023

Add ArrayDataLayout, port validation (#1799) #3818

Merged

tustvold added a commit that referenced this issue Mar 9, 2023

Add ArrayDataLayout, port validation (#1799) (#3818)

495682a

* Add ArrayDataLayout (#1799) * Fix ArrayData::buffer * Don't export macros, yet * Fix doc * Review feedback * Further review feedback

tustvold mentioned this issue Mar 14, 2023

Add BitIterator #3856

Merged

tustvold changed the title from "ArrayData Layout Enumeration" to "Strongly Typed ArrayData" Mar 16, 2023

This was referenced Mar 16, 2023

Rework Strongly Typed ArrayData Abstractions #3877

Closed

Array Destructuring APIs #3879

Closed

First-Class Array Abstractions #3880

Closed

tustvold closed this as not planned Mar 17, 2023

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 21, 2023

Revert structured ArrayData (apache#1799)

9b8d7ae

tustvold mentioned this issue Mar 21, 2023

Revert structured ArrayData (#3877) #3894

Merged

tustvold closed this as completed in #3894 Mar 21, 2023

tustvold added a commit that referenced this issue Mar 21, 2023

Revert structured ArrayData (#1799) (#3894)

ae4db60

spebern pushed a commit to spebern/arrow-rs that referenced this issue Mar 25, 2023

Revert structured ArrayData (apache#1799) (apache#3894)

6ad251b

tustvold mentioned this issue Jul 29, 2023

Cleanup ArrayData::buffers #4583

Merged

alamb mentioned this issue Aug 5, 2024

Implement full validation for UnionArrays construction from ArrayData #1486

Closed
