-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strongly Typed ArrayData #1799
Comments
This sounds like a great way to evolve |
pub struct ArrayData {
/// The data type for this array data
data_type: DataType,
/// The number of elements in this array data
len: usize,
/// The number of null elements in this array data
null_count: usize,
/// The offset into this array data, in number of items
offset: usize,
/// The null bitmap. A `None` value for this indicates all values are non-null in
/// this array.
null_bitmap: Option<Bitmap>,
/// The array data layout
layout: ArrayDataLayout
} It seems like that |
I specifically want to change this, as we can support zero-copy slicing of BitMap, but not Buffer (as you can't slice at the bit-level) - #1802
A couple of reasons:
Basically I see Array and ArrayData filling different roles:
You still need the DataType to roundtrip the actual type, e.g. int32 vs uint32, the Field for nested types, etc... |
The reason that I prefer removing
The first way I thought is that we could inject pub enum ArrayDataLayout {
...
Primitive(type: PrimitiveType, values: Buffer },
Binary (is_large: Boolean, values: Buffer ...},
...
}
pub enum PrimitiveType {
Int32,
Int64,
...
} But this cannot support nested types well. My second thought is that we could refactor enum DataType {
Primitive(type: PrimitiveType)
List(type: ListType)
...
}
enum PrimitiveType {
Int32,
Int64,
...
}
enum ListType {
List(Box<Field>),
FixedSizeList(Box<Field>, i32),
LargeList(Box<Field>),
} I guess this could decrease the workload of |
I think this would break the desired separation of logical vs physical types. We don't want callers to have to match on all the possible
I don't think this is something to optimise for, we can easily add a cheap sanity check on construction. Tbh I don't think it would be all that bad |
I like it! One additional idea would be to change the layout of Boolean to use a Bitmap instead of Buffer:
If the Bitmap itself then stores a bit offset, we could remove the offset from ArrayData. Slicing of Bitmap would then always be zero-copy and slicing an array would push down the slicing into the validity buffer and layout. This could in theory lead to a situation where you have different bit offsets for validity and data in a BooleanArray. Probably not a problem for any compute kernels, but could require copying in the FFI layer. |
That's actually a typo, I meant to do exactly that and store Booleans as Bitmap 😅 will update |
Having worked through an implementation of this, the additional branching on operations that used to be free, e.g. looking up the datatype or null buffer, causes some quite serious performance regressions. Whilst it is possible to eliminate these, it turns into a game of performance wack-a-mole. Whilst I still think the design as articulated here has some compelling advantages, pragmatically I don't have the time or the inclination to work through every implementation fixing such regressions. Taking a step back, the enumerations are not strictly necessary to achieve the goal of a type-safe, low-level data abstraction that can form a common basis between arrow and arrow2, as articulated here. Instead the modified plan is as follows:
We can then slowly reduce the usage of Edit: On further reflection there may be an even simpler option |
TLDR
Make ArrayData layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept which has proven to be rather error prone.
This is also the enabling feature that will support easy and zero cost interoperability between arrow-rs and arrow2 -- see jorgecarleitao/arrow2#1429
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently
ArrayData
is defined as follows.This is simple, but has a couple of caveats:
BooleanArray
asBitMap
vsBuffer
, which would allow removingoffset
Describe the solution you'd like
Introduce a new
ArrayDataLayout
enumeration:We could then progressively deprecate the methods that explicitly refer to buffers by index, etc...
Describe alternatives you've considered
We could not do this
Additional context
This could be seen as an evolution of @HaoYang670 's proposal in #1640
It also relates to @jhorstmann 's proposal on #1499 (comment)
It could also be seen as an interpretation of the arrow2 physical vs logical type separation.
The text was updated successfully, but these errors were encountered: