-
For example:

```rust
enum LogicalType {
    Int,    // no sizes at this level
    UInt,
    String, // no Large variants
    Date,
    Time,
    Timestamp,
    List(Box<LogicalType>),
    Struct(..),
}
```
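A hedged sketch of how Arrow's concrete `DataType`s might collapse into such a `LogicalType` (the mapping function is hypothetical and only covers a few variants):

```rust
use arrow::datatypes::DataType;

// Hypothetical mapping from Arrow's physical DataType into the LogicalType
// enum sketched above: many physical types collapse into one logical type.
fn logical_type(dt: &DataType) -> LogicalType {
    match dt {
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => LogicalType::Int,
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => LogicalType::UInt,
        // Utf8 and LargeUtf8 are both logically "String"
        DataType::Utf8 | DataType::LargeUtf8 => LogicalType::String,
        // a dictionary is just an encoding of its value type
        DataType::Dictionary(_, value) => logical_type(value),
        // all four time units become one logical Timestamp
        DataType::Timestamp(_, _) => LogicalType::Timestamp,
        other => unimplemented!("not covered in this sketch: {other:?}"),
    }
}
```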
-
I agree with simplifying String so it doesn't have Large/Small or encoding. But why hide integer sizes? Also, how might this intersect with extension types? Could …
-
Thanks @alamb for kicking off the discussion here! I'm very much supportive of this change. If we agree this is the right direction to go, I can also volunteer to work on this in DF (haven't touched it for a long while 😂)
-
Note that the Arrow format already has the terminology of physical layout and logical type. My criteria would be:

- The complexity of adding functions increases with the number of "logical types".
- The complexity of adding core functionality increases with the number of "encodings".

There has already been discussion of the fact that one type can have many different physical layouts (Int64, REE, Dictionary), but note also that two vastly different types can have the same physical layout. For example, both Int64 and Timestamp have the same physical layout (fixed-size-binary<8>).
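In arrow-rs terms (a small, hedged illustration), both arrays below are backed by the same contiguous buffer of `i64` values:

```rust
use arrow::array::{Int64Array, TimestampNanosecondArray};

fn main() {
    // Two logically different types...
    let ints = Int64Array::from(vec![1_700_000_000_000_000_000_i64, 0]);
    let times = TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000_i64, 0]);

    // ...with an identical physical layout: 8 bytes per value.
    assert_eq!(&ints.values()[..], &times.values()[..]);
}
```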
-
Another thing that often comes up when discussing encodings is whether the encoding of a column should be allowed to change throughout the execution of a query. For example, if a column is REE-encoded in one Parquet file but plain-encoded in a different Parquet file, do you really need to normalize them, or can you process them with as little manipulation as possible?
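For the "normalize" option, a hedged sketch using arrow's `cast` kernel to bring every incoming batch to a single encoding (the function is hypothetical; `cast` materializes a new array whenever the encodings differ):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Normalize a string column that may arrive dictionary- or REE-encoded from
// one file and plain-encoded from another, so downstream operators only
// ever see DataType::Utf8.
fn normalize_to_utf8(column: &ArrayRef) -> Result<ArrayRef> {
    if column.data_type() == &DataType::Utf8 {
        Ok(Arc::clone(column)) // already plain: nothing to do
    } else {
        cast(column.as_ref(), &DataType::Utf8) // materializes the plain encoding
    }
}
```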
-
I think this is a good idea. Agree that we should continue to encode size/precision in the logical types (e.g. …). One thing that would be extremely useful for us as well is to allow for extension/user-defined types. So (roughly) something like:

```rust
trait UserDefinedType {
    // for schema validation/type checking
    fn eq(&self, other: &dyn UserDefinedType) -> bool;
}

enum LogicalType {
    ...
    Extension(Box<dyn UserDefinedType>),
}
```

Obviously this could be used as a mechanism to create new, well-defined types, but also a mechanism to allow for dealing with more dynamic data through extension operators and/or udf/udaf.
-
FYI @yukkit has created a PR showing how …
-
This is inspired by a discussion from @sunchao and @tustvold on apache/arrow-rs#4729.

**Background**

In general, the choice of the best encoding to use for any dataset depends on …

DataFusion uses the Arrow type system, by which I mean it uses `DataType` to describe the schema of data as it flows through plans. This choice makes it easy to … However, it also has downsides: encodings such as `DataType::RunEndEncoded` (or the newly incorporated `StringView`) appear as new `DataType`s and require changes in many places in DataFusion (like type casting / coercion), even though many of those changes really only depend on the logical type (String vs Int). This is likely part of the reason DataFusion doesn't yet use REE arrays.
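The pain is visible in any code that asks a logical question of the physical type system; it must enumerate every encoding, and grows every time a new one is added (an illustrative sketch, not actual DataFusion code):

```rust
use arrow::datatypes::DataType;

// "Is this logically a string?" today requires spelling out each physical
// representation; adding StringView or another encoding means another arm.
fn is_logically_string(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => true,
        DataType::Dictionary(_, value) => is_logically_string(value),
        DataType::RunEndEncoded(_, values) => is_logically_string(values.data_type()),
        _ => false,
    }
}
```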
**Idea: `LogicalType`**

The high-level idea would be to introduce a `LogicalType` in DataFusion for the purposes of planning, and keep the physical encodings out of planning. For example, `LogicalType` would not have `TimestampSecond`, `TimestampMillisecond`, `TimestampMicrosecond`, and `TimestampNanosecond`; it would have `Timestamp`.
Benefits
Using
LogicalType
would allow us to move the choice of encoding / decoding into the actual operator and arrow kernels (streams) themselves which could decide to operate directly on dictionaries if that was appropriate or unpack the results if that was better .It is my understanding that this is what DuckDB does
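A hedged sketch of what that could look like inside a kernel: the operator accepts whatever encoding arrives and picks a fast path for dictionaries (the function and the elided dictionary path are hypothetical):

```rust
use arrow::array::{Array, ArrayRef, BooleanArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Hypothetical kernel: does each row equal `needle`? The encoding decision
// lives here, in the kernel, not in the planner.
fn eq_scalar(column: &ArrayRef, needle: &str) -> Result<BooleanArray> {
    match column.data_type() {
        // Fast path: compare the (small) dictionary values once, then map
        // the result through the keys (elided in this sketch).
        DataType::Dictionary(_, _) => todo!("compare dictionary values, remap keys"),
        // General path: normalize to a plain string array and compare rows.
        _ => {
            let plain = cast(column.as_ref(), &DataType::Utf8)?;
            let strings = plain.as_any().downcast_ref::<StringArray>().unwrap();
            Ok(strings.iter().map(|v| v.map(|s| s == needle)).collect())
        }
    }
}
```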
Thoughts?