-
For example:

```rust
enum LogicalType {
    Int,    // no sizes at this level
    UInt,
    String, // no Large variants
    Date,
    Time,
    Timestamp,
    List(Box<LogicalType>),
    Struct(..),
}
```
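A hedged sketch of how Arrow's concrete `DataType`s might collapse into such a `LogicalType` (the mapping function is hypothetical and only covers a few variants):

```rust
use arrow::datatypes::DataType;

// Hypothetical mapping from Arrow's physical DataType into the LogicalType
// enum sketched above: many physical types collapse into one logical type.
fn logical_type(dt: &DataType) -> LogicalType {
    match dt {
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => LogicalType::Int,
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => LogicalType::UInt,
        // Utf8 and LargeUtf8 are both logically "String"
        DataType::Utf8 | DataType::LargeUtf8 => LogicalType::String,
        // a dictionary is just an encoding of its value type
        DataType::Dictionary(_, value) => logical_type(value),
        // all four time units become one logical Timestamp
        DataType::Timestamp(_, _) => LogicalType::Timestamp,
        other => unimplemented!("not covered in this sketch: {other:?}"),
    }
}
```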
-
I agree with simplifying String so it doesn't have Large/Small or encoding. But why hide integer sizes? Also, how might this intersect with extension types? Could …
-
Thanks @alamb for kicking off the discussion here! I'm very much supportive of this change. If we agree this is the right direction to go, I can also volunteer to work on this in DF (haven't touched it for a long while 😂)
-
Note that the Arrow format already has the terminology of physical layout and logical type. My criteria would be:

- The complexity of adding functions increases with the number of "logical types".
- The complexity of adding core functionality increases with the number of "encodings".

There has already been discussion of the fact that one type can have many different physical layouts (Int64, REE, Dictionary), but note also that two vastly different types can have the same physical layout. For example, both Int64 and Timestamp have the same physical layout (fixed-size-binary<8>).
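In arrow-rs terms (a small, hedged illustration), both arrays below are backed by the same contiguous buffer of `i64` values:

```rust
use arrow::array::{Int64Array, TimestampNanosecondArray};

fn main() {
    // Two logically different types...
    let ints = Int64Array::from(vec![1_700_000_000_000_000_000_i64, 0]);
    let times = TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000_i64, 0]);

    // ...with an identical physical layout: 8 bytes per value.
    assert_eq!(&ints.values()[..], &times.values()[..]);
}
```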
-
Another thing that often comes up when discussing encodings is whether the encoding of a column should be allowed to change throughout the execution of a query. For example, if a column is REE-encoded in one Parquet file but plain-encoded in a different Parquet file, do you really need to normalize them, or can you process them with as little manipulation as possible?
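For the "normalize" option, a hedged sketch using arrow's `cast` kernel to bring every incoming batch to a single encoding (the function is hypothetical; `cast` materializes a new array whenever the encodings differ):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Normalize a string column that may arrive dictionary- or REE-encoded from
// one file and plain-encoded from another, so downstream operators only
// ever see DataType::Utf8.
fn normalize_to_utf8(column: &ArrayRef) -> Result<ArrayRef> {
    if column.data_type() == &DataType::Utf8 {
        Ok(Arc::clone(column)) // already plain: nothing to do
    } else {
        cast(column.as_ref(), &DataType::Utf8) // materializes the plain encoding
    }
}
```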
-
I think this is a good idea. Agree that we should continue to encode size/precision in the logical types (e.g. …). One thing that would be extremely useful for us as well is to allow for extension/user-defined types. So (roughly) something like:

```rust
trait UserDefinedType {
    // for schema validation/type checking
    fn eq(&self, other: &dyn UserDefinedType) -> bool;
}

enum LogicalType {
    ...
    Extension(Box<dyn UserDefinedType>),
}
```

Obviously this could be used as a mechanism to create new, well-defined types, but also a mechanism to allow for dealing with more dynamic data through extension operators and/or udf/udaf.
-
FYI @yukkit has created a PR showing how …
-
This is inspired by a discussion from @sunchao and @tustvold on apache/arrow-rs#4729.

**Background**

In general, the choice of the best encoding to use for any dataset depends on …

DataFusion uses the Arrow type system, by which I mean it uses `DataType` to describe the schema of data as it flows through plans. This choice makes it easy to … However, it also has downsides: encodings such as `DataType::RunEndEncoded` (or the newly incorporated `StringView`) appear as new `DataType`s and require changes in many places in DataFusion (like type casting / coercion), even though many of those changes really only depend on the logical type (String vs Int). This is likely part of the reason DataFusion doesn't yet use REE arrays.
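The pain is visible in any code that asks a logical question of the physical type system; it must enumerate every encoding, and grows every time a new one is added (an illustrative sketch, not actual DataFusion code):

```rust
use arrow::datatypes::DataType;

// "Is this logically a string?" today requires spelling out each physical
// representation; adding StringView or another encoding means another arm.
fn is_logically_string(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => true,
        DataType::Dictionary(_, value) => is_logically_string(value),
        DataType::RunEndEncoded(_, values) => is_logically_string(values.data_type()),
        _ => false,
    }
}
```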
**Idea: `LogicalType`**

The high-level idea would be to introduce a `LogicalType` in DataFusion for the purposes of planning, and keep the physical encodings out of planning. For example, `LogicalType` would not have `TimestampSecond`, `TimestampMillisecond`, `TimestampMicrosecond`, and `TimestampNanosecond`; it would have `Timestamp`.
Benefits
Using
LogicalType
would allow us to move the choice of encoding / decoding into the actual operator and arrow kernels (streams) themselves which could decide to operate directly on dictionaries if that was appropriate or unpack the results if that was better .It is my understanding that this is what DuckDB does
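A hedged sketch of what that could look like inside a kernel: the operator accepts whatever encoding arrives and picks a fast path for dictionaries (the function and the elided dictionary path are hypothetical):

```rust
use arrow::array::{Array, ArrayRef, BooleanArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Hypothetical kernel: does each row equal `needle`? The encoding decision
// lives here, in the kernel, not in the planner.
fn eq_scalar(column: &ArrayRef, needle: &str) -> Result<BooleanArray> {
    match column.data_type() {
        // Fast path: compare the (small) dictionary values once, then map
        // the result through the keys (elided in this sketch).
        DataType::Dictionary(_, _) => todo!("compare dictionary values, remap keys"),
        // General path: normalize to a plain string array and compare rows.
        _ => {
            let plain = cast(column.as_ref(), &DataType::Utf8)?;
            let strings = plain.as_any().downcast_ref::<StringArray>().unwrap();
            Ok(strings.iter().map(|v| v.map(|s| s == needle)).collect())
        }
    }
}
```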
Thoughts?