Large Types breaks merge predicate pruning #2632

echai58 · 2024-06-28T17:05:59Z

Environment

Delta-rs version: 0.18.1

Binding: python

Bug

What happened:
The default behavior of large_dtypes=True on merge() causes partition pruning to not work correctly for strings. This is my understanding of why this happens:

large_dtypes=True causes the source table strings to be casted to LargeUTF8. Thus, when datafusion optimizer runs type_coercion step on the physical plan, it goes from:

TableScan: t, partial_filters=[LargeUtf8("a") = p]

to

TableScan: t, partial_filters=[LargeUtf8("a") = CAST(p AS LargeUtf8)]

Then, datafusions pruning does not support casts of non numeric types:
https://github.com/apache/datafusion/blob/35.0.0/datafusion/core/src/physical_optimizer/pruning.rs#L813-L832

This means that the pruning predicate is not actually pruning any files. This is evident via the ParquetExec list of files, which shows all files, regardless of if they match the partition or not.

If you set large_dtypes=False, you see the following after the type_coercion optimization step:

TableScan: t, partial_filters=[CAST(Utf8("a") AS Dictionary(UInt16, Utf8)) = p]

Because there is no cast on the right-hand-side, the partition pruning works, and the ParquetExec only has the files from the a partition.

This also seems to be inadvertent an side-effect of the following change in #2326 cc @Blajda
https://github.com/delta-io/delta-rs/pull/2326/files#diff-12f59fe3c4440b7ae4ee1a5ac810b42c1d7357c246aae7b5770e840e52d3ec52L1036-R1039

Before, the extra Filter means that even though the ParquetExec had all files, we could still filter out row groups based on the metadata. However, without this, we have to actually load in all the data, which is the symptom I experienced - this renders #1958 ineffective.

The correct resolution seems to be to add support in DataFusion's pruning for comparing between strings and large strings.

For my use case, setting large_dtypes=False seems to be a workaround.

The text was updated successfully, but these errors were encountered:

rtyler · 2024-07-05T15:43:53Z

@echai58 I am going to close this because I ran into the same issue recently (#2844) and I believe that the default larger_dtypes setting was incorrect/problematic.

Please let me know if you believe a different issue persists here

echai58 added the bug Something isn't working label Jun 28, 2024

rtyler closed this as completed Jul 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large Types breaks merge predicate pruning #2632

Large Types breaks merge predicate pruning #2632

echai58 commented Jun 28, 2024

rtyler commented Jul 5, 2024

Large Types breaks merge predicate pruning #2632

Large Types breaks merge predicate pruning #2632

Comments

echai58 commented Jun 28, 2024

Environment

Bug

rtyler commented Jul 5, 2024