You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What happened:
The default behavior of large_dtypes=True on merge() causes partition pruning to not work correctly for strings. This is my understanding of why this happens:
large_dtypes=True causes the source table strings to be casted to LargeUTF8. Thus, when datafusion optimizer runs type_coercion step on the physical plan, it goes from:
This means that the pruning predicate is not actually pruning any files. This is evident via the ParquetExec list of files, which shows all files, regardless of if they match the partition or not.
If you set large_dtypes=False, you see the following after the type_coercion optimization step:
TableScan: t, partial_filters=[CAST(Utf8("a") AS Dictionary(UInt16, Utf8)) = p]
Because there is no cast on the right-hand-side, the partition pruning works, and the ParquetExec only has the files from the a partition.
Before, the extra Filter means that even though the ParquetExec had all files, we could still filter out row groups based on the metadata. However, without this, we have to actually load in all the data, which is the symptom I experienced - this renders #1958 ineffective.
The correct resolution seems to be to add support in DataFusion's pruning for comparing between strings and large strings.
For my use case, setting large_dtypes=False seems to be a workaround.
The text was updated successfully, but these errors were encountered:
@echai58 I am going to close this because I ran into the same issue recently (#2844) and I believe that the default larger_dtypes setting was incorrect/problematic.
Please let me know if you believe a different issue persists here
Environment
Delta-rs version: 0.18.1
Binding: python
Bug
What happened:
The default behavior of
large_dtypes=True
onmerge()
causes partition pruning to not work correctly for strings. This is my understanding of why this happens:large_dtypes=True
causes the source table strings to be casted toLargeUTF8
. Thus, when datafusion optimizer runstype_coercion
step on the physical plan, it goes from:to
Then, datafusions pruning does not support casts of non numeric types:
https://github.com/apache/datafusion/blob/35.0.0/datafusion/core/src/physical_optimizer/pruning.rs#L813-L832
This means that the pruning predicate is not actually pruning any files. This is evident via the
ParquetExec
list of files, which shows all files, regardless of if they match the partition or not.If you set
large_dtypes=False
, you see the following after thetype_coercion
optimization step:Because there is no cast on the right-hand-side, the partition pruning works, and the
ParquetExec
only has the files from thea
partition.This also seems to be inadvertent an side-effect of the following change in #2326 cc @Blajda
https://github.com/delta-io/delta-rs/pull/2326/files#diff-12f59fe3c4440b7ae4ee1a5ac810b42c1d7357c246aae7b5770e840e52d3ec52L1036-R1039
Before, the extra
Filter
means that even though theParquetExec
had all files, we could still filter out row groups based on the metadata. However, without this, we have to actually load in all the data, which is the symptom I experienced - this renders #1958 ineffective.The correct resolution seems to be to add support in DataFusion's
pruning
for comparing between strings and large strings.For my use case, setting
large_dtypes=False
seems to be a workaround.The text was updated successfully, but these errors were encountered: