Merge: Filtering on partitions #1918

emcake · 2023-11-28T22:25:35Z

Description

I'd like merge to offer a partitions argument to reduce table churn and processing.

Use Case

Currently, the merge operation consumes the whole table when merging in new source data, which rewrites every file in the table: https://gist.github.com/emcake/acc1aa233339a5b3534e2f54702dd46e

For large tables where the updated surface area is small, this is inefficient. I propose a new parameter, partitions: Optional[Union[List[PartitionValues], Literal['auto']]] = None, that restricts the merge operation to a given list of partitions.

Usage:

# current behaviour
table.merge(source_data, partitions=None) 

# restrict merge to files with the listed PartitionValues. If data in source_data is outside these partitions, it's dropped.
table.merge(source_data, partitions=[{ 'col_a' : 1, 'col_b' : 'foo' }, {'col_a' : 1, 'col_b' : 'bar'}, ...]) 

# use the table partition columns and find all distinct tuples of values in source_data
table.merge(source_data, partitions='auto')
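
As a rough illustration of what partitions='auto' would have to compute, the sketch below collects the distinct tuples of partition values from the source data. It is not part of any library: distinct_partitions is a hypothetical helper, the source data is modelled as a plain list of dicts, and the column names are taken from the usage above.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Alias mirroring the PartitionValues name used in the proposal (assumed shape).
PartitionValues = Dict[str, Any]


def distinct_partitions(
    rows: Iterable[Dict[str, Any]],
    partition_columns: Sequence[str],
) -> List[PartitionValues]:
    """Return each distinct combination of partition values, in first-seen order."""
    seen = set()
    result: List[PartitionValues] = []
    for row in rows:
        # The tuple of partition values identifies the partition this row belongs to.
        key = tuple(row[col] for col in partition_columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(partition_columns, key)))
    return result


# Hypothetical source data; only col_a and col_b are partition columns.
source_rows = [
    {"col_a": 1, "col_b": "foo", "value": 10},
    {"col_a": 1, "col_b": "bar", "value": 20},
    {"col_a": 1, "col_b": "foo", "value": 30},
]

partitions = distinct_partitions(source_rows, ["col_a", "col_b"])
print(partitions)
# [{'col_a': 1, 'col_b': 'foo'}, {'col_a': 1, 'col_b': 'bar'}]
```

The resulting list has the same shape as the explicit partitions=[...] argument shown above, so the 'auto' mode could be treated as a shorthand for it.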

Related Issue(s)


thomasfrederikhoeck · 2023-11-29T08:02:57Z

Related to #1846. Ideally, I think the predicate should be used as in the example, instead of adding a separate keyword.

ion-elgreco · 2023-11-29T08:07:28Z

@thomasfrederikhoeck I agree, the predicate can be used together with statistics to do file pruning. @Blajda also mentions this in that issue.

Closing this as duplicate.
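
For readers following the predicate route suggested above, here is a minimal sketch of how a list of partition values could be folded into a merge predicate string. It assumes merge accepts a SQL-like predicate with a target alias; partition_predicate and sql_literal are hypothetical helpers, the id join column is invented for illustration, and whether files are actually pruned depends on the engine using partition values and statistics as discussed in #1846.

```python
from typing import Any, Dict, List


def sql_literal(value: Any) -> str:
    """Render a value as a SQL literal (this sketch handles only strings and numbers)."""
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def partition_predicate(
    partitions: List[Dict[str, Any]],
    target_alias: str = "target",
) -> str:
    """Build an OR of per-partition AND clauses over the target's partition columns."""
    if not partitions:
        raise ValueError("at least one partition is required")
    clauses = []
    for part in partitions:
        terms = [
            f"{target_alias}.{col} = {sql_literal(val)}"
            for col, val in part.items()
        ]
        clauses.append("(" + " AND ".join(terms) + ")")
    return "(" + " OR ".join(clauses) + ")"


# Hypothetical join condition; "id" is an assumed key column.
join_condition = "target.id = source.id"
partition_filter = partition_predicate(
    [{"col_a": 1, "col_b": "foo"}, {"col_a": 1, "col_b": "bar"}]
)
predicate = join_condition + " AND " + partition_filter
print(predicate)
```

The combined string would then be passed as the merge predicate, which keeps the restriction in one place rather than in a separate keyword.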

emcake added the enhancement (New feature or request) label on Nov 28, 2023

ion-elgreco closed this as not planned on Nov 29, 2023
