Merge: Filtering on partitions #1918

Closed
emcake opened this issue Nov 28, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@emcake
Contributor

emcake commented Nov 28, 2023

Description

I'd like merge to offer a partitions argument to reduce table churn and processing.

Use Case

Currently the merge operation consumes the whole table and merges in new source data, which re-writes every file in the table: https://gist.github.com/emcake/acc1aa233339a5b3534e2f54702dd46e

For large tables where the updated surface area is small, this is inefficient. I propose a new parameter, partitions: Optional[Union[List[PartitionValues], Literal['auto']]] = None, which provides a list of partitions that the merge operation is restricted to.

Usage:

# current behaviour
table.merge(source_data, partitions=None) 

# restrict merge to files with the listed PartitionValues. If data in source_data is outside these partitions, it's dropped.
table.merge(source_data, partitions=[{ 'col_a' : 1, 'col_b' : 'foo' }, {'col_a' : 1, 'col_b' : 'bar'}, ...]) 

# use the table partition columns and find all distinct tuples of values in source_data
table.merge(source_data, partitions='auto') 
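
To make the 'auto' option more concrete, here is a minimal sketch of how the distinct partition tuples could be collected from the source data. The helper name `distinct_partition_values` and the row-as-dict representation are assumptions for illustration only; nothing here is an existing deltalake API.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Hypothetical alias mirroring the PartitionValues type named in the proposal.
PartitionValues = Dict[str, Any]


def distinct_partition_values(
    rows: Iterable[Dict[str, Any]],
    partition_columns: Sequence[str],
) -> List[PartitionValues]:
    """Return each distinct combination of partition-column values, in first-seen order."""
    seen = set()
    result: List[PartitionValues] = []
    for row in rows:
        # The tuple of partition values identifies the partition this row belongs to.
        key = tuple(row[col] for col in partition_columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(partition_columns, key)))
    return result


# Illustrative source rows; 'value' is a non-partition column.
source_rows = [
    {"col_a": 1, "col_b": "foo", "value": 10},
    {"col_a": 1, "col_b": "bar", "value": 20},
    {"col_a": 1, "col_b": "foo", "value": 30},
]
auto_partitions = distinct_partition_values(source_rows, ["col_a", "col_b"])
```

With the rows above, `auto_partitions` contains two entries, one per distinct (col_a, col_b) pair, which is the list that an explicit partitions argument would otherwise have to spell out by hand.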

Related Issue(s)

@emcake added the enhancement label Nov 28, 2023
@thomasfrederikhoeck
Contributor

Related to #1846. I think ideally the predicate is used as in the example, instead of having a separate keyword.

@ion-elgreco
Collaborator

@thomasfrederikhoeck I agree, the predicate can be used together with statistics to do file pruning. @Blajda mentions this also in that issue.

Closing this as duplicate.
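
For readers landing here, a rough sketch of the predicate-based alternative: the partition restriction is expressed as part of the merge predicate rather than as a separate keyword. The helper `partition_predicate` and the `target` alias are hypothetical, and whether a given version actually prunes files from such a predicate is an assumption to verify against #1846.

```python
from typing import Any, Dict, List


def sql_literal(value: Any) -> str:
    """Render a Python value as a SQL literal (strings quoted, quotes doubled)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def partition_predicate(partitions: List[Dict[str, Any]], target_alias: str) -> str:
    """Build an OR of AND-ed equality checks, one group per partition (hypothetical helper)."""
    groups = []
    for partition in partitions:
        checks = [
            f"{target_alias}.{column} = {sql_literal(value)}"
            for column, value in partition.items()
        ]
        groups.append("(" + " AND ".join(checks) + ")")
    return " OR ".join(groups)


partition_filter = partition_predicate(
    [{"col_a": 1, "col_b": "foo"}, {"col_a": 1, "col_b": "bar"}],
    target_alias="target",
)
# Combined with an illustrative join condition on a hypothetical 'id' column:
merge_predicate = f"source.id = target.id AND ({partition_filter})"
```

The resulting string can then be passed as the merge predicate, so the restriction lives in the same place as the join condition.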

@ion-elgreco closed this as not planned Nov 29, 2023