Eliminate redundant "prefix" sorts #9812

Closed
suremarc opened this issue Mar 26, 2024 · 0 comments · Fixed by #9813
Labels
enhancement New feature or request

suremarc commented Mar 26, 2024

Is your feature request related to a problem or challenge?

I have a table partitioned by (date), with a file sort order of (ticker, timestamp). Importantly, date is known to be equivalent to CAST(timestamp AS DATE). After #9612, filtering on ticker eliminates it from the sort. So, in the presence of a constant ticker filter, sorting on (date, timestamp) is equivalent to sorting on timestamp alone.
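To make the claim concrete, here is a small sketch in plain Python (not DataFusion code). The toy rows, the helper `is_sorted`, and the variable names are all made up for illustration. It builds data ordered by (date, ticker, timestamp), where date is derived from timestamp, and checks that the timestamps come out in order once ticker is pinned to a single value.

```python
from datetime import datetime

# Hypothetical toy data: 3 days x 3 tickers x 3 hourly timestamps.
rows = []
for day in (25, 26, 27):
    for ticker in ("A", "B", "C"):
        for hour in (9, 10, 11):
            ts = datetime(2024, 3, day, hour)
            # date is always CAST(timestamp AS DATE), as in the issue.
            rows.append((ts.date(), ticker, ts))

# The table's known ordering: (date, ticker, timestamp).
rows.sort()

def is_sorted(values):
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))

# Without a ticker filter, timestamps are NOT globally ordered,
# because ticker sits between date and timestamp in the sort key.
all_timestamps = [r[2] for r in rows]

# With ticker pinned to 'A', the remaining rows are ordered by timestamp.
timestamps_a = [r[2] for r in rows if r[1] == "A"]

print(is_sorted(all_timestamps), is_sorted(timestamps_a))
```

The unfiltered list fails the check because, within one date, rows for ticker B restart at an earlier hour than the last row for ticker A. The filtered list passes, which is why an extra sort on timestamp would be redundant in that case.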

Describe the solution you'd like

Given an ordering (date, ticker, timestamp), I would like DataFusion not to require a sort when I query for data with a constant ticker, sorted by timestamp. For example:

CREATE UNBOUNDED EXTERNAL TABLE data (
    "date"      DATE, 
    "ticker"    VARCHAR, 
    "timestamp" TIMESTAMP,
) STORED AS CSV
WITH ORDER ("date", "ticker", "timestamp")
LOCATION './a.parquet';

EXPLAIN SELECT * FROM data
WHERE ticker = 'A' AND date = CAST(timestamp AS DATE)
ORDER BY "timestamp";

should not require a sort.

This is a somewhat contrived example. In reality, a custom physical plan would insert date = CAST(timestamp AS DATE) into the equivalence properties instead of adding it as a filter predicate.
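As a rough illustration of the reasoning being requested, the following Python sketch models an ordering check that takes such known facts into account. It is a toy model only, not DataFusion's implementation or API. The functions `normalize_ordering` and `satisfies`, and the `monotonic_of` mapping, are hypothetical names introduced here.

```python
def normalize_ordering(ordering, constants, monotonic_of):
    """Drop sort keys that cannot affect the row order (toy model).

    ordering:     column names, most significant first
    constants:    columns pinned to one value (e.g. by ticker = 'A')
    monotonic_of: maps a column to the column it is a non-decreasing
                  function of (e.g. {"date": "timestamp"})
    """
    # Step 1: a constant column never breaks a tie, so remove it.
    keys = [c for c in ordering if c not in constants]

    # Step 2: a key directly followed by the column it is derived from
    # is redundant; ordering by the source column already implies it.
    result = []
    for i, col in enumerate(keys):
        following = keys[i + 1] if i + 1 < len(keys) else None
        if following is not None and monotonic_of.get(col) == following:
            continue
        result.append(col)
    return result


def satisfies(existing, required, constants, monotonic_of):
    """Return True if the existing ordering already provides `required`."""
    normalized = normalize_ordering(existing, constants, monotonic_of)
    wanted = [c for c in required if c not in constants]
    return normalized[: len(wanted)] == wanted


existing = ["date", "ticker", "timestamp"]
facts = {"date": "timestamp"}  # assumed: date = CAST(timestamp AS DATE)

# With ticker constant, ORDER BY timestamp needs no extra sort.
print(satisfies(existing, ["timestamp"], {"ticker"}, facts))
# Without the constant ticker, the sort is still required.
print(satisfies(existing, ["timestamp"], set(), facts))
```

The second step is only safe when the derived column is immediately followed by its source column after constants are removed. If a non-constant column such as ticker sits between them, the prefix cannot be dropped.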

Describe alternatives you've considered

One possibility would be to make the time column HHMMSS -- then we have a somewhat normalized representation where date and time are disjoint parts of a timestamp. However, this is less flexible and basically requires us to expose the partitioning to users querying the table. It also forces us to forgo the Arrow-native timestamp type.

Another possibility is to create a physical optimizer and a custom physical plan that overrides the sort ordering, but I would prefer that this knowledge be baked directly into DataFusion.

Additional context

I know I've been making a lot of feature requests for sort-based optimizations lately. Hopefully this is the last one for a while.
