Is your feature request related to a problem or challenge?
I have a table partitioned by (date) and with a file sort order of (ticker, timestamp). Importantly, it is known that date is equivalent to CAST(timestamp AS DATE). After #9612, filtering on ticker eliminates it from the sort. So, in the presence of a constant ticker filter, the data is effectively ordered by (date, timestamp), and because date is a monotonic function of timestamp, sorting on (date, timestamp) is equivalent to sorting on timestamp.
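As a quick sanity check of the claim above, here is a minimal sketch in plain Python (not DataFusion). The helper names `make_rows` and `is_sorted_by` and the generated data are hypothetical, made up purely for illustration; the only assumption carried over from the issue is that date equals the date part of timestamp.

```python
from datetime import datetime, timedelta
import random


def make_rows(n, seed=0):
    """Build hypothetical (date, ticker, timestamp) rows where date == timestamp.date()."""
    rng = random.Random(seed)
    base = datetime(2024, 1, 1)
    rows = []
    for _ in range(n):
        ts = base + timedelta(seconds=rng.randrange(5 * 24 * 3600))
        rows.append((ts.date(), rng.choice(["A", "B"]), ts))
    return rows


def is_sorted_by(rows, key):
    """Return True if rows are in non-decreasing order under the given key."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


rows = make_rows(1000)

# Lay the data out in the table's declared order: (date, ticker, timestamp).
table = sorted(rows, key=lambda r: (r[0], r[1], r[2]))

# Apply a constant ticker filter without reordering the surviving rows.
filtered = [r for r in table if r[1] == "A"]

# With ticker constant, the remaining order is (date, timestamp). Since date is a
# monotonic function of timestamp, that order is also an order on timestamp alone.
already_sorted = is_sorted_by(filtered, key=lambda r: r[2])
print(already_sorted)
```

The check only exercises the ordering argument on synthetic rows; it says nothing about how DataFusion's equivalence properties would need to represent the monotonic relationship.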
Describe the solution you'd like
Given an ordering (date, ticker, timestamp), I would like for DataFusion to not require a sort when I query for data with a constant ticker, sorted by timestamp. So something like this:
CREATE UNBOUNDED EXTERNAL TABLE data (
    "date" DATE,
    "ticker" VARCHAR,
    "timestamp" TIMESTAMP
) STORED AS CSV
WITH ORDER ("date", "ticker", "timestamp")
LOCATION './a.parquet';
EXPLAIN SELECT * FROM data
WHERE "ticker" = 'A' AND "date" = CAST("timestamp" AS DATE)
ORDER BY "timestamp";
should not require a sort.
This is a somewhat contrived example; in reality, there would be a custom physical plan that inserts date = CAST(timestamp AS DATE) into the equivalence properties, instead of adding it as a filter predicate.
Describe alternatives you've considered
One possibility would be to make the time column HHMMSS -- then we have a somewhat normalized representation where date and time are disjoint parts of a timestamp. However, this is less flexible and basically requires us to expose the partitioning to users querying the table. It also makes us forgo the use of the arrow-native timestamp type.
Another possibility is to create a physical optimizer and a custom physical plan that overrides the sort ordering, but I would prefer if this knowledge was baked directly into DataFusion.
Additional context
I know I've been making a lot of feature requests for sort-based optimizations lately; hopefully this is the last one for a while.