Use file statistics in query planning to avoid sorting when unnecessary #7490
Comments
FYI @Ted-Jiang has added some ability to reuse the statistics: #7570
FWIW we implemented something similar in IOx https://github.com/influxdata/influxdb_iox
I agree it would be very nice to have a native way built into DataFusion. The solution you describe seems very reasonable to me. Thank you for writing it up.
I can look into this. @alamb, since this issue was created, are you aware of any work that has been completed that would impact this or the proposed solution?
@matthewmturner I've done some work on this already, though it hasn't been touched in a while. I think I might as well polish off what I have and commit to getting a PR in this week.
Is your feature request related to a problem or challenge?
Related issue: #6672
DataFusion currently cannot avoid sorts when there are more files than there are `target_partitions`.

When querying data from an object store, DataFusion will attempt to group files into file groups internally, which are executed concurrently with one another. However, if any file group contains multiple files, the `file_sort_order` of the files cannot be maintained at present.

To illustrate, suppose we had a table `trades` comprised of several Parquet files in S3 whose sort order was `timestamp ASC`. If our `target_partitions` is set to `3`, a query that orders by `timestamp` results in a suboptimal plan containing a sort, as sketched below.
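Below is a rough, illustrative sketch of the scenario in Rust. The specific file names (other than `trades-2023-01-03.parquet` and `trades-2023-01-04.parquet`, which are mentioned later), the grouping arithmetic, and the plan shape in the comments are assumptions for illustration, not actual DataFusion output.

```rust
fn main() {
    // Assumed layout: one file per day, each internally sorted by
    // `timestamp ASC`, with no overlap between days.
    let files = [
        "trades-2023-01-01.parquet",
        "trades-2023-01-02.parquet",
        "trades-2023-01-03.parquet",
        "trades-2023-01-04.parquet",
    ];
    let target_partitions = 3;

    // With more files than target_partitions, at least one file group ends up
    // holding several files, so the scan no longer advertises a useful ordering.
    let chunk = (files.len() + target_partitions - 1) / target_partitions;
    let groups: Vec<&[&str]> = files.chunks(chunk).collect();
    println!("file groups: {groups:#?}");

    // A query such as `SELECT * FROM trades ORDER BY timestamp` then produces
    // a plan roughly of this shape (abbreviated, not exact EXPLAIN output):
    //
    //   SortPreservingMergeExec: [timestamp ASC]
    //     SortExec: expr=[timestamp ASC]
    //       ParquetExec: file_groups={...}
    //
    // i.e. DataFusion re-sorts data that is already globally ordered.
}
```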
DataFusion has decided that this plan needs a sort, because some file groups have multiple files. (This could be avoided if there was only 1 or even up to 3 files.) However, in this case we know that `trades-2023-01-03.parquet` could be streamed before `trades-2023-01-04.parquet` in timestamp order, because every timestamp in `trades-2023-01-03.parquet` precedes every timestamp in `trades-2023-01-04.parquet`. In fact, the physical plan shown above is ordered -- but DataFusion does not know this and currently has no way to know this.

Describe the solution you'd like
Essentially, DataFusion should be able to detect which files are non-overlapping, and use this to intelligently distribute files into file groups in such a way that still outputs data in order. Below I offer one possible path to doing so, which I believe should be minimally invasive.
At a minimum, `PartitionedFile` should have an additional optional field, `statistics`, which contains a `Statistics` object with the min/max statistics for that file. `FileScanConfig::project` should be changed to detect when files within a file group are distributed in order. Lastly, a physical optimizer to redistribute file groups to be ordered may be necessary to take advantage of this in some cases.

This does not solve the issue of how to feed file-level statistics into DataFusion, but users may add extensions to DataFusion that do so -- for example, a custom `TableProvider` could do this. However, it should be feasible to integrate this feature into `ListingTable`. In fact, when `collect_statistics` is enabled, the `ListingTable` already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.
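To make the idea concrete, here is a minimal sketch of the overlap check and grouping this implies, using stand-in types rather than DataFusion's actual `PartitionedFile`/`Statistics` (the `FileWithStats` struct, its fields, and `plan_ordered_groups` are hypothetical names): files are ordered by the minimum value of the sort column, checked for overlap, and packed into contiguous groups whose concatenated output stays sorted.

```rust
/// Stand-in for a partitioned file carrying min/max statistics for the sort
/// column; hypothetical, not DataFusion's `PartitionedFile`/`Statistics`.
#[derive(Debug, Clone)]
struct FileWithStats {
    path: String,
    min: i64, // e.g. min(timestamp) within the file
    max: i64, // e.g. max(timestamp) within the file
}

/// Try to pack files into at most `target_partitions` contiguous groups such
/// that concatenating each group preserves the sort order. Returns `None` if
/// any two files overlap, in which case a sort is still required (today's
/// behavior). Assumes `target_partitions >= 1`.
fn plan_ordered_groups(
    mut files: Vec<FileWithStats>,
    target_partitions: usize,
) -> Option<Vec<Vec<FileWithStats>>> {
    // Order files by the minimum value of the sort column.
    files.sort_by_key(|f| f.min);

    // If a later file starts before an earlier one ends, their ranges overlap
    // and simple concatenation cannot guarantee ordered output.
    if files.windows(2).any(|pair| pair[1].min < pair[0].max) {
        return None;
    }

    // Split the ordered run into contiguous chunks, one per partition. Each
    // chunk streams its files back-to-back and stays sorted; a sort-preserving
    // merge across partitions then keeps the global order without a full sort.
    let chunk = ((files.len() + target_partitions - 1) / target_partitions).max(1);
    Some(files.chunks(chunk).map(|c| c.to_vec()).collect())
}

fn main() {
    // Assumed, non-overlapping daily ranges encoded as integers for brevity.
    let files = vec![
        FileWithStats { path: "trades-2023-01-03.parquet".into(), min: 300, max: 399 },
        FileWithStats { path: "trades-2023-01-04.parquet".into(), min: 400, max: 499 },
        FileWithStats { path: "trades-2023-01-01.parquet".into(), min: 100, max: 199 },
        FileWithStats { path: "trades-2023-01-02.parquet".into(), min: 200, max: 299 },
    ];
    match plan_ordered_groups(files, 3) {
        Some(groups) => println!("sort can be elided; groups: {groups:#?}"),
        None => println!("files overlap; keep the sort"),
    }
}
```

In an actual implementation this decision would presumably live wherever file groups are formed, or in a physical optimizer pass, as described above.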
Describe alternatives you've considered
At my company, we created a custom `FileFormat` implementation that outputs a wrapped `ParquetExec` with the `output_ordering()` method overridden, and the files redistributed to be in order. However, instead of using statistics, it relies on hints from configuration provided by the user; plus, this does not particularly seem in the spirit of what `FileFormat` is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.
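For contrast, a rough sketch of what the hint-based workaround amounts to (the `redistribute_by_hint` function and the "file names sort like the data" hint are hypothetical; the real workaround wraps `ParquetExec` inside a custom `FileFormat`): files are ordered by a user-supplied convention rather than by statistics, then packed into contiguous groups.

```rust
// Hypothetical sketch of hint-based redistribution: the user asserts that
// lexicographic file-name order matches the data's `timestamp ASC` order, so
// no statistics are consulted -- exactly the limitation noted above.
fn redistribute_by_hint(mut files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    files.sort(); // the hint: name order == data order
    let chunk = ((files.len() + target_partitions - 1) / target_partitions).max(1);
    files.chunks(chunk).map(|c| c.to_vec()).collect()
}

fn main() {
    let groups = redistribute_by_hint(
        vec![
            "trades-2023-01-02.parquet".into(),
            "trades-2023-01-04.parquet".into(),
            "trades-2023-01-01.parquet".into(),
            "trades-2023-01-03.parquet".into(),
        ],
        3,
    );
    println!("{groups:#?}");
}
```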
Additional context
No response