Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use file statistics in query planning to avoid sorting when unecessary #7490

Closed
Tracked by #10313
suremarc opened this issue Sep 7, 2023 · 3 comments · Fixed by #9593
Closed
Tracked by #10313

Use file statistics in query planning to avoid sorting when unecessary #7490

suremarc opened this issue Sep 7, 2023 · 3 comments · Fixed by #9593
Labels
enhancement New feature or request

Comments

@suremarc
Copy link
Contributor

suremarc commented Sep 7, 2023

Is your feature request related to a problem or challenge?

Related issue: #6672

DataFusion currently cannot avoid sorts when there are more files than there are target_partitions.

When querying data from an object store, DataFusion will attempt to group files into file groups internally, which are executed concurrently with one another. However, if any file group contains multiple files, the file_sort_order of the files cannot be maintained at present.

To illustrate, suppose we had a table trades comprised of the following files in S3:

❯ mc ls s3/trades/year=2023/month=01 -r
[2023-05-23 15:13:18 CDT] 1.2GiB STANDARD day=03/trades-2023-01-03.parquet
[2023-05-23 15:12:36 CDT] 1.2GiB STANDARD day=04/trades-2023-01-04.parquet
[2023-05-23 15:12:00 CDT] 1.1GiB STANDARD day=05/trades-2023-01-05.parquet
[2023-05-23 15:13:21 CDT] 1.2GiB STANDARD day=06/trades-2023-01-06.parquet
[2023-05-23 15:12:40 CDT] 1.2GiB STANDARD day=09/trades-2023-01-09.parquet

and their sort order was timestamp ASC. If our target_partitions is set to 3, the following query:

SELECT * 
FROM trades 
ORDER BY timestamp ASC
LIMIT 50000;

results in the following suboptimal plan:

GlobalLimitExec: skip=0, fetch=50000
  SortPreservingMergeExec: [timestamp@0 ASC NULLS LAST], fetch=50000
    SortExec: fetch=50000, expr=[timestamp@0 ASC NULLS LAST]
      ParquetExec: file_groups={3 groups: [[year=2023/month=01/day=05/trades-2023-01-05.parquet, year=2023/month=01/day=06/trades-2023-01-06.parquet], [year=2023/month=01/day=03/trades-2023-01-03.parquet, year=2023/month=01/day=09/trades-2023-01-09.parquet], [year=2023/month=01/day=04/trades-2023-01-04.parquet]]}, projection=[ticker, timestamp, participant_timestamp, trf_timestamp, sequence_number, conditions, id, price, size, correction, exchange, trf, tape, year, month, day]

DataFusion has decided that this plan needs a sort, because some file groups have multiple files. (This could be avoided if there was only 1 or even up to 3 files.) However, in this case we know that trades-2023-01-03.parquet could be streamed before trades-2023-01-04.parquet in timestamp order, because every timestamp in trades-2023-01-03.parquet precedes every timestamp in trades-2023-01-04.parquet. In fact, the physical plan shown above is ordered -- but DataFusion does not know this and currently has no way to know this.

Describe the solution you'd like

Essentially, DataFusion should be able to detect which files are non-overlapping, and use this to intelligently distribute files into file groups in such a way that still outputs data in order. Below I offer one possible path to doing so, which I believe should be minimally invasive.

At a minimum, PartitionedFile should have an additional optional field, statistics, which contains a Statistics object with the min/max statistics for that file. FileScanConfig::project should be changed to detect when files within a file group are distributed in order. Lastly, a physical optimizer to redistribute file groups to be ordered may be necessary to take advantage of this in some cases.

This does not solve the issue of how to feed file-level statistics into DataFusion, but users may add extensions to DataFusion that do so -- for example a custom TableProvider could do this. However, it should be feasible to integrate this feature into ListingTable. In fact, when collect_statistics is enabled, the ListingTable already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.

Describe alternatives you've considered

At my company, we created a custom FileFormat implementation that outputs a wrapped ParquetExec with the output_ordering() method overrided, and the files redistributed to be in-order. However, instead of using statistics, it relies on hints from configuration provided by the user, plus this does not particularly seem in the spirit of what FileFormat is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.

Additional context

No response

@suremarc suremarc added the enhancement New feature or request label Sep 7, 2023
@alamb
Copy link
Contributor

alamb commented Sep 18, 2023

In fact, when collect_statistics is enabled, the ListingTable already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.

FYI @Ted-Jiang has added some ability to reuse the statistics: #7570

Describe alternatives you've considered
At my company, we created a custom FileFormat implementation that outputs a wrapped ParquetExec with the output_ordering() method overrided, and the files redistributed to be in-order.

FWIW we implemented something similar in IOx https://github.com/influxdata/influxdb_iox

However, instead of using statistics, it relies on hints from configuration provided by the user, plus this does not particularly seem in the spirit of what FileFormat is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.

I agree it would be very nice to have a native way built into DataFusion

The solution you describe seems very reasonable to me. Thank you for writing it up

@matthewmturner
Copy link
Contributor

I can look into this.

@alamb since this issue has been created are you aware of any work that has been completed that would impact this / the proposed solution?

@suremarc
Copy link
Contributor Author

@matthewmturner I've done some work on this already, though it hasn't been touched in awhile. I think I might as well polish off what I have and commit to getting a PR in this week.

@alamb alamb changed the title Use file statistics in query planning Use file statistics in query planning to avoid sorting when unecessary Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants