Use file statistics in query planning to avoid sorting when unnecessary #7490

suremarc · 2023-09-07T00:53:42Z

Is your feature request related to a problem or challenge?

Related issue: #6672

DataFusion currently cannot avoid sorts when there are more files than there are target_partitions.

When querying data from an object store, DataFusion will attempt to group files into file groups internally, which are executed concurrently with one another. However, if any file group contains multiple files, the file_sort_order of the files cannot be maintained at present.

To illustrate, suppose we had a table trades consisting of the following files in S3:

❯ mc ls s3/trades/year=2023/month=01 -r
[2023-05-23 15:13:18 CDT] 1.2GiB STANDARD day=03/trades-2023-01-03.parquet
[2023-05-23 15:12:36 CDT] 1.2GiB STANDARD day=04/trades-2023-01-04.parquet
[2023-05-23 15:12:00 CDT] 1.1GiB STANDARD day=05/trades-2023-01-05.parquet
[2023-05-23 15:13:21 CDT] 1.2GiB STANDARD day=06/trades-2023-01-06.parquet
[2023-05-23 15:12:40 CDT] 1.2GiB STANDARD day=09/trades-2023-01-09.parquet

and their sort order was timestamp ASC. If our target_partitions is set to 3, the following query:

SELECT * 
FROM trades 
ORDER BY timestamp ASC
LIMIT 50000;

results in the following suboptimal plan:

GlobalLimitExec: skip=0, fetch=50000
  SortPreservingMergeExec: [timestamp@0 ASC NULLS LAST], fetch=50000
    SortExec: fetch=50000, expr=[timestamp@0 ASC NULLS LAST]
      ParquetExec: file_groups={3 groups: [[year=2023/month=01/day=05/trades-2023-01-05.parquet, year=2023/month=01/day=06/trades-2023-01-06.parquet], [year=2023/month=01/day=03/trades-2023-01-03.parquet, year=2023/month=01/day=09/trades-2023-01-09.parquet], [year=2023/month=01/day=04/trades-2023-01-04.parquet]]}, projection=[ticker, timestamp, participant_timestamp, trf_timestamp, sequence_number, conditions, id, price, size, correction, exchange, trf, tape, year, month, day]

DataFusion has decided that this plan needs a sort because some file groups contain multiple files. (The sort could be avoided if there were at most 3 files, so that each file group held a single file.) However, in this case we know that trades-2023-01-03.parquet could be streamed before trades-2023-01-04.parquet in timestamp order, because every timestamp in trades-2023-01-03.parquet precedes every timestamp in trades-2023-01-04.parquet. In fact, the file groups in the physical plan shown above are already ordered, but DataFusion does not know this and currently has no way to know it.

Describe the solution you'd like

Essentially, DataFusion should be able to detect which files are non-overlapping, and use this to intelligently distribute files into file groups in such a way that still outputs data in order. Below I offer one possible path to doing so, which I believe should be minimally invasive.
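One way to picture the distribution step is as a greedy interval-partitioning pass over per-file min/max values. The sketch below is illustrative only and is not DataFusion code: `FileRange`, its fields, and `distribute_in_order` are hypothetical names, timestamps are simplified to `i64`, and each file is assumed to be internally sorted on the sort column.

```rust
/// Hypothetical per-file summary of the sort column (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct FileRange {
    pub path: String,
    pub min_ts: i64,
    pub max_ts: i64,
}

/// Tries to place `files` into at most `target_partitions` groups so that,
/// within each group, every file's max is <= the next file's min.
/// Returns `None` when the files overlap too much for that many groups,
/// in which case a sort would still be required.
pub fn distribute_in_order(
    mut files: Vec<FileRange>,
    target_partitions: usize,
) -> Option<Vec<Vec<FileRange>>> {
    // Visit files in ascending order of their minimum value.
    files.sort_by_key(|f| (f.min_ts, f.max_ts));

    let mut groups: Vec<Vec<FileRange>> = vec![Vec::new(); target_partitions];
    for file in files {
        // A group is compatible if it is empty or its last file ends
        // at or before this file starts. Among compatible groups, pick the
        // one with the fewest files to keep the groups balanced.
        let slot = groups
            .iter_mut()
            .filter(|g| g.last().map_or(true, |last| last.max_ts <= file.min_ts))
            .min_by_key(|g| g.len())?;
        slot.push(file);
    }

    // Drop groups that ended up unused.
    groups.retain(|g| !g.is_empty());
    Some(groups)
}
```

If each of the five example files is given a disjoint range and target_partitions is 3, this sketch produces the groups [day=03, day=06], [day=04, day=09], and [day=05], each of which can be streamed in timestamp order. A `None` result stands for the case where the planner would have to fall back to sorting.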

At a minimum, PartitionedFile should have an additional optional field, statistics, which contains a Statistics object with the min/max statistics for that file. FileScanConfig::project should be changed to detect when the files within each file group are arranged in sort order. Lastly, a physical optimizer rule that redistributes files across file groups so that each group is ordered may be necessary to take advantage of this in some cases.
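The detection step described above could look roughly like the following. This is a simplified sketch under stated assumptions, not the real API: `FileWithStats` is a hypothetical stand-in for PartitionedFile with the proposed optional field, `MinMax` stands in for the Statistics object reduced to a single `i64` sort column, and `group_is_ordered` and `can_skip_sort` are invented names.

```rust
/// Hypothetical min/max summary of the sort column for one file.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Simplified stand-in for a file entry carrying the proposed optional field.
#[derive(Debug, Clone)]
pub struct FileWithStats {
    pub path: String,
    pub statistics: Option<MinMax>,
}

/// Returns true if the files in one group can be read back to back in
/// ascending order. Missing statistics are treated conservatively as
/// "order unknown", except for groups of zero or one file.
pub fn group_is_ordered(group: &[FileWithStats]) -> bool {
    if group.len() <= 1 {
        return true;
    }
    // Collecting into Option<Vec<_>> yields None if any file lacks statistics.
    let stats: Option<Vec<MinMax>> = group.iter().map(|f| f.statistics).collect();
    match stats {
        Some(s) => s.windows(2).all(|w| w[0].max <= w[1].min),
        None => false,
    }
}

/// The scan may advertise its output ordering only if every group is ordered.
pub fn can_skip_sort(file_groups: &[Vec<FileWithStats>]) -> bool {
    file_groups.iter().all(|g| group_is_ordered(g))
}
```

Because the field is optional, a table provider that cannot supply statistics simply leaves it empty and the plan keeps its sort, so existing behavior is unchanged in that case.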

This does not solve the issue of how to feed file-level statistics into DataFusion, but users may add extensions to DataFusion that do so -- for example a custom TableProvider could do this. However, it should be feasible to integrate this feature into ListingTable. In fact, when collect_statistics is enabled, the ListingTable already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.
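To see why the rolled-up statistic is not enough, consider the sketch below. It is not ListingTable's actual code: `MinMax` and `roll_up` are hypothetical names, and the column is simplified to `i64`. It only shows that merging per-file ranges into one range loses the information needed to tell overlapping files from non-overlapping ones.

```rust
/// Hypothetical min/max summary of one column (not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Merges per-file ranges into a single table-level range.
/// Returns `None` when there are no files.
pub fn roll_up(per_file: &[MinMax]) -> Option<MinMax> {
    per_file.iter().copied().reduce(|a, b| MinMax {
        min: a.min.min(b.min),
        max: a.max.max(b.max),
    })
}
```

For example, two disjoint files covering 0..=10 and 11..=20 roll up to the same 0..=20 range as two files that both cover 0..=20, so only the per-file values can justify skipping the sort.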

Describe alternatives you've considered

At my company, we created a custom FileFormat implementation that outputs a wrapped ParquetExec with the output_ordering() method overridden and the files redistributed to be in order. However, instead of using statistics, it relies on hints from user-provided configuration, and it does not seem to be in the spirit of what FileFormat is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.

Additional context

No response


alamb · 2023-09-18T22:37:18Z

> In fact, when collect_statistics is enabled, the ListingTable already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.

FYI @Ted-Jiang has added some ability to reuse the statistics: #7570

> Describe alternatives you've considered
> At my company, we created a custom FileFormat implementation that outputs a wrapped ParquetExec with the output_ordering() method overridden, and the files redistributed to be in-order.

FWIW we implemented something similar in IOx https://github.com/influxdata/influxdb_iox

> However, instead of using statistics, it relies on hints from configuration provided by the user, plus this does not particularly seem in the spirit of what FileFormat is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.

I agree it would be very nice to have a native way built into DataFusion.

The solution you describe seems very reasonable to me. Thank you for writing it up.

matthewmturner · 2024-03-11T23:47:03Z

I can look into this.

@alamb since this issue was created, are you aware of any completed work that would impact this issue or the proposed solution?

suremarc · 2024-03-13T03:18:36Z

@matthewmturner I've done some work on this already, though it hasn't been touched in a while. I think I might as well polish off what I have and commit to getting a PR in this week.

suremarc added the enhancement New feature or request label Sep 7, 2023

alamb mentioned this issue Nov 15, 2023

Epic: Statistics improvements #8227

Open

19 tasks

suremarc mentioned this issue Mar 13, 2024

feat: Determine ordering of file groups #9593

Merged

alamb changed the title ~~Use file statistics in query planning~~ Use file statistics in query planning to avoid sorting when unnecessary Mar 22, 2024

alamb mentioned this issue Apr 30, 2024

[Epic] A Collection of Sort Based Optimizations #10313

Open

10 tasks

alamb closed this as completed in #9593 May 1, 2024

alamb mentioned this issue May 13, 2024

feat: Add ProgressiveEval operator #10490

Closed

alamb mentioned this issue Jun 6, 2024

Row groups are read out of order or with completely different values #10572

Open
