feat: Use file cache to list partitions if available #9655

henrifroese · 2024-03-17T18:16:14Z

When discovering partitions for pruning, if no partition columns are specified, we call `list_all_files`, which uses the `list_files_cache` if it exists and is populated.

If partition columns are specified, then before this change we recursively list files in the object store to discover partitions. That happens on every request, and listing files in, e.g., AWS S3 can be slow (especially with 100k+ files).

With this change, if the `list_files_cache` exists and is populated, we take all files from the cache and use them to discover partitions.

Closes #9654.
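
To illustrate the idea, here is a minimal, standard-library-only sketch of deriving hive-style partitions from an already cached flat file list instead of listing the object store again. It is not the code in this PR: `parse_partition_values` and `discover_partitions` are hypothetical helpers, and plain `&str` paths stand in for the `ObjectMeta` entries the cache actually holds.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: extract the partition values of `file_path`
/// relative to `table_prefix`, expecting one `col=value` directory
/// per partition column, in order, followed by a file name.
fn parse_partition_values<'a>(
    table_prefix: &str,
    file_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    let prefix = table_prefix.trim_end_matches('/');
    // Only accept paths that sit strictly below the table prefix.
    let rest = if prefix.is_empty() {
        file_path
    } else {
        file_path.strip_prefix(prefix)?.strip_prefix('/')?
    };

    let mut segments = rest.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value);
    }
    // Require a non-empty file name after the partition directories.
    segments.next().filter(|s| !s.is_empty())?;
    Some(values)
}

/// Hypothetical helper: collect the distinct partitions found in a
/// cached list of file paths with a single pass over that list.
fn discover_partitions<'a>(
    table_prefix: &str,
    cached_paths: &[&'a str],
    partition_cols: &[&str],
) -> BTreeSet<Vec<&'a str>> {
    cached_paths
        .iter()
        .copied()
        .filter_map(|path| parse_partition_values(table_prefix, path, partition_cols))
        .collect()
}
```

For example, given the cached paths `table/year=2023/month=01/a.parquet` and `table/year=2024/month=02/c.parquet`, calling `discover_partitions("table", &paths, &["year", "month"])` yields the two partitions `["2023", "01"]` and `["2024", "02"]` without touching the object store.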

henrifroese · 2024-03-17T18:17:26Z

Note: I'm an absolute Rust noob :)

suremarc

I'm a bit worried about the performance of this approach. With enough files it might even be slower than just listing from S3 (which is quite slow, as you point out) because of the O(N^2) growth, so it would not be a strict improvement.

On the other hand, it would be great if we had benchmarks exercising ListingTable with a large number of files -- hundreds of thousands, as you mentioned in the issue. My team has found this to be a pain point with DataFusion when trying to keep planning times under 10 ms.

suremarc · 2024-03-18T20:27:03Z

datafusion/core/src/datasource/listing/helpers.rs

@@ -168,24 +178,154 @@ struct Partition {
    files: Option<Vec<ObjectMeta>>,
 }

+#[derive(Debug, Default)]
+struct ObjectMetaLister {
+    objects: Arc<Vec<ObjectMeta>>,


Personally, I would recommend a trie for this use case, as `list_with_delimiter` is called individually on every single partition, which means this code is going to perform O(N^2) work in the worst case. I used `sequence_trie` for this in my own object store cache implementation.

IMO it would be ideal if the `ListFilesCache` itself returned a trie instead of a `Vec` -- then no conversion would need to happen at all.
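
A minimal sketch of the trie idea, using only the standard library. `PathTrie` and its methods are hypothetical names for illustration; this is neither the `sequence_trie` API nor the `ListFilesCache` interface. Each path segment becomes one trie level, so listing the direct children of a prefix walks only the depth of that prefix instead of scanning all N cached files once per partition.

```rust
use std::collections::BTreeMap;

/// Hypothetical path trie: one level per `/`-separated path segment.
#[derive(Debug, Default)]
struct PathTrie {
    children: BTreeMap<String, PathTrie>,
    /// True if a cached file ends exactly at this node.
    is_file: bool,
}

impl PathTrie {
    /// Insert one cached file path, creating intermediate nodes as needed.
    fn insert(&mut self, path: &str) {
        let mut node = self;
        for segment in path.split('/').filter(|s| !s.is_empty()) {
            node = node.children.entry(segment.to_string()).or_default();
        }
        node.is_file = true;
    }

    /// Walk down to the node for `prefix`, if any file lives below it.
    fn node(&self, prefix: &str) -> Option<&PathTrie> {
        let mut node = self;
        for segment in prefix.split('/').filter(|s| !s.is_empty()) {
            node = node.children.get(segment)?;
        }
        Some(node)
    }

    /// Emulates a delimiter listing: returns the names of the direct
    /// sub-directories and the direct files under `prefix`, in sorted order.
    fn list_with_delimiter(&self, prefix: &str) -> (Vec<String>, Vec<String>) {
        let mut dirs = Vec::new();
        let mut files = Vec::new();
        if let Some(node) = self.node(prefix) {
            for (name, child) in &node.children {
                if child.is_file {
                    files.push(name.clone());
                } else {
                    dirs.push(name.clone());
                }
            }
        }
        (dirs, files)
    }
}
```

After inserting every cached path once, a call such as `trie.list_with_delimiter("table/year=2023")` returns only that partition's immediate children, so descending through all partitions visits each trie node once rather than rescanning the full file list at every level.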

alamb · 2024-04-06T09:12:36Z

Thank you @henrifroese for this contribution. It is an excellent issue to identify.

Update here: I think a more holistic review of how ListingTable works and what we want out of it might be in order -- I started #9964 to discuss.

github-actions · 2024-06-06T01:47:47Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added the core Core DataFusion crate label Mar 17, 2024

henrifroese mentioned this pull request Mar 17, 2024

Partitioned object store lists all files on every query when using hive-partitioned parquet files #9654

Open

alamb mentioned this pull request Mar 18, 2024

DataFusion weekly project plan (Andrew Lamb) - March 18, 2024 #9675

Closed

7 tasks

suremarc reviewed Mar 18, 2024

alamb mentioned this pull request Mar 25, 2024

DataFusion weekly project plan (Andrew Lamb) - March 25, 2024 #9796

Closed

6 tasks

alamb mentioned this pull request Apr 1, 2024

DataFusion weekly project plan (Andrew Lamb) - April 1, 2024 #9899

Closed

7 tasks

github-actions bot added the Stale PR has not had any activity for some time label Jun 6, 2024

github-actions bot closed this Jun 13, 2024
