
feat: Use file cache to list partitions if available #9655

Closed
wants to merge 1 commit

Conversation

henrifroese
When discovering partitions for pruning, if no partition columns are specified, we call `list_all_files`, which uses the `list_files_cache` if it exists and is filled.

If partition columns are specified, before this change we recursively list files in the object store to discover partitions. That happens on every request, and listing files in e.g. AWS S3 can be slow (especially with 100k+ files).

With this change, if the `list_files_cache` exists and is filled, we get all files from there and use them to discover partitions.

Closes #9654.
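The approach can be sketched as follows. This is a minimal illustration using plain path strings; `partitions_from_cache` and its inputs are hypothetical stand-ins for the cached `ObjectMeta` entries, not DataFusion's actual API:

```rust
use std::collections::BTreeSet;

// Hypothetical sketch: derive hive-style partitions from an already-cached
// file listing instead of re-listing the object store on every request.
// `cached_paths` stands in for the entries a `list_files_cache` would hold.
fn partitions_from_cache(cached_paths: &[&str], partition_cols: &[&str]) -> Vec<String> {
    let mut parts = BTreeSet::new();
    for path in cached_paths {
        // Keep only the segments that look like `col=value` for the
        // requested partition columns, in path order.
        let segs: Vec<&str> = path
            .split('/')
            .filter(|s| {
                partition_cols
                    .iter()
                    .any(|c| s.starts_with(&format!("{c}=")))
            })
            .collect();
        // Only count paths that carry a value for every partition column.
        if segs.len() == partition_cols.len() {
            parts.insert(segs.join("/"));
        }
    }
    parts.into_iter().collect()
}

fn main() {
    let cached = [
        "table/year=2023/month=01/f1.parquet",
        "table/year=2023/month=02/f2.parquet",
        "table/year=2024/month=01/f3.parquet",
        "table/year=2023/month=01/f4.parquet",
    ];
    let partitions = partitions_from_cache(&cached, &["year", "month"]);
    assert_eq!(
        partitions,
        vec![
            "year=2023/month=01".to_string(),
            "year=2023/month=02".to_string(),
            "year=2024/month=01".to_string(),
        ]
    );
}
```

Each cached path is inspected once, so the partition set falls out of a single pass over the cache rather than repeated object-store round trips.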

@github-actions github-actions bot added the core Core DataFusion crate label Mar 17, 2024
@henrifroese
Author

Note I'm an absolute rust noob :)

@suremarc (Contributor) left a comment

I'm a bit worried about the performance of this approach; I think with enough files it might even be slower than just listing from S3 (which is quite slow, as you point out) due to the O(N^2) growth, which would not make it a strict improvement.

On the other hand, it would be great if we had benchmarks exercising `ListingTable` with a large number of files -- 100s of thousands, as you mentioned in the issue. My team has found this to be a pain point with DataFusion when trying to keep planning times under 10 ms.
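Illustratively, the quadratic behavior comes from answering each delimiter-style listing by scanning the entire cached list. A minimal sketch of that pattern (the function and inputs here are hypothetical stand-ins, not DataFusion's or `object_store`'s actual code), assuming hive-style paths:

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for a delimiter-style listing over a cached Vec of
// paths: every call scans all N entries, so calling it once per partition
// directory (of which there can be ~N) yields O(N^2) work overall.
fn list_with_delimiter(objects: &[String], prefix: &str) -> Vec<String> {
    let mut common_prefixes = BTreeSet::new();
    for path in objects {
        // `prefix` is assumed to end with '/' so only whole segments match.
        if let Some(rest) = path.strip_prefix(prefix) {
            if let Some(seg) = rest.split('/').next() {
                if !seg.is_empty() {
                    common_prefixes.insert(seg.to_string());
                }
            }
        }
    }
    common_prefixes.into_iter().collect()
}

fn main() {
    let objects = vec![
        "t/year=2023/m=01/a.parquet".to_string(),
        "t/year=2024/m=02/b.parquet".to_string(),
        "t/year=2023/m=02/c.parquet".to_string(),
    ];
    // One O(N) scan per listed prefix; full partition discovery repeats
    // this for every directory level of every partition.
    assert_eq!(
        list_with_delimiter(&objects, "t/"),
        vec!["year=2023".to_string(), "year=2024".to_string()]
    );
}
```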

```
@@ -168,24 +178,154 @@ struct Partition {
    files: Option<Vec<ObjectMeta>>,
}

#[derive(Debug, Default)]
struct ObjectMetaLister {
    objects: Arc<Vec<ObjectMeta>>,
```
Personally I would recommend a trie for this use case, as `list_with_delimiter` is called individually on every single partition, which means this code is going to perform O(N^2) in the worst case. I used `sequence_trie` for this in my own object store cache implementation.

IMO it would be ideal if the `ListFilesCache` itself returned a trie instead of a `Vec` -- then no conversion would need to happen at all.
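The trie idea can be sketched as follows. This uses a minimal hand-rolled node rather than the `sequence_trie` crate's actual API; the point is that after one O(N) build, each delimiter-style listing becomes a cheap walk plus a child enumeration instead of a full scan:

```rust
use std::collections::BTreeMap;

// Illustrative trie node keyed by path segment; a minimal stand-in for a
// trie-backed listing cache, not the `sequence_trie` crate's API.
#[derive(Default, Debug)]
struct PathTrie {
    children: BTreeMap<String, PathTrie>,
}

impl PathTrie {
    // Insert a path in one pass over its segments; building the whole
    // trie from N cached paths is a single O(N) pass overall.
    fn insert(&mut self, path: &str) {
        let mut node = self;
        for seg in path.split('/').filter(|s| !s.is_empty()) {
            node = node.children.entry(seg.to_string()).or_default();
        }
    }

    // A `list_with_delimiter`-style lookup becomes: walk down to the
    // prefix, then read off the immediate children.
    fn list_children(&self, prefix: &[&str]) -> Vec<String> {
        let mut node = self;
        for seg in prefix {
            match node.children.get(*seg) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        node.children.keys().cloned().collect()
    }
}

fn main() {
    let mut trie = PathTrie::default();
    for p in [
        "table/year=2023/month=01/f1.parquet",
        "table/year=2023/month=02/f2.parquet",
        "table/year=2024/month=01/f3.parquet",
    ] {
        trie.insert(p);
    }
    assert_eq!(
        trie.list_children(&["table"]),
        vec!["year=2023".to_string(), "year=2024".to_string()]
    );
    assert_eq!(
        trie.list_children(&["table", "year=2023"]),
        vec!["month=01".to_string(), "month=02".to_string()]
    );
}
```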

@alamb
Contributor

alamb commented Apr 6, 2024

Thank you @henrifroese for this contribution. It is an excellent issue to identify.

An update here: I think a more holistic review of how `ListingTable` works and what we want out of it might be in order -- I started #9964 to discuss.


github-actions bot commented Jun 6, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment, or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 6, 2024
@github-actions github-actions bot closed this Jun 13, 2024
Successfully merging this pull request may close these issues.

Partitioned object store lists all files on every query when using hive-partitioned parquet files