Create page and column statistics when a parquet file is written in parallel #7589

Closed
alamb opened this issue Sep 18, 2023 · 0 comments · Fixed by #7655
Labels
enhancement New feature or request

Comments

alamb (Contributor) commented Sep 18, 2023

Is your feature request related to a problem or challenge?

In #7562 @devinjdangelo added the (really neat) feature to write a single parquet file in parallel.

This feature is enabled by a feature flag (`allow_single_file_parallelism`) that defaults to off.

We haven't turned it on by default yet because the resulting parquet files don't have the index structures (bloom filter, column_index, and offset_index) needed for high-performance reads (see the details in this conversation: https://github.com/apache/arrow-datafusion/pull/7562/files#r1327037733).
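To illustrate why these structures matter, here is a toy sketch of the idea behind a column index: per-page min/max values let a reader skip pages that cannot match a predicate. This is not the `parquet` crate's API. `PageStats`, `build_page_stats`, and `pages_to_read` are hypothetical names made up for this example, and real pages are not fixed-size slices of an in-memory array.

```rust
/// Toy stand-in for one column index entry: the min/max of a single data page.
/// (Hypothetical type for illustration, not part of the parquet crate.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Compute min/max statistics for each fixed-size "page" of `values`.
/// Assumes `page_size > 0` (`chunks` panics on a zero chunk size).
fn build_page_stats(values: &[i64], page_size: usize) -> Vec<PageStats> {
    values
        .chunks(page_size)
        .map(|page| PageStats {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *page.iter().min().unwrap(),
            max: *page.iter().max().unwrap(),
        })
        .collect()
}

/// Return the indices of pages whose [min, max] range could contain `target`.
/// Every other page can be skipped without being decoded.
fn pages_to_read(stats: &[PageStats], target: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= target && target <= s.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Three pages of four values each: 0..=3, 4..=7, 8..=11.
    let values: Vec<i64> = (0..12).collect();
    let stats = build_page_stats(&values, 4);

    // Only the last page can contain 9, so the first two are skipped.
    println!("pages to read for value 9: {:?}", pages_to_read(&stats, 9));
}
```

Without the per-page statistics, a reader evaluating the same predicate has no way to rule out any page and has to decode all of them, which is the performance gap described above.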

Describe the solution you'd like

I would like the created parquet files to have the necessary index structures -- apache/arrow-rs#4823 tracks adding such an API upstream in arrow-rs.

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement (New feature or request) label Sep 18, 2023
alamb changed the title from "Write out page and column statistics when" to "Create page and column statistics when a parquet file is written in parallel" Sep 18, 2023