Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column index writer for parquet #1935

Merged
merged 1 commit into from
Jun 30, 2022

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

part of #1777

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 24, 2022
@liukun4515 liukun4515 changed the title add column index writer for parquet [WIP] add column index writer for parquet Jun 24, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 24, 2022

Codecov Report

Merging #1935 (2ec0dfa) into master (9059cbf) will increase coverage by 0.12%.
The diff coverage is 88.34%.

❗ Current head 2ec0dfa differs from pull request most recent head a5faeb6. Consider uploading reports for the commit a5faeb6 to get more accurate results

@@            Coverage Diff             @@
##           master    #1935      +/-   ##
==========================================
+ Coverage   83.41%   83.54%   +0.12%     
==========================================
  Files         214      221       +7     
  Lines       57004    57395     +391     
==========================================
+ Hits        47551    47949     +398     
+ Misses       9453     9446       -7     
Impacted Files Coverage Δ
arrow/src/alloc/mod.rs 78.04% <ø> (ø)
arrow/src/array/array_dictionary.rs 91.53% <ø> (ø)
arrow/src/array/array_list.rs 96.16% <ø> (ø)
arrow/src/array/builder/decimal_builder.rs 76.81% <0.00%> (+7.62%) ⬆️
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/buffer/immutable.rs 98.48% <ø> (ø)
arrow/src/bytes.rs 73.07% <ø> (ø)
arrow/src/datatypes/datatype.rs 65.42% <ø> (ø)
parquet/src/arrow/array_reader/mod.rs 88.23% <ø> (-2.54%) ⬇️
parquet/src/arrow/arrow_writer/mod.rs 97.53% <ø> (ø)
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9059cbf...a5faeb6. Read the comment docs.

@liukun4515 liukun4515 force-pushed the parquet_column_index_write branch 2 times, most recently from ea799a9 to 798af4f Compare June 28, 2022 07:20
@liukun4515 liukun4515 changed the title [WIP] add column index writer for parquet add column index writer for parquet Jun 28, 2022
@liukun4515 liukun4515 marked this pull request as ready for review June 28, 2022 07:21
@liukun4515
Copy link
Contributor Author

@Ted-Jiang PTAL

@liukun4515
Copy link
Contributor Author

Do we need to add an option to control this feature in the WriterProperties struct?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very nice, I think it would be good to maintain a clearer separation between file-level metadata and the index metadata. Mixing the two not only leads to the mutability issues you've run into, but it also can make it hard to reason about what fields are populated when. Perhaps we could have something like a ColumnChunkIndex or something?

parquet/src/file/writer.rs Outdated Show resolved Hide resolved
parquet/src/file/writer.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/column/writer.rs Outdated Show resolved Hide resolved
parquet/src/column/writer.rs Outdated Show resolved Hide resolved
parquet/src/column/writer.rs Outdated Show resolved Hide resolved
@liukun4515 liukun4515 force-pushed the parquet_column_index_write branch 2 times, most recently from cf7cbfc to ffa3d68 Compare June 29, 2022 14:35
@liukun4515 liukun4515 requested a review from tustvold June 29, 2022 14:41
@liukun4515
Copy link
Contributor Author

@tustvold reset, PTAL

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it, some minor nits and then this can go in 😄

parquet/src/column/writer.rs Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
@tustvold tustvold added the api-change Changes to the arrow API label Jun 29, 2022
null_pages: Vec<bool>,
min_values: Vec<Vec<u8>>,
max_values: Vec<Vec<u8>>,
// TODO: calc the order for all pages in this column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is useful for checking whether use page index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is useful for checking whether use page index

If the data in the pages are ordered by ascend or descend, we can use the binary search to accelerate the page filter.

Copy link
Contributor Author

@liukun4515 liukun4515 Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the boundaryorder is UNORDERED, we need to filter the page one by one.

@tustvold tustvold merged commit c47d14c into apache:master Jun 30, 2022
@tustvold
Copy link
Contributor

Thank you 🥇

@liukun4515 liukun4515 deleted the parquet_column_index_write branch June 30, 2022 11:27
@alamb alamb changed the title add column index writer for parquet Add column index writer for parquet Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants