This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support for reading indexed parquet pages #923

Merged: jorgecarleitao merged 14 commits into main from parquet2_migrate on Apr 15, 2022


jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Mar 20, 2022

This PR adds support for bloom filters (read) and column/offset indexes (read and write).

Work needed:

  • map parquet types to arrow types to be able to map min and max in the column indexes to arrow, just like we do for statistics.
  • generalize deserialization to support offsetted parquet pages [difficult]

The writing of offsets works out of the box and is active whenever the option write_statistics is active, since page offsets are just page-level statistics written outside of a page header.
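
To illustrate why page-level statistics help a reader, here is a minimal sketch in plain Rust. It is not the parquet2 or arrow2 API: `PageStats`, `Interval` and `select_pages` are hypothetical names, and the sketch assumes each page records its minimum, its maximum and its number of rows. Given an equality predicate, it returns the row intervals of the pages that may contain the value, so the remaining pages need not be read or decompressed.

```rust
/// Hypothetical page-level statistics (illustration only, not the parquet2 API).
#[derive(Debug, Clone, PartialEq)]
struct PageStats {
    min: i64,
    max: i64,
    num_rows: usize,
}

/// A contiguous range of rows, relative to the start of the column chunk.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    start: usize,
    length: usize,
}

/// Returns the row intervals of the pages that may contain `value`.
/// Intervals of adjacent selected pages are merged into one.
fn select_pages(pages: &[PageStats], value: i64) -> Vec<Interval> {
    let mut intervals: Vec<Interval> = Vec::new();
    let mut row = 0; // index of the first row of the next page
    for page in pages {
        let start = row;
        row += page.num_rows;
        if value < page.min || value > page.max {
            continue; // the page cannot contain `value`: skip it
        }
        if let Some(last) = intervals.last_mut() {
            if last.start + last.length == start {
                // the previous page was also selected: extend its interval
                last.length += page.num_rows;
                continue;
            }
        }
        intervals.push(Interval { start, length: page.num_rows });
    }
    intervals
}
```

With three pages covering the ranges 0 to 9, 10 to 19 and 15 to 30, a lookup for 17 would select only the last two pages and report them as a single interval.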

See jorgecarleitao/parquet2#102 and jorgecarleitao/parquet2#107 for more details

@jorgecarleitao jorgecarleitao added the feature A new feature label Mar 20, 2022
@jorgecarleitao jorgecarleitao force-pushed the parquet2_migrate branch 2 times, most recently from 0d6629b to aa6e4b7 Compare March 21, 2022 06:40
@codecov

codecov bot commented Mar 21, 2022

Codecov Report

Merging #923 (c584cd4) into main (fafc70d) will increase coverage by 0.06%.
The diff coverage is 63.37%.

@@            Coverage Diff             @@
##             main     #923      +/-   ##
==========================================
+ Coverage   71.18%   71.24%   +0.06%     
==========================================
  Files         346      351       +5     
  Lines       18956    19342     +386     
==========================================
+ Hits        13494    13781     +287     
- Misses       5462     5561      +99     
Impacted Files                                        Coverage Δ
src/error.rs                                          35.71% <0.00%> (-2.75%) ⬇️
src/io/parquet/mod.rs                                 0.00% <0.00%> (ø)
src/io/parquet/read/deserialize/mod.rs                67.24% <ø> (ø)
src/io/parquet/read/mod.rs                            100.00% <ø> (ø)
src/io/parquet/read/schema/mod.rs                     100.00% <ø> (ø)
src/io/parquet/read/deserialize/simple.rs             39.61% <18.75%> (-0.27%) ⬇️
src/io/parquet/read/indexes/primitive.rs              25.97% <25.97%> (ø)
src/io/parquet/read/deserialize/binary/basic.rs       63.69% <41.86%> (-17.50%) ⬇️
src/io/parquet/read/statistics/primitive.rs           60.00% <43.75%> (+2.00%) ⬆️
src/io/parquet/read/deserialize/boolean/nested.rs     62.74% <45.45%> (+10.20%) ⬆️
... and 42 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao jorgecarleitao force-pushed the parquet2_migrate branch 9 times, most recently from 5e986bb to 62f889d Compare March 22, 2022 22:55
@jorgecarleitao
Owner Author

Reading and writing indexes is now complete.

The next and last step is to make deserialization work with offsetted pages. The last commit adds this support for required primitive and binary pages, together with a test for it.

The tedious work of supporting offsetted pages for the remaining page representations begins... 🚂🚂🚂🚂

@jorgecarleitao
Owner Author

jorgecarleitao commented Mar 23, 2022

  • flat
  • flat dictionary-encoded
  • flat dictionary-encoded to dictionary
  • nested deferred to a future PR (e.g. spark does not support it yet)

@jorgecarleitao jorgecarleitao marked this pull request as ready for review March 25, 2022 06:22
@jorgecarleitao jorgecarleitao force-pushed the parquet2_migrate branch 3 times, most recently from 09d97f3 to 9059049 Compare April 11, 2022 21:22
@jorgecarleitao
Owner Author

After 2 or 3 re-writes on both parquet2 and arrow2, I think that the design is finally here.

There is no performance degradation in using indexes - indexes only improve performance.

Indexed reads are not yet supported for nested types (I think that this is also true in parquet-mr). Supporting them is feasible but requires a bit more work: indexed reading requires skipping values, and this must be done correctly in nested types. parquet2 does not yet have auxiliary iterators for this.
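
For the flat (non-nested) case, the skipping that indexed reads require can be sketched as follows. This is an illustration in plain Rust, not the parquet2 or arrow2 implementation: `RowRange` and `filter_page` are hypothetical names, and the page is modelled as an iterator over already-decoded values whose first row index within the column chunk is known. Nested types are harder because one row may span a variable number of values, which this sketch does not attempt to handle.

```rust
/// A range of selected rows, relative to the start of the column chunk
/// (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy)]
struct RowRange {
    start: usize,
    length: usize,
}

/// Keeps only the values of a page whose row index falls inside one of `ranges`.
/// Assumes `ranges` is sorted by `start` and that the ranges do not overlap.
fn filter_page<T>(
    mut values: impl Iterator<Item = T>,
    page_first_row: usize,
    ranges: &[RowRange],
) -> Vec<T> {
    let mut row = page_first_row; // row index of the next value in `values`
    let mut out = Vec::new();
    for range in ranges {
        let end = range.start + range.length;
        if end <= row {
            continue; // the range lies entirely before the current position
        }
        // Skip the rows between the current position and the start of the range.
        let to_skip = range.start.saturating_sub(row);
        for _ in 0..to_skip {
            if values.next().is_none() {
                return out; // page exhausted before reaching the range
            }
        }
        row += to_skip;
        // Take the selected rows; a range may have started in an earlier page.
        for _ in 0..(end - row) {
            match values.next() {
                Some(value) => out.push(value),
                None => return out, // the rest of the range is in later pages
            }
        }
        row = end;
    }
    out
}
```

A range that starts before the page or ends after it is clipped to the page, which is why the function tracks the absolute row index rather than a position within the page.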

@jorgecarleitao jorgecarleitao changed the title migrate to latest parquet2 (add support for bloom filters and page indexes) Migrate to latest parquet2 Apr 15, 2022
@jorgecarleitao
Owner Author

jorgecarleitao commented Apr 15, 2022

This is still not ready for end-users (we need to pipe the API to the frontend), but it does no harm and is heavily tested. Thus, merging as is.

@jorgecarleitao jorgecarleitao merged commit cd5a4c0 into main Apr 15, 2022
@jorgecarleitao jorgecarleitao deleted the parquet2_migrate branch April 15, 2022 09:01
@jorgecarleitao jorgecarleitao changed the title Migrate to latest parquet2 Added support for reading indexed parquet pages Apr 27, 2022