Added support for reading indexed parquet pages #923
Force-pushed from 0d6629b to aa6e4b7.
Codecov Report
@@ Coverage Diff @@
## main #923 +/- ##
==========================================
+ Coverage 71.18% 71.24% +0.06%
==========================================
Files 346 351 +5
Lines 18956 19342 +386
==========================================
+ Hits 13494 13781 +287
- Misses 5462 5561 +99
Force-pushed from 5e986bb to 62f889d.
Reading and writing indexes is now complete. The next and last step is to make deserialization work with offsetted pages. The last commit adds support for this for required primitive and binary pages, together with a test for it. The tedious work of supporting offsetted pages for the remaining page representations begins. 🚂🚂🚂🚂
Force-pushed from 09d97f3 to 9059049.
After 2 or 3 rewrites of both parquet2 and arrow2, I think that the design is finally here. There is no performance degradation in using indexes: they only improve performance. Indexed reads are not yet supported for nested types (I think that this is also true in parquet-mr). This is feasible, but requires a bit more work, since indexed reading requires skipping values and this must be done correctly for nested types. parquet2 still does not have auxiliary iterators for this.
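To make the "skipping values" part concrete, here is a minimal sketch of what it amounts to for a required (non-nullable, non-nested) page, where one row is exactly one value. It is not the code in this PR: `Interval`, `intervals_in_page` and `take_selected` are hypothetical names, and it only uses the standard library.

```rust
/// A run of selected rows: `length` rows starting at row `start`.
/// Hypothetical type for illustration; not the parquet2/arrow2 API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    start: usize,
    length: usize,
}

/// Translates selected intervals (in row-group coordinates) into intervals
/// relative to a page that covers rows `page_start..page_start + page_length`.
/// Intervals that do not overlap the page are dropped.
fn intervals_in_page(selected: &[Interval], page_start: usize, page_length: usize) -> Vec<Interval> {
    let page_end = page_start + page_length;
    selected
        .iter()
        .filter_map(|interval| {
            // Clamp the interval to the rows that the page actually holds.
            let start = interval.start.max(page_start);
            let end = (interval.start + interval.length).min(page_end);
            if start < end {
                Some(Interval { start: start - page_start, length: end - start })
            } else {
                None
            }
        })
        .collect()
}

/// Keeps only the selected values of an already decoded required page,
/// skipping everything between the in-page intervals.
fn take_selected<T: Copy>(values: &[T], in_page: &[Interval]) -> Vec<T> {
    let mut out = Vec::new();
    for interval in in_page {
        out.extend_from_slice(&values[interval.start..interval.start + interval.length]);
    }
    out
}
```

For nested types this mapping no longer holds, because a row can span a variable number of values; the skip then has to be driven by the repetition levels, which is the extra work mentioned above.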
Force-pushed from 1e8256a to bf3965d.
Force-pushed from bf3965d to c6be676.
Force-pushed from 728698f to e98448a.
This is still not ready for end-users (we need to pipe the API to the frontend), but it does no harm and is heavily tested. Thus, merging as is.
This PR adds support for bloom filters (read) and column/offset indexes (read and write).
Work needed:
The writing of offsets works out of the box and is active whenever the option `write_statistics` is active, since page offsets are just page-level statistics written outside of a page header. See jorgecarleitao/parquet2#102 and jorgecarleitao/parquet2#107 for more details.
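On the read side, the idea behind these indexes can be sketched as follows: the column index holds per-page min/max statistics, the offset index holds where each page lives and which row it starts at, and together they let a reader pick only the pages that may match a filter. This is an illustration under simplifying assumptions (statistics reduced to `i64`, integer widths simplified, null pages ignored), not the API added by this PR; `PageStats`, `PageLocation`, `Interval` and `select_pages` are hypothetical names.

```rust
/// Per-page min/max, as found in a Parquet column index (simplified to i64).
#[derive(Debug, Clone, PartialEq)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Per-page location, mirroring the fields of a Parquet offset index entry
/// (integer widths simplified for the example).
#[derive(Debug, Clone, PartialEq)]
struct PageLocation {
    offset: u64,               // byte offset of the page in the file
    compressed_page_size: u32, // size of the page in bytes, header included
    first_row_index: usize,    // row-group index of the first row in the page
}

/// A run of rows: `length` rows starting at row `start`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    start: usize,
    length: usize,
}

/// Returns the pages whose [min, max] range overlaps [lower, upper],
/// together with the rows of the row group that each page covers.
fn select_pages(
    stats: &[PageStats],
    locations: &[PageLocation],
    num_rows: usize,
    lower: i64,
    upper: i64,
) -> Vec<(PageLocation, Interval)> {
    assert_eq!(stats.len(), locations.len());
    stats
        .iter()
        .zip(locations)
        .enumerate()
        // A page can be skipped only if its range cannot contain a match.
        .filter(|(_, (s, _))| s.max >= lower && s.min <= upper)
        .map(|(i, (_, location))| {
            // A page ends where the next one starts (or at the end of the row group).
            let end = locations
                .get(i + 1)
                .map(|next| next.first_row_index)
                .unwrap_or(num_rows);
            let rows = Interval {
                start: location.first_row_index,
                length: end - location.first_row_index,
            };
            (location.clone(), rows)
        })
        .collect()
}
```

A reader would then seek to each selected `offset`, read `compressed_page_size` bytes, and use the row intervals to line the page up with the other columns of the row group.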