[C++][Parquet] Write parquet page index #34053
Comments
Just curious, would page index optimization be added to the Arrow interface in the long term, after the low-level reader/writer are finished? I'd expect that also requires changing the I/O unit from row group to page.
Sounds OK, but it seems to require high-performance I/O merging, plus some benchmarks and testing.
There is already I/O coalescing, but its range unit is ColumnChunk (see arrow/cpp/src/arrow/io/caching.cc, line 175 in 39bad54).
Perhaps it is not necessary to break the I/O down to pages, since parquet-format states that ColumnChunk is the I/O unit.
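To make the coalescing idea above concrete, here is a minimal sketch of merging nearby byte ranges into fewer reads. It is not the code in `caching.cc`; `ByteRange`, `CoalesceRanges`, and the `hole_size_limit` parameter are hypothetical names for illustration, and the sketch deliberately omits any cap on the merged range size.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical byte range (not Arrow's own range type).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap to the previous range is at most `hole_size_limit`
// bytes. Simplified on purpose: merged ranges may grow without bound.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // ignore empty ranges
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        // Close enough (or overlapping): extend the previous range.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

The same routine works whether the input ranges are whole column chunks or individual pages; only the granularity of the input changes.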
Yes, the standard says so, but we can still use it. Currently the parquet-cpp implementation uses both page I/O and chunk I/O:
Besides, I think currently the implementation of
Page index is a fairly new concept in the history of the Parquet spec. In the old days, when the page index was unavailable, the I/O unit was always the column chunk, because we could only drop row groups based on column statistics. Once the page index is supported, a finer-grained I/O unit is a must if we want to use it to skip specific pages efficiently. @XinyuZeng
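As a rough illustration of why page-level I/O follows from the page index, the sketch below selects only the pages whose min/max range can overlap a predicate and returns their byte ranges. The types `PageEntry` and `PageRange` and the function `SelectPages` are hypothetical and simplified to `int64_t` values; they are not the parquet-cpp API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one page's entry in the page index:
// min/max as found in the column index, location as found in the offset index.
struct PageEntry {
  int64_t min_value;
  int64_t max_value;
  int64_t offset;           // byte offset of the page in the file
  int32_t compressed_size;  // size of the page in bytes
};

struct PageRange {
  int64_t offset;
  int64_t length;
};

// Return the byte ranges of the pages that may hold a value in [lo, hi].
// Pages whose [min, max] interval cannot overlap the predicate are skipped.
std::vector<PageRange> SelectPages(const std::vector<PageEntry>& pages,
                                   int64_t lo, int64_t hi) {
  std::vector<PageRange> selected;
  for (const PageEntry& p : pages) {
    const bool may_match = p.max_value >= lo && p.min_value <= hi;
    if (may_match) {
      selected.push_back({p.offset, p.compressed_size});
    }
  }
  return selected;
}
```

With only column-chunk statistics, the reader would have to fetch every page of the chunk; here the ranges of the non-matching pages never reach the I/O layer.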
### Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

### What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.

### Are these changes tested?

Not yet, will be added later.

### Are there any user-facing changes?

`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.

* Closes: #34053

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Will Jones <[email protected]>
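The collection step described in the change above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code from the PR: `PageIndexCollector`, `PageIndexEntry`, and `AddPage` are hypothetical names, values are restricted to `int64_t`, and empty pages are simply ignored, whereas the real format tracks null pages explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page record combining statistics and location.
struct PageIndexEntry {
  int64_t min_value;
  int64_t max_value;
  int64_t offset;           // byte offset of the page in the file
  int32_t compressed_size;  // size of the page in bytes
  int64_t first_row_index;  // index of the page's first row in the row group
};

// Accumulates one entry per data page while a column chunk is written.
class PageIndexCollector {
 public:
  void AddPage(const std::vector<int64_t>& values, int64_t offset,
               int32_t compressed_size) {
    if (values.empty()) return;  // simplification: skip empty pages
    const auto mm = std::minmax_element(values.begin(), values.end());
    entries_.push_back(
        {*mm.first, *mm.second, offset, compressed_size, next_row_});
    next_row_ += static_cast<int64_t>(values.size());
  }

  // Entries in write order, ready to be serialized after the data pages.
  const std::vector<PageIndexEntry>& entries() const { return entries_; }

 private:
  std::vector<PageIndexEntry> entries_;
  int64_t next_row_ = 0;
};
```

A writer built along these lines would call `AddPage` each time it flushes a data page and serialize `entries()` once the row group is complete.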
Describe the enhancement requested
The Parquet C++ reader already supports reading the page index from a file. Now it is time to implement the write logic.
Component(s)
C++, Parquet