[C++][Parquet] Write parquet page index #34053

Closed
wgtmac opened this issue Feb 6, 2023 · 5 comments · Fixed by #34054

Comments

@wgtmac
Member

wgtmac commented Feb 6, 2023

Describe the enhancement requested

The Parquet C++ reader already supports reading the page index from a file. Now it is time to implement the write logic.

Component(s)

C++, Parquet

@XinyuZeng
Contributor

Just curious, would page index optimization be added to the Arrow interface in the long term, after the low-level reader and writer are finished? I'd expect that to also require changing the I/O unit from row group to page.

@mapleFU
Member

mapleFU commented Feb 8, 2023

> Just curious, would page index optimization be added to the Arrow interface in the long term, after the low-level reader and writer are finished? I'd expect that to also require changing the I/O unit from row group to page.

Sounds OK, but it seems to require high-performance I/O merging, plus some benchmarks and testing.

@XinyuZeng
Contributor

> > Just curious, would page index optimization be added to the Arrow interface in the long term, after the low-level reader and writer are finished? I'd expect that to also require changing the I/O unit from row group to page.
>
> Sounds OK, but it seems to require high-performance I/O merging, plus some benchmarks and testing.

There is already I/O coalescing, but its range unit is the ColumnChunk.

ranges = internal::CoalesceReadRanges(std::move(ranges), options.hole_size_limit,

Perhaps it is not necessary to break the I/O down to page level, since the Parquet format states that the ColumnChunk is the I/O unit.
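To make the coalescing idea concrete, here is a minimal standalone sketch of merging read ranges whose gaps do not exceed a hole size limit. It is not Arrow's `internal::CoalesceReadRanges`: the `ByteRange` struct and the `CoalesceRanges` function are hypothetical names, and the sketch ignores any upper bound on the size of a merged range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical byte range for illustration; not ::arrow::io::ReadRange.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap to the previous range is at most hole_size_limit
// bytes. Overlapping ranges are merged as well. Empty ranges are dropped.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit) {
  assert(hole_size_limit >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });

  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // nothing to read
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        // Close enough: extend the previous range to cover this one.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

The routine does not care whether a range describes a column chunk or a single page, so the same approach would apply if the range unit were made finer.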

@mapleFU
Member

mapleFU commented Feb 8, 2023

> Perhaps it is not necessary to break the I/O down to page level, since the Parquet format states that the ColumnChunk is the I/O unit.

Yes. Although the standard says so, we can still use page-level I/O. Currently the parquet-cpp implementation uses both page-level and chunk-level I/O:

  • Arrow can use ReadRangeCache to serve chunk-level I/O; however, I don't think it provides good performance. Currently it reads the whole buffer and caches it in an ::arrow::io::BufferReader.
  • If no cache is used, an ArrowInputStream is created directly on the input, and PageReader tries to create a read buffer page by page.

Besides, I think the current implementation of ReadRangeCache is naive. I'm not sure it will work well.

@wgtmac
Member Author

wgtmac commented Feb 9, 2023

The page index is a fairly new concept in the history of the Parquet spec. In the old days, when the page index was unavailable, the I/O unit was always the column chunk, because we could only drop row groups based on the column statistics. Once the page index is supported, a finer-grained I/O unit is a must if we want to leverage it to skip specific pages efficiently. @XinyuZeng
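As a rough illustration of why a page index calls for page-level I/O, the sketch below selects only the pages whose min/max statistics can overlap a range predicate. The `PageEntry` struct and `SelectPages` function are hypothetical and are not part of the Parquet C++ API. The sketch assumes an `int64` column and ignores all-null pages.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page entry combining the statistics kept in a column
// index with the location kept in an offset index.
struct PageEntry {
  int64_t min_value;
  int64_t max_value;
  int64_t offset;           // byte offset of the page in the file
  int32_t compressed_size;  // bytes to read for this page
};

// Return the indices of pages that may contain values in [lo, hi].
// Pages whose [min, max] interval lies entirely outside are skipped.
std::vector<std::size_t> SelectPages(const std::vector<PageEntry>& pages,
                                     int64_t lo, int64_t hi) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const PageEntry& p = pages[i];
    if (p.max_value < lo || p.min_value > hi) continue;  // cannot match
    selected.push_back(i);
  }
  return selected;
}
```

Only the byte ranges of the selected pages then need to be fetched, which is where a finer-grained I/O unit pays off.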

wgtmac added a commit to wgtmac/arrow that referenced this issue Feb 23, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Feb 24, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Feb 24, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Feb 24, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Feb 28, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Mar 7, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Apr 6, 2023
wjones127 pushed a commit that referenced this issue Apr 10, 2023
### Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

### What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.
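The collection step can be pictured with the standalone sketch below. It is not the code from this PR: `PageIndexCollector` and its members are hypothetical names. It assumes a flat `int64` column and stores min/max as integers, whereas the Parquet format stores them as binary values. Serialization is left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mirrors the fields of a page location in the Parquet offset index.
struct PageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

// Per-column-chunk index data, one entry per data page.
struct ColumnChunkPageIndex {
  std::vector<bool> null_pages;
  std::vector<int64_t> min_values;
  std::vector<int64_t> max_values;
  std::vector<int64_t> null_counts;
  std::vector<PageLocation> page_locations;
};

// Hypothetical collector: call AddPage once per data page as it is written.
class PageIndexCollector {
 public:
  void AddPage(const std::vector<int64_t>& non_null_values, int64_t null_count,
               int64_t file_offset, int32_t compressed_size) {
    const bool all_null = non_null_values.empty();
    index_.null_pages.push_back(all_null);
    // For an all-null page the min/max entries are placeholders.
    index_.min_values.push_back(
        all_null ? 0 : *std::min_element(non_null_values.begin(),
                                         non_null_values.end()));
    index_.max_values.push_back(
        all_null ? 0 : *std::max_element(non_null_values.begin(),
                                         non_null_values.end()));
    index_.null_counts.push_back(null_count);
    index_.page_locations.push_back({file_offset, compressed_size, next_row_});
    // Flat column assumption: one row per value, null or not.
    next_row_ += static_cast<int64_t>(non_null_values.size()) + null_count;
  }

  const ColumnChunkPageIndex& index() const { return index_; }

 private:
  ColumnChunkPageIndex index_;
  int64_t next_row_ = 0;
};
```
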

### Are these changes tested?

Not yet, will be added later.

### Are there any user-facing changes?

`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.
* Closes: #34053

Authored-by: Gang Wu
Signed-off-by: Will Jones
@wjones127 wjones127 added this to the 12.0.0 milestone Apr 10, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023