Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38326: [C++][Parquet] check the decompressed page size same as size in page header #38327

Merged
merged 5 commits into from
Oct 19, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Oct 18, 2023

Rationale for this change

As mentioned in issue, currently we only decompress the page without checking the decompress size.
This patch add a checkings.

What changes are included in this PR?

Throw exception when size not matches in SerializedPageReader::DecompressIfNeeded

Are these changes tested?

Yes.

Are there any user-facing changes?

Non-conforming files may throw an exception while they would silently return invalid results before.

@mapleFU mapleFU requested a review from wgtmac as a code owner October 18, 2023 13:30
@mapleFU
Copy link
Member Author

mapleFU commented Oct 18, 2023

cc @wgtmac @pitrou

@github-actions
Copy link

⚠️ GitHub issue #38326 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 18, 2023
@pitrou
Copy link
Member

pitrou commented Oct 18, 2023

Can you perhaps try reading some third-party Parquet files with this, if you can find some?

@mapleFU
Copy link
Member Author

mapleFU commented Oct 18, 2023

Let me generate a hand-written compression file with "bad" header

@mapleFU mapleFU requested a review from pitrou October 18, 2023 17:08
@mapleFU mapleFU force-pushed the check-decompress-size branch from 4218afd to 98d047b Compare October 18, 2023 17:29
Comment on lines +913 to +914
// Some decompressor, like zstd, might be able to detect the error
// before checking the page size.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZSTD would throw an IOError: Corrupt ... here.

page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
if (decompressed_len != uncompressed_len - levels_byte_len) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requirement is that uncompressed_len from the page header is something that we can always trust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen the arrow-rs implemention, it check the size here. The main problem is that the size is not exact, the decoder will have bad tail buffer, like getting 0 in that multi-gzip file.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @mapleFU , mostly LGTM.

cpp/src/parquet/file_deserialize_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/file_deserialize_test.cc Show resolved Hide resolved
cpp/src/parquet/file_deserialize_test.cc Show resolved Hide resolved
@mapleFU mapleFU requested a review from pitrou October 19, 2023 14:36
@pitrou pitrou merged commit 4712dab into apache:main Oct 19, 2023
31 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Oct 19, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 4712dab.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…as size in page header (apache#38327)

### Rationale for this change

As mentioned in issue, currently we only decompress the page without checking the decompress size. 
This patch add a checkings.

### What changes are included in this PR?

Throw exception when size not matches in `SerializedPageReader::DecompressIfNeeded`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Non-conforming files may throw an exception while they would silently return invalid results before.

* Closes: apache#38326

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…as size in page header (apache#38327)

### Rationale for this change

As mentioned in issue, currently we only decompress the page without checking the decompress size. 
This patch add a checkings.

### What changes are included in this PR?

Throw exception when size not matches in `SerializedPageReader::DecompressIfNeeded`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Non-conforming files may throw an exception while they would silently return invalid results before.

* Closes: apache#38326

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…as size in page header (apache#38327)

### Rationale for this change

As mentioned in issue, currently we only decompress the page without checking the decompress size. 
This patch add a checkings.

### What changes are included in this PR?

Throw exception when size not matches in `SerializedPageReader::DecompressIfNeeded`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Non-conforming files may throw an exception while they would silently return invalid results before.

* Closes: apache#38326

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Compression: check the decompressed page size same as size in page header
3 participants