GH-38326: [C++][Parquet] check the decompressed page size same as size in page header #38327
Can you perhaps try reading some third-party Parquet files with this, if you can find some?
Let me generate a hand-written compressed file with a "bad" header.
// Some decompressor, like zstd, might be able to detect the error
// before checking the page size.
ZSTD would throw an IOError: Corrupt ... here.
page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
if (decompressed_len != uncompressed_len - levels_byte_len) {
The requirement is that uncompressed_len from the page header is something that we can always trust.
I've looked at the arrow-rs implementation, and it checks the size here. The main problem is that when the size is not exact, the decoder ends up with a bad tail buffer, like getting 0 in that multi-gzip file.
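To illustrate the tail-buffer problem described above, here is a minimal sketch, not the Arrow or arrow-rs code. `DecodeWithoutCheck` is a hypothetical helper, and the sketch assumes the output buffer is zero-filled and sized from the header before decompression. If the codec produces fewer bytes than the header declares and nobody compares the two sizes, the decoder reads the untouched tail as ordinary values:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical illustration: `actual_values` stands in for what the codec
// really produced, `declared_count` for the value count implied by the
// size in the page header. No size check is performed.
std::vector<std::int32_t> DecodeWithoutCheck(
    const std::vector<std::int32_t>& actual_values, std::size_t declared_count) {
  if (declared_count == 0) return {};

  // Assumption: the buffer is sized from the header and starts zero-filled.
  std::vector<std::uint8_t> buffer(declared_count * sizeof(std::int32_t), 0);

  // The "decompressor" only writes the bytes it actually has.
  const std::size_t produced =
      std::min(actual_values.size(), declared_count) * sizeof(std::int32_t);
  if (produced > 0) {
    std::memcpy(buffer.data(), actual_values.data(), produced);
  }

  // The decoder trusts the header and reads the whole buffer,
  // including the tail that was never written.
  std::vector<std::int32_t> decoded(declared_count);
  std::memcpy(decoded.data(), buffer.data(), buffer.size());
  return decoded;
}
```

With three real values and a header that declares five, the last two decoded values come from the unwritten tail, which is the kind of silent corruption a size check would turn into an error.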
Thanks a lot @mapleFU, mostly LGTM.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 4712dab. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
…as size in page header (apache#38327) ### Rationale for this change As mentioned in issue, currently we only decompress the page without checking the decompress size. This patch add a checkings. ### What changes are included in this PR? Throw exception when size not matches in `SerializedPageReader::DecompressIfNeeded` ### Are these changes tested? Yes. ### Are there any user-facing changes? Non-conforming files may throw an exception while they would silently return invalid results before. * Closes: apache#38326 Authored-by: mwish <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
As mentioned in the issue, we currently decompress the page without checking the decompressed size.
This patch adds a check.
What changes are included in this PR?
Throw an exception when the decompressed size does not match the size in the page header, in SerializedPageReader::DecompressIfNeeded.
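The shape of such a check can be sketched as follows. This is a simplified, self-contained illustration and not the code from this PR: `ToyDecompress` is a hypothetical stand-in for a real codec (it expands `(count, byte)` pairs), `DecompressChecked` is a hypothetical wrapper, and `std::runtime_error` stands in for whatever exception type the reader actually throws.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy codec, NOT the Arrow Codec API: the input is a list of
// (count, byte) pairs. Returns the number of bytes written to `out`, and
// throws if `out` is too small, much as a real codec would report an error.
std::size_t ToyDecompress(const std::vector<std::uint8_t>& in,
                          std::vector<std::uint8_t>& out) {
  std::size_t written = 0;
  for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
    for (std::uint8_t n = 0; n < in[i]; ++n) {
      if (written == out.size()) {
        throw std::runtime_error("toy codec: output buffer too small");
      }
      out[written++] = in[i + 1];
    }
  }
  return written;
}

// Decompress into a buffer sized from the header, then verify that the
// codec produced exactly the number of bytes the header declared.
std::vector<std::uint8_t> DecompressChecked(
    const std::vector<std::uint8_t>& compressed, std::size_t uncompressed_len) {
  std::vector<std::uint8_t> out(uncompressed_len);
  const std::size_t decompressed_len = ToyDecompress(compressed, out);
  if (decompressed_len != uncompressed_len) {
    throw std::runtime_error(
        "Page didn't decompress to expected size, expected: " +
        std::to_string(uncompressed_len) +
        ", but got: " + std::to_string(decompressed_len));
  }
  return out;
}
```

For example, the input `{3, 7, 2, 9}` expands to five bytes, so `DecompressChecked(page, 5)` succeeds, while a header that declares 8 bytes makes the same call throw instead of returning a buffer with an unwritten tail.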
Are these changes tested?
Yes.
Are there any user-facing changes?
Non-conforming files may now throw an exception, whereas before they would silently return invalid results.