Parquet: Verify 32-bit CRC checksum when decoding pages #6290

Merged on Sep 28, 2024 (16 commits)

Conversation

xmakro (Contributor) commented Aug 22, 2024

Closes #6289

Please let me know if we should expose this in the reader APIs instead of a crate feature
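
For context on what verifying a page checksum involves, here is a minimal sketch. It is not the code from this PR. It assumes the Parquet page header carries an optional `crc` field stored as a signed 32-bit integer, and that the checksum is the standard CRC-32 (the gzip variant) computed over the page bytes as written to disk, excluding the header. The names `crc32` and `verify_page_crc` are hypothetical, and the CRC is computed bit by bit to keep the example dependency-free; a real reader would more likely use an optimized CRC library.

```rust
/// Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant used by gzip.
/// Hypothetical helper for illustration only; slow but dependency-free.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical check: compare the checksum stored in a page header with the
/// checksum of the page bytes. A missing checksum is treated as "nothing to
/// verify", which is an assumption of this sketch.
fn verify_page_crc(page_bytes: &[u8], stored_crc: Option<i32>) -> Result<(), String> {
    match stored_crc {
        None => Ok(()),
        Some(stored) => {
            let actual = crc32(page_bytes);
            // The header field is signed; reinterpret its bits as unsigned.
            let expected = stored as u32;
            if actual == expected {
                Ok(())
            } else {
                Err(format!(
                    "page CRC mismatch: expected {expected:#010x}, got {actual:#010x}"
                ))
            }
        }
    }
}
```

Calling `verify_page_crc(page_bytes, header_crc)` before decompressing a page would surface corruption as an error instead of letting damaged bytes reach the decoder. Whether that check is switched on by a crate feature or by a reader option is the design question raised above.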

github-actions bot added the "parquet" label (Changes to the parquet crate) on Aug 22, 2024

tustvold (Contributor) left a comment

Can we get some tests, please?

parquet/Cargo.toml (outdated, resolved)
parquet/src/errors.rs (outdated, resolved)

alamb (Contributor) left a comment

I agree with @tustvold that this code needs to have some tests to ensure we don't break the feature in the future

Also, I think the feature flag should be documented here: https://crates.io/crates/parquet

mapleFU (Member) commented Aug 27, 2024

FYI: https://github.com/apache/parquet-testing/tree/master/data
You can check the files whose names contain "checksum".

xmakro (Contributor, Author) commented Sep 1, 2024

Thanks for the pointers. I added the tests and documented the feature flag. Please take a look

alamb (Contributor) commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

etseidl (Contributor) left a comment

Thanks for this! Please run cargo +stable fmt --all and check in the result. Have you run any benchmarks to see if there is a measurable impact from the crc calculation?

parquet/Cargo.toml (outdated, resolved)
parquet/README.md (resolved)
parquet/src/file/serialized_reader.rs (outdated, resolved)
parquet/tests/arrow_reader/checksum.rs (resolved)

etseidl (Contributor) left a comment

Thanks @xmakro! Looks good to me. I'm fine with this as a feature, but will defer to others who may have a stronger opinion.

Jefffrey (Contributor) left a comment

Looks good to me 👍

Probably just need to clarify wording on the feature flag.

CI failure looks unrelated; I will try to look into it.

Edit: looks like CI failure was resolved by #6437. Merging in latest from master should resolve the issue

parquet/README.md (outdated, resolved)
parquet/src/file/serialized_reader.rs (resolved)
parquet/tests/arrow_reader/checksum.rs (resolved)

xmakro (Contributor, Author) commented Sep 28, 2024

Thanks for the reviews! I applied the comments and merged master, PTAL

alamb (Contributor) commented Sep 28, 2024

🚀

alamb merged commit 8e0aaad into apache:master on Sep 28, 2024
17 checks passed

alamb (Contributor) commented Sep 28, 2024

Thanks everyone

Labels: parquet (Changes to the parquet crate)
Linked issue: Optionally verify 32-bit CRC checksum when decoding parquet pages
7 participants