-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36882: [C++][Parquet] Use RLE as BOOLEAN default encoding when both data page and version is V2 #38163
Conversation
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met | |||
Encoding::type encoding = properties->encoding(descr->path()); | |||
if (encoding == Encoding::UNKNOWN) { | |||
encoding = (descr->physical_type() == Type::BOOLEAN && | |||
properties->version() != ParquetVersion::PARQUET_1_0) | |||
properties->version() != ParquetVersion::PARQUET_1_0 && | |||
properties->data_page_version() == ParquetDataPageVersion::V2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, if the data page version is v2, it is user's responsibility to set parquet version to PARQUET_2_X. Unfortunately, this is not enforced yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we prevent from this or just keep the behavior here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it separately.
@mapleFU There are CI failures. |
@pitrou @jorisvandenbossche @wgtmac Would you mind take a look? Would this ok? Or just write RLE as default on V1 data page? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Can you update the PR description to explain the rationale for this change? |
this->SetUpSchema(Repetition::REQUIRED); | ||
auto writer = this->BuildWriter(SMALL_SIZE, ColumnProperties(), version, | ||
ParquetDataPageVersion::V1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not data_page_version
?
TestWithEncoding(ParquetVersion::PARQUET_2_4, ParquetDataPageVersion::V1, | ||
Encoding::PLAIN); | ||
TestWithEncoding(ParquetVersion::PARQUET_2_4, ParquetDataPageVersion::V2, | ||
Encoding::RLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, RLE
is always in the encodings regardless, since it's used to RL/DL, right?
I've resolve the comment and update the description @pitrou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @mapleFU
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 64c7cae. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Only use RLE as BOOLEAN default encoding when data page is V2.
Previous patch ( #36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:
What changes are included in this PR?
Only use RLE as BOOLEAN default encoding when both data page and version is V2.
Are these changes tested?
Yes
Are there any user-facing changes?
RLE encoding change for Boolean.