-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Should we prevent from writing empty rowgroup in WriteRecordBatch
#39210
Comments
The fixing can be as simple as: diff --git a/cpp/src/parquet/arrow/writer.cc b/cpp/src/parquet/arrow/writer.cc
index 07c627d5e..dd0f5be07 100644
--- a/cpp/src/parquet/arrow/writer.cc
+++ b/cpp/src/parquet/arrow/writer.cc
@@ -462,7 +462,7 @@ class FileWriterImpl : public FileWriter {
offset += batch_size;
// Flush current row group if it is full.
- if (row_group_writer_->num_rows() >= max_row_group_length) {
+ if (row_group_writer_->num_rows() >= max_row_group_length && offset < batch.num_rows()) {
RETURN_NOT_OK(NewBufferedRowGroup());
}
} |
pitrou
added a commit
that referenced
this issue
Dec 13, 2023
…ed RowGroup (#39211) ### Rationale for this change `WriteRecordBatch` might produce zero-sized row-group, which is mentioned in #39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup. ### What changes are included in this PR? adding a check for zero-sized row-group ### Are these changes tested? Yes ### Are there any user-facing changes? no * Closes: #39210 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
clayburn
pushed a commit
to clayburn/arrow
that referenced
this issue
Jan 23, 2024
…ro-sized RowGroup (apache#39211) ### Rationale for this change `WriteRecordBatch` might produce zero-sized row-group, which is mentioned in apache#39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup. ### What changes are included in this PR? adding a check for zero-sized row-group ### Are these changes tested? Yes ### Are there any user-facing changes? no * Closes: apache#39210 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss
pushed a commit
to dgreiss/arrow
that referenced
this issue
Feb 19, 2024
…ro-sized RowGroup (apache#39211) ### Rationale for this change `WriteRecordBatch` might produce zero-sized row-group, which is mentioned in apache#39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup. ### What changes are included in this PR? adding a check for zero-sized row-group ### Are these changes tested? Yes ### Are there any user-facing changes? no * Closes: apache#39210 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the enhancement requested
Arrow Parquet Writer uses
AppendRowGroup
to produce the RowGroup during writing a Parquet file. The code path is listed inFileWriterImpl::WriteRecordBatch
as below:Assume the
max_row_group_length == k
, if inputrecordBatch.num_rows() % k == 0
, we would append an empty row-group.The behavior of empty row-group is a bit tricky. It looks like below. The empty row-group only have metadata, and data does not exists.
Although it's hard to prevent all cases from "empty row-group", maybe we can prevent the case for the last write to the File
Component(s)
C++, Parquet
The text was updated successfully, but these errors were encountered: