Skip to content

Commit

Permalink
[BugFix] Fix arrow parquet exception handling (#25621)
Browse files Browse the repository at this point in the history
Fixes #23606, where memory leaks are introduced by
`_chunk_writer->release()`

The modification on Arrow is to make sure
`RowGroupSerializer::column_writers_` is cleaned up even if exception
throws.

You may refer apache/arrow#35520 for more
details.

Signed-off-by: Letian Jiang <[email protected]>
  • Loading branch information
letian-jiang authored Jun 20, 2023
1 parent c1a3351 commit f610fdf
Show file tree
Hide file tree
Showing 3 changed files with 28 additions and 5 deletions.
4 changes: 0 additions & 4 deletions be/src/formats/parquet/file_writer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -377,11 +377,7 @@ Status SyncFileWriter::_flush_row_group() {
_chunk_writer->close();
} catch (const ::parquet::ParquetStatusException& e) {
_chunk_writer.reset();

// this is to avoid calling ParquetFileWriter.Close which incurs segfault
_closed = true;
_writer.release();

auto st = Status::IOError(fmt::format("{}: {}", "flush rowgroup error", e.what()));
LOG(WARNING) << st;
return st;
Expand Down
8 changes: 7 additions & 1 deletion thirdparty/download-thirdparty.sh
Original file line number Diff line number Diff line change
Expand Up @@ -477,13 +477,19 @@ fi
echo "Finished patching $SERDES_SOURCE"
cd -

# patch arrows to use our built jemalloc
# patch arrows
if [[ -d $TP_SOURCE_DIR/$ARROW_SOURCE ]] ; then
cd $TP_SOURCE_DIR/$ARROW_SOURCE
# use our built jemalloc
if [ ! -f $PATCHED_MARK ] && [ $ARROW_SOURCE = "arrow-apache-arrow-5.0.0" ] ; then
patch -p1 < $TP_PATCH_DIR/arrow-5.0.0-force-use-external-jemalloc.patch
touch $PATCHED_MARK
fi
# fix arrow parquet exception handling
if [ ! -f $PATCHED_MARK ] && [ $ARROW_SOURCE = "arrow-apache-arrow-5.0.0" ] ; then
patch -p1 < $TP_PATCH_DIR/arrow-5.0.0-fix-exception-handling.patch
touch $PATCHED_MARK
fi
cd -
echo "Finished patching $ARROW_SOURCE"
fi
21 changes: 21 additions & 0 deletions thirdparty/patches/arrow-5.0.0-fix-exception-handling.patch
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
diff --git a/cpp/src/parquet/file_writer.cc b/cpp/src/parquet/file_writer.cc
index deac9586e..836f9e93c 100644
--- a/cpp/src/parquet/file_writer.cc
+++ b/cpp/src/parquet/file_writer.cc
@@ -181,15 +181,13 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
closed_ = true;
CheckRowsWritten();

+ auto column_writers = std::move(column_writers_);
for (size_t i = 0; i < column_writers_.size(); i++) {
if (column_writers_[i]) {
total_bytes_written_ += column_writers_[i]->Close();
- column_writers_[i].reset();
}
}

- column_writers_.clear();
-
// Ensures all columns have been written
metadata_->set_num_rows(num_rows_);
metadata_->Finish(total_bytes_written_, row_group_ordinal_);

0 comments on commit f610fdf

Please sign in to comment.