Limit max row group size when writing Parquet files
The Parquet writer keeps a whole row group buffered in memory before writing it
out to the output stream, and a row group is ~1M rows by default. Limit the row
group size to 65536 rows to mitigate this.
mildbyte committed Aug 28, 2022
1 parent 4b89565 commit c865d20
Showing 1 changed file with 9 additions and 1 deletion.

src/context.rs:
@@ -82,6 +82,12 @@ use crate::{
 // with DataFusion's object store registry.
 pub const INTERNAL_OBJECT_STORE_SCHEME: &str = "seafowl";
 
+// Max Parquet row group size, in rows. This is what the ArrowWriter uses to determine how many
+// rows to buffer in memory before flushing them out to disk. The default for this is 1024^2, which
+// means that we're effectively buffering a whole partition in memory, causing issues on RAM-limited
+// environments.
+const MAX_ROW_GROUP_SIZE: usize = 65536;
+
 pub fn internal_object_store_url() -> ObjectStoreUrl {
     ObjectStoreUrl::parse(format!("{}://", INTERNAL_OBJECT_STORE_SCHEME)).unwrap()
 }
@@ -210,7 +216,9 @@ fn temp_partition_file_writer(
         DataFusionError::Execution("Error with temporary Parquet file".to_string())
     })?;
 
-    let writer_properties = WriterProperties::builder().build();
+    let writer_properties = WriterProperties::builder()
+        .set_max_row_group_size(MAX_ROW_GROUP_SIZE)
+        .build();
     let writer =
         ArrowWriter::try_new(partition_file, arrow_schema, Some(writer_properties))?;
     Ok((partition_file_handle, writer))
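
For illustration (not part of the commit), here is a self-contained sketch of the
same WriterProperties knob, using the arrow and parquet crates directly. The
schema, file path, and batch sizes are hypothetical; the point is that with
set_max_row_group_size(65536), writing 1024^2 rows yields 16 row groups of 65536
rows each, so the writer never holds more than one capped row group in memory.

// Minimal sketch, assuming the arrow and parquet crates; names here are
// illustrative and not taken from Seafowl.
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::reader::{FileReader, SerializedFileReader};

const MAX_ROW_GROUP_SIZE: usize = 65536;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single-column schema, purely for demonstration.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

    // Cap each row group at 65536 rows, as the commit does.
    let props = WriterProperties::builder()
        .set_max_row_group_size(MAX_ROW_GROUP_SIZE)
        .build();

    // Hypothetical output path.
    let file = File::create("/tmp/row_groups.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Write 1024^2 rows in 8192-row batches. The writer flushes a row group
    // every 65536 rows instead of buffering all ~1M rows.
    for _ in 0..128 {
        let values = Int64Array::from_iter_values(0..8192i64);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)])?;
        writer.write(&batch)?;
    }
    writer.close()?;

    // 128 * 8192 = 1_048_576 rows -> expect 16 row groups of 65536 rows.
    let reader = SerializedFileReader::new(File::open("/tmp/row_groups.parquet")?)?;
    println!("row groups: {}", reader.metadata().num_row_groups());
    Ok(())
}

Smaller row groups trade some compression and scan efficiency for a bounded
memory footprint, which is the trade-off the commit makes for RAM-limited
environments.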
