Skip to content

Commit

Permalink
fix(python): skip empty row groups during stats gathering (#2172)
Browse files Browse the repository at this point in the history
# Description
For some odd reason the pyarrow parquet writer will leave empty row
groups in the parquet file when it hits the max_row limit that's passed.
While grabbing the stats we were checking if all row_groups were having
stats added to them but these empty row groups had no stats so it causes
the whole file add action to get no stats recorded.

We now skip empty row groups while gathering the stats to prevent this.

In v0.15.2 we now also evaluate files with no stats mentioned as null
@roeap @rtyler not sure if this is entirely correct as well

# Related Issue(s)
- closes #2169

---------

Co-authored-by: Will Jones <[email protected]>
  • Loading branch information
ion-elgreco and wjones127 authored Feb 6, 2024
1 parent 40492fe commit 7b668aa
Show file tree
Hide file tree
Showing 2 changed files with 24 additions and 1 deletion.
3 changes: 2 additions & 1 deletion python/deltalake/writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -666,7 +666,8 @@ def get_file_stats_from_metadata(

def iter_groups(metadata: Any) -> Iterator[Any]:
for i in range(metadata.num_row_groups):
yield metadata.row_group(i)
if metadata.row_group(i).num_rows > 0:
yield metadata.row_group(i)

for column_idx in range(metadata.num_columns):
name = metadata.row_group(0).column(column_idx).path_in_schema
Expand Down
22 changes: 22 additions & 0 deletions python/tests/test_writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -1251,3 +1251,25 @@ def test_with_deltalake_schema(tmp_path: pathlib.Path, sample_data: pa.Table):
)
delta_table = DeltaTable(tmp_path)
assert delta_table.schema().to_pyarrow() == sample_data.schema


def test_write_stats_empty_rowgroups(tmp_path: pathlib.Path):
# https://github.com/delta-io/delta-rs/issues/2169
data = pa.table(
{
"data": pa.array(["B"] * 1024 * 33),
}
)
write_deltalake(
tmp_path,
data,
max_rows_per_file=1024 * 32,
max_rows_per_group=1024 * 16,
min_rows_per_group=8 * 1024,
mode="overwrite",
)
dt = DeltaTable(tmp_path)
assert (
dt.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape[0]
== 33792
)

0 comments on commit 7b668aa

Please sign in to comment.