I'm using deltalake version 0.19.1 and trying to get the parquet files in my Delta Lake table to have a large number of row groups. I tried setting min_rows_per_group = 10000 and max_rows_per_group = 100000 for a 1000000-row table, but I get a single row group.
Specifically I ran:
import os

import polars as pl
import pandas as pd
from deltalake import DeltaTable, write_deltalake
from pyarrow.parquet import read_metadata

nr = 1000000
df = pl.DataFrame({
    'P': ['X'] * nr,
    'A': [f'abc_{i}' for i in range(nr)],
    'B': [f'def_{i}' for i in range(nr)],
})

write_deltalake(
    'data/row_groups',
    df.to_arrow(),
    partition_by='P',
    mode='overwrite',
    min_rows_per_group=nr // 100,
    max_rows_per_group=nr // 10,
    engine='rust',
)

dt = DeltaTable('data/row_groups')
pq_file = os.path.join('data/row_groups/', dt.get_add_actions(flatten=True).to_pandas()['path'].values[0])
read_metadata(pq_file)
which shows:
<pyarrow._parquet.FileMetaData object at 0x7f04c8b3a390>
created_by: parquet-rs version 52.2.0
num_columns: 2
num_rows: 1000000
num_row_groups: 1
format_version: 1.0
serialized_size: 504
I expected the min_rows_per_group/max_rows_per_group settings to be respected at the level of the parquet file.
Am I misunderstanding what those settings are meant for? Thank you,
Matt
@vincenzon these settings are for the pyarrow engine. For the rust engine, which is now the default, you should use the WriterProperties class to set max_row_group_size.
@ion-elgreco I see, thanks. I tried setting max_row_group_size to a high number and the resulting parquet file still has two row groups. While not wrong, it does not have the effect I want, which is a large number of row groups. Is there a plan to add a min_row_group_size?
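For anyone landing here, a minimal sketch of the WriterProperties suggestion, assuming deltalake 0.19.x with the rust engine as the default. Since max_row_group_size is only an upper bound on rows per group, the way to push toward many row groups would be to set it low rather than high. The helper names `expected_min_row_groups` and `write_with_row_group_cap` are made up for illustration, and the row-group count is a lower bound derived from that cap, not something confirmed in this thread.

```python
import os


def expected_min_row_groups(n_rows: int, max_row_group_size: int) -> int:
    """Lower bound on row groups in one file, assuming each group
    holds at most max_row_group_size rows (integer ceiling division)."""
    return (n_rows + max_row_group_size - 1) // max_row_group_size


def write_with_row_group_cap(path: str, n_rows: int, max_row_group_size: int) -> int:
    """Write a single-partition table with a row-group cap and return
    num_row_groups of the first data file (assumes deltalake 0.19.x)."""
    # Third-party imports are local so the helper above works without them.
    import pyarrow as pa
    from deltalake import DeltaTable, WriterProperties, write_deltalake
    from pyarrow.parquet import read_metadata

    table = pa.table({
        "P": ["X"] * n_rows,
        "A": [f"abc_{i}" for i in range(n_rows)],
    })
    write_deltalake(
        path,
        table,
        partition_by=["P"],
        mode="overwrite",
        # The rust engine takes parquet settings through WriterProperties,
        # not through min_rows_per_group / max_rows_per_group.
        writer_properties=WriterProperties(max_row_group_size=max_row_group_size),
    )
    dt = DeltaTable(path)
    # files() returns paths relative to the table root in 0.19.x.
    first_file = os.path.join(path, dt.files()[0])
    return read_metadata(first_file).num_row_groups
```

If the cap is applied per file, calling `write_with_row_group_cap('data/row_groups', 1_000_000, 10_000)` should report at least `expected_min_row_groups(1_000_000, 10_000)` row groups. I have not checked this against every release, so treat the comparison as a check to run rather than a guarantee.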