min/max_row_groups not respected #2814

Closed
vincenzon opened this issue Aug 22, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@vincenzon

I'm using deltalake version 0.19.1 and want the parquet files in my Delta table to have a large number of row groups. I tried setting min_rows_per_group = 10000 and max_rows_per_group = 100000 for a 1000000 row table, but I get a single row group.

Specifically I ran:

import os
import polars as pl
import pandas as pd
from deltalake import DeltaTable, write_deltalake
from pyarrow.parquet import read_metadata

nr = 1000000
df = pl.DataFrame({
    'P': ['X'] * nr,
    'A': [f'abc_{i}' for i in range(nr)],
    'B': [f'def_{i}' for i in range(nr)]
})

write_deltalake('data/row_groups',
                df.to_arrow(),
                partition_by='P',
                mode='overwrite',
                min_rows_per_group = nr // 100,
                max_rows_per_group = nr // 10,
                engine = 'rust'
                )

dt = DeltaTable('data/row_groups')

pq_file = os.path.join('data/row_groups/', dt.get_add_actions(flatten=True).to_pandas()['path'].values[0])

read_metadata(pq_file)

which shows:

<pyarrow._parquet.FileMetaData object at 0x7f04c8b3a390>
  created_by: parquet-rs version 52.2.0
  num_columns: 2
  num_rows: 1000000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 504

I expected the min_rows_per_group/max_rows_per_group settings to be respected at the level of the parquet file.

Am I misunderstanding what those settings are meant for? Thank you,

Matt

vincenzon added the bug label Aug 22, 2024

@ion-elgreco
Collaborator

@vincenzon these settings are for the pyarrow engine. For the rust engine, which is now the default, you should use the WriterProperties class to set max_row_group_size.
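
For readers following along, here is a minimal sketch of that suggestion, assuming deltalake 0.19.x, where write_deltalake accepts a writer_properties argument and WriterProperties takes max_row_group_size. The helper names (min_row_groups, write_with_row_group_cap, row_group_sizes) are hypothetical and only for illustration. Since max_row_group_size is an upper bound on rows per group, a smaller value is what forces more row groups; the first helper computes the resulting lower bound on the group count for one file.

```python
def min_row_groups(num_rows: int, max_row_group_size: int) -> int:
    """Lower bound on the row groups in ONE parquet file, given that
    every group holds at most max_row_group_size rows."""
    if num_rows < 0 or max_row_group_size <= 0:
        raise ValueError("num_rows must be >= 0 and max_row_group_size > 0")
    # Integer ceiling division, avoids float rounding on large counts.
    return -(-num_rows // max_row_group_size)


def write_with_row_group_cap(table_uri, data, max_row_group_size, **write_kwargs):
    """Write `data` with the rust engine's parquet writer capped at
    max_row_group_size rows per group (hypothetical wrapper)."""
    # Imported lazily so min_row_groups works without deltalake installed.
    from deltalake import WriterProperties, write_deltalake

    props = WriterProperties(max_row_group_size=max_row_group_size)
    # Extra keyword arguments (mode, partition_by, ...) are passed through.
    write_deltalake(table_uri, data, writer_properties=props, **write_kwargs)


def row_group_sizes(parquet_path):
    """Return the row count of each row group in a parquet file."""
    from pyarrow.parquet import read_metadata

    md = read_metadata(parquet_path)
    return [md.row_group(i).num_rows for i in range(md.num_row_groups)]
```

With the 1000000 row table from the original post, write_with_row_group_cap('data/row_groups', df.to_arrow(), 10_000, mode='overwrite', partition_by='P', engine='rust') should leave at least min_row_groups(1_000_000, 10_000) = 100 groups in a file that holds all the rows; row_group_sizes(pq_file) can be used to check what was actually written. The exact count may be higher depending on how the writer flushes batches.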

@vincenzon
Author

@ion-elgreco I see, thanks. I tried setting max_row_group_size to a high number and the resulting parquet file still has two row groups. While not wrong, that is not the effect I want, which is to have a large number of groups. Is there a plan to add a min_row_group_size?

Thank you,

Matt

@ion-elgreco
Collaborator

@vincenzon min_row_group_size needs to be implemented by the parquet-rs crate, so that's out of our control.

@vincenzon
Author

OK, thanks for the info.

Matt
