feat(python): Set smaller defaults on row group and file size #818

wjones127 · 2022-09-16T02:26:31Z

Description

Arrow defaults max_rows_per_group to 1024 * 1024, which in many cases leaves a file with only one row group and limits parallelism in later reads. So I set the max to 128 * 1024, close to DuckDB's default (100,000, I think).

I also set a cap of 10 million rows per file, so that large writes are split across files as well. This helps other readers like Spark, which IIRC uses files, not row groups, as its unit of parallelism.
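To make the effect of the two limits concrete, here is a rough sketch of the arithmetic. It is a simplified model, not the writer's actual code: it assumes files are filled up to the row cap and each file is then cut into row groups of at most the group cap. The helper `count_files_and_groups` and the constant names are hypothetical and only serve the illustration.

```python
import math

# Values taken from the description above.
ARROW_MAX_ROWS_PER_GROUP = 1024 * 1024  # Arrow default
NEW_MAX_ROWS_PER_GROUP = 128 * 1024     # default set in this PR
NEW_MAX_ROWS_PER_FILE = 10_000_000      # 10 million row cap set in this PR


def count_files_and_groups(total_rows, max_rows_per_group, max_rows_per_file=None):
    """Return (files, row_groups) under a simplified splitting model.

    Hypothetical helper: assumes rows fill each file up to max_rows_per_file
    (None means no cap) and each file is split into row groups of at most
    max_rows_per_group rows.
    """
    if total_rows == 0:
        return (0, 0)
    if max_rows_per_file is None:
        file_sizes = [total_rows]
    else:
        full_files, remainder = divmod(total_rows, max_rows_per_file)
        file_sizes = [max_rows_per_file] * full_files
        if remainder:
            file_sizes.append(remainder)
    groups = sum(math.ceil(rows / max_rows_per_group) for rows in file_sizes)
    return (len(file_sizes), groups)


# 1 million rows: one row group with the Arrow default, eight with the new one.
print(count_files_and_groups(1_000_000, ARROW_MAX_ROWS_PER_GROUP))
print(count_files_and_groups(1_000_000, NEW_MAX_ROWS_PER_GROUP, NEW_MAX_ROWS_PER_FILE))

# 25 million rows: the row cap splits the write into three files.
print(count_files_and_groups(25_000_000, NEW_MAX_ROWS_PER_GROUP, NEW_MAX_ROWS_PER_FILE))
```

Under this model a 1 million row write goes from a single row group to several, and anything above 10 million rows lands in more than one file, which is what gives file-level readers something to parallelize over.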

Related Issue(s)

Closes #816 (delta write should generate a sensible max_rows_per_group defaults)

Documentation

Mytherin/duckdb@3b8ad03#diff-a95d5e017c81184e18f0f04c5df3b72061fd80555d581a4ce163af5deca3dac0R394

roeap

Eventually we should probably consider the targetFileSize setting on the Delta table, if it exists, but I think we need to go through the settings handling once more before doing this.

wjones127 added 2 commits September 15, 2022 19:22

Set smaller defaults on row group and file size

2332720

fix test

5f4e695

roeap approved these changes Sep 16, 2022


fvaleye approved these changes Sep 16, 2022


wjones127 merged commit 7188ebc into delta-io:main Sep 16, 2022

wjones127 deleted the limit-size branch September 16, 2022 14:31

wjones127 mentioned this pull request Feb 21, 2023

GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi apache/arrow#34281

Merged
