feat(python): Set smaller defaults on row group and file size #818

Merged on Sep 16, 2022 (2 commits)

wjones127
Collaborator

Description

Arrow defaults max_rows_per_group to 1024 * 1024, but in many cases this leads to files having only one row group, which limits parallelism in future reads. So I set the max to 128 * 1024, close to what DuckDB defaults to (100,000 I think).

I also set a 10 million row cap on file size, so that we also split across files. This is helpful for other readers like Spark, which IIRC uses files, not row groups, as its unit of parallelism.
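
To illustrate the effect of the two caps, here is a rough sketch of the resulting layout. It is not the code in this PR: `plan_layout` and the constant names are hypothetical, and it assumes the writer fills every file and row group up to its cap, which a real writer may not do exactly (it depends on incoming batch sizes).

```python
import math

# Values quoted in the description above; the constant names are illustrative.
ARROW_DEFAULT_MAX_ROWS_PER_GROUP = 1024 * 1024
NEW_MAX_ROWS_PER_GROUP = 128 * 1024
NEW_MAX_ROWS_PER_FILE = 10_000_000  # "10 million row cap" as stated in the text


def plan_layout(total_rows, max_rows_per_group, max_rows_per_file=None):
    """Return the number of row groups in each output file.

    Simplifying assumption: files and row groups are filled greedily
    up to their caps. max_rows_per_file=None means no file-level cap.
    """
    row_groups_per_file = []
    remaining = total_rows
    while remaining > 0:
        rows_in_file = remaining if max_rows_per_file is None else min(remaining, max_rows_per_file)
        row_groups_per_file.append(math.ceil(rows_in_file / max_rows_per_group))
        remaining -= rows_in_file
    return row_groups_per_file


# One million rows: a single row group under the old default, several under the new one.
old_layout = plan_layout(1_000_000, ARROW_DEFAULT_MAX_ROWS_PER_GROUP)
new_layout = plan_layout(1_000_000, NEW_MAX_ROWS_PER_GROUP, NEW_MAX_ROWS_PER_FILE)

# 25 million rows: the file cap now splits the write across three files.
big_layout = plan_layout(25_000_000, NEW_MAX_ROWS_PER_GROUP, NEW_MAX_ROWS_PER_FILE)

print(old_layout, new_layout, big_layout)
```

Under these assumptions, a reader that parallelizes over row groups gets more units of work from the same data, and a reader that parallelizes over files gets more than one file once a write exceeds the file cap.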

Related Issue(s)

Documentation

Mytherin/duckdb@3b8ad03#diff-a95d5e017c81184e18f0f04c5df3b72061fd80555d581a4ce163af5deca3dac0R394

Collaborator

roeap left a comment

Eventually we should probably try and consider the targetFileSize setting on the delta table if it exists, but I think we need to go through the settings handling once more before doing this.

Successfully merging this pull request may close these issues.

delta write should generate a sensible max_rows_per_group defaults
3 participants