feat(python): Set smaller defaults on row group and file size #818
Description
Arrow defaults `max_rows_per_group` to `1024 * 1024`, but in many instances this will lead to files only having one row group, limiting parallelism in future reads. So I set the max to `128 * 1024`, close to what DuckDB defaults to (100,000 I think).

I also set a 10 million row cap on file size, so that we also split across files. This is helpful for other readers like Spark, which IIRC use files, not row groups, as their level of parallelism.
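
To make the effect of the two caps concrete, here is a rough sketch of the arithmetic. It is not the writer's actual code: `plan_layout` is a hypothetical helper, it assumes rows are packed greedily into files and then into row groups, and it takes the "10 million" cap as exactly 10,000,000 rows. The real writer may place group boundaries differently depending on the size of incoming batches.

```python
from typing import List, Optional

OLD_MAX_ROWS_PER_GROUP = 1024 * 1024   # Arrow default cited above
NEW_MAX_ROWS_PER_GROUP = 128 * 1024    # value proposed in this PR
NEW_MAX_ROWS_PER_FILE = 10_000_000     # assumption: "10 million" taken literally


def plan_layout(
    total_rows: int,
    max_rows_per_file: Optional[int],
    max_rows_per_group: int,
) -> List[List[int]]:
    """Return one list of row-group sizes per output file (hypothetical helper).

    Assumes greedy packing: fill a file up to its cap, and within each file
    fill row groups up to their cap. `None` means no per-file cap.
    """
    files: List[List[int]] = []
    remaining = total_rows
    while remaining > 0:
        file_rows = remaining if max_rows_per_file is None else min(remaining, max_rows_per_file)
        groups: List[int] = []
        left = file_rows
        while left > 0:
            group_rows = min(left, max_rows_per_group)
            groups.append(group_rows)
            left -= group_rows
        files.append(groups)
        remaining -= file_rows
    return files


# A 1M-row write with the old group cap and no file cap: one file, one row group.
old = plan_layout(1_000_000, None, OLD_MAX_ROWS_PER_GROUP)

# The same write with the smaller group cap: one file, several row groups.
new_small = plan_layout(1_000_000, NEW_MAX_ROWS_PER_FILE, NEW_MAX_ROWS_PER_GROUP)

# A 25M-row write with both caps: rows are split across files as well.
new_large = plan_layout(25_000_000, NEW_MAX_ROWS_PER_FILE, NEW_MAX_ROWS_PER_GROUP)

print(len(old), [len(f) for f in old])
print(len(new_small), [len(f) for f in new_small])
print(len(new_large), [len(f) for f in new_large])
```

Under these assumptions, a reader that parallelizes over row groups gets more than one unit of work even for a 1M-row table, and a reader that parallelizes over files gets more than one file once a write exceeds the file cap.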
Related Issue(s)
Documentation
Mytherin/duckdb@3b8ad03#diff-a95d5e017c81184e18f0f04c5df3b72061fd80555d581a4ce163af5deca3dac0R394