datafusion.optimizer.repartition_file_scans enabled by default #5295
I think it is a good idea. Thank you @korowa
What do you think @tustvold @Dandandan @andygrove -- any concerns about turning on automatic repartitioned file scans (which allows scanning a single large parquet file in parallel)?
This should help mitigate the "low-performance" impression -- many people will not dig deep into configuration options and will simply try things out with the OOTB defaults.
I plan to merge this sometime over the weekend unless anyone else would like time to comment or offer more thoughts.
Looks great to me.
Benchmark runs are scheduled for baseline = cfbb14d and contender = 222205d. 222205d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #5125.
Rationale for this change
I guess it's fine to enable repartitioning by default in 19.0.0 (or, more likely, in the first release candidate for 19.0.0).
What changes are included in this PR?
The default value of datafusion.optimizer.repartition_file_scans is now true.
Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
Repartitioning of file scans will be enabled by default
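
For readers who have not looked at what a repartitioned file scan does, the idea mentioned in the conversation (scanning a single large file in parallel) can be sketched roughly as splitting one file into contiguous byte ranges, one per target partition. The snippet below is only an illustration under stated assumptions: `FileRange`, `split_file`, and the `min_size` threshold are hypothetical names, not DataFusion's API, and the real optimizer presumably also has to map each range onto Parquet row groups, which this sketch ignores.

```rust
/// A half-open byte range `[start, end)` of one file, assigned to one partition.
/// Hypothetical type for illustration; not a DataFusion type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileRange {
    pub start: u64,
    pub end: u64,
}

/// Split a file of `file_size` bytes into at most `target_partitions`
/// contiguous ranges of roughly equal size.
///
/// Files smaller than `min_size` (an assumed threshold) are kept as a single
/// range, since splitting tiny files is unlikely to pay off.
pub fn split_file(file_size: u64, target_partitions: usize, min_size: u64) -> Vec<FileRange> {
    // An empty file yields no work at all.
    if file_size == 0 {
        return Vec::new();
    }
    // Nothing to split: one partition requested, or the file is too small.
    if target_partitions <= 1 || file_size < min_size {
        return vec![FileRange { start: 0, end: file_size }];
    }

    let n = target_partitions as u64;
    // Ceiling division so the ranges cover the whole file.
    let chunk = (file_size + n - 1) / n;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        // The last range may be shorter than `chunk`.
        let end = (start + chunk).min(file_size);
        ranges.push(FileRange { start, end });
        start = end;
    }
    ranges
}
```

With these assumptions, `split_file(100, 4, 10)` produces four ranges of 25 bytes each, while `split_file(5, 4, 10)` leaves the file as a single range because it falls below the threshold.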