-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable parquet page level skipping (page index pruning) by default #5099
Conversation
680aaa5
to
cb221af
Compare
e1ce71b
to
294e121
Compare
All that is needed now for this PR I think is some benchmark results |
294e121
to
c8bda4a
Compare
TLDR looks like this feature makes Q7 and Q16 slower on TPCH benchmarks I think we need to review this more To test I used
And ran the tpch queries or both SF1 and SF10 (1GB and 10GB against parquet datasets) on a google cloud machine: cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ~/tpch_data/parquet_data_SF1 --format parquet -o ~/enable_page_index My results are as follows
|
I will try to fix this week 😄 hope find the bottleneck. |
That would be amazing @Ted-Jiang - thank you so much. I would love to help out too -- this one has been on my list for a long long time. |
I test in my local M1, seems there is no regression in q16 (which page index prune none data). Set
Form the From the plan
I found there is one place need improvement, which read So i want to improve this and retest this in cloud env, @alamb Is this reasonable 🤔 PATL New Edit, i found even in cloud env, your test data file still on local disk 🤦 , but i think this still need improve |
@Ted-Jiang so is it your analysis that the difference negligible? I updated this branch to the latest main and I'll rerun the benchmarks and see what they show |
So I reran the tpch (against parquet at scale factor 1) benchmarks and things on this branch now look good (a 10% boost):
|
Yes, it is negligible and still need a improvement of #6315 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb verify this 👍
@@ -49,7 +49,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus | |||
| datafusion.execution.collect_statistics | false | Should DataFusion collect statistics after listing files | | |||
| datafusion.execution.target_partitions | 0 | Number of partitions for query execution. Increasing partitions can increase concurrency. Defaults to the number of CPU cores on the system | | |||
| datafusion.execution.time_zone | +00:00 | The default time zone Some functions, e.g. `EXTRACT(HOUR from SOME_TIME)`, shift the underlying datetime according to this time zone, and then extract the hour | | |||
| datafusion.execution.parquet.enable_page_index | false | If true, uses parquet data page level metadata (Page Index) statistics to reduce the number of rows decoded. | | |||
| datafusion.execution.parquet.enable_page_index | true | If true, uses parquet data page level metadata (Page Index) statistics to reduce the number of rows decoded. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we also skip IO when the whole page is skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will refine the description as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is time -- thanks a lot @Ted-Jiang ! |
Thanks for your advancing events too 👍 |
Draft until:Which issue does this PR close?
Closes #4085
Related to #6287
Rationale for this change
We have several performance features that help querying parquet that are not enabled by default. This not only decreases our test coverage of those features, it also means users of DataFusion do not see the benefits of using them.
This feature is awesome and we should have it on as it improves performance -- see https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
What changes are included in this PR?
Changes:
Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
Hopefully faster queries on parquet
benchmark results:
See below