[Epic] A collection of issues to improve planning performance / speed / efficiency #5637
Comments
I'd also like to consider replacing the list in DFSchema with case_insensitive_hashmap or something similar, in order to get a value with O(1) complexity instead of O(N). As I understand it, the complexity is now O(N^2) due to two loops of iteration ( |
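To make the cost difference in that suggestion concrete, here is a minimal sketch using only the standard library. `ListSchema` and `IndexedSchema` are hypothetical stand-ins, not DataFusion's `DFSchema`, and the case-insensitive matching is assumed to be ASCII-only. Resolving N columns against the list version costs O(N) per lookup, which is where an O(N^2) total would come from.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema: an ordered list of field names.
/// This is NOT DataFusion's `DFSchema`; it only illustrates the lookup cost.
pub struct ListSchema {
    fields: Vec<String>,
}

impl ListSchema {
    pub fn new(fields: &[&str]) -> Self {
        Self {
            fields: fields.iter().map(|f| f.to_string()).collect(),
        }
    }

    /// O(N) per lookup: scans the fields until a case-insensitive match is found.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.fields.iter().position(|f| f.eq_ignore_ascii_case(name))
    }
}

/// The same fields plus a map from lower-cased name to position.
pub struct IndexedSchema {
    fields: Vec<String>,
    by_name: HashMap<String, usize>,
}

impl IndexedSchema {
    pub fn new(fields: &[&str]) -> Self {
        let fields: Vec<String> = fields.iter().map(|f| f.to_string()).collect();
        let mut by_name = HashMap::with_capacity(fields.len());
        for (i, f) in fields.iter().enumerate() {
            // Keep the first occurrence so results match the linear scan.
            by_name.entry(f.to_ascii_lowercase()).or_insert(i);
        }
        Self { fields, by_name }
    }

    /// O(1) expected per lookup (plus the cost of lower-casing the key).
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.by_name.get(&name.to_ascii_lowercase()).copied()
    }

    /// Positional access still goes through the ordered list.
    pub fn field(&self, i: usize) -> Option<&str> {
        self.fields.get(i).map(|s| s.as_str())
    }
}
```

The trade-off is extra memory and construction time for the map, so this only pays off when a schema is looked up by name more than a handful of times.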
Yes, I think there is a lot of room for improvement (though we need to be careful about taking on crate dependencies that might not have a good long term maintenance story) |
Here are some other recent discussions about how to improve planning speed: |
An update here. Thanks to a bunch of work by @haohuaijin @matthewmturner @jayzhan211 @peter-toth @jackwener and myself, the planning speed on 38.0.0 is looking to be quite a bit better: 20%-700% better in many cases. I am fairly confident there is still another factor of 2 to be had by completing #9637, which I expect to finish over the next few weeks.
I compared Comparison
|
@alamb Hi, amazing work has been done! It has become much faster. But it seems that the complexity of the algorithms is still O(n^2) |
I agree there are still places that are N^2 in the number of columns. With @haohuaijin's great work in #9595 I think adding an index (perhaps computed on demand) to It would be great if someone wanted to give that a try |
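One way to read "an index (perhaps computed on demand)" is a map that is built on the first name lookup and reused afterwards. The type the index would be added to is missing from the comment above, so the sketch below uses a hypothetical `LazySchema`; the field layout and method names are assumptions, and `std::sync::OnceLock` (Rust 1.70+) is just one possible way to do the lazy initialization.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical schema type; the layout is an assumption for illustration.
pub struct LazySchema {
    fields: Vec<String>,
    /// Built on the first lookup; never built if nobody looks up by name.
    index: OnceLock<HashMap<String, usize>>,
}

impl LazySchema {
    pub fn new(fields: Vec<String>) -> Self {
        Self {
            fields,
            index: OnceLock::new(),
        }
    }

    /// True once the index has been built.
    pub fn index_is_built(&self) -> bool {
        self.index.get().is_some()
    }

    /// The first call pays O(N) to build the map; later calls are O(1) expected.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        let index = self.index.get_or_init(|| {
            let mut map = HashMap::with_capacity(self.fields.len());
            for (i, f) in self.fields.iter().enumerate() {
                // Keep the first occurrence of a duplicated name.
                map.entry(f.clone()).or_insert(i);
            }
            map
        });
        index.get(name).copied()
    }
}
```

Computing the index on demand means schemas that are created and discarded without any name lookup never pay for the map.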
We recently updated to the latest DataFusion and we've seen our planning time go from ~20ms to ~10ms! Great job on this. |
That is great to hear --- thanks for the report @matthewmturner. BTW I think there is still significant improvement to be had by completing #9637. I don't think we'll get it all done by 38.0.0 but I think we'll improve it some more |
Current progress
50% faster for tpcds and tpch planning
Note I expect another 30-40% combined savings between #10356 and #10209 and #9873 |
Here is where we currently stand with planning performance compared to 37 and 38.
Highlight: TPC-DS 76% faster planning, TPCH 64% faster
Test script
set -x -e
## This script compares planning speed on 37.0.0 and 38.0.0 against planning speed on main
git fetch -p apache
git fetch -p alamb
# remove old test runs
rm -rf target/criterion/
# Compare version 38
git checkout 38.0.0
cargo update
cargo bench --bench sql_planner -- --save-baseline "38.0.0"
# use a version of 37 with the tpcds benchmarks
git checkout alamb/37_bench
git reset --hard alamb/alamb/37_bench
cargo update
cargo bench --bench sql_planner -- --save-baseline "37.0.0"
echo "** Comparing to main"
git checkout main
git reset --hard apache/main
cargo update
cargo bench --bench sql_planner -- --save-baseline main
critcmp main "38.0.0" "37.0.0" |
This is a collection of tickets related to making DataFusion's planning speed faster. Planning speed is the time from a SQL string being created to when the ExecutionPlan is created.
- … (is_err()) #5309
- … Boxes with Arc in the Expr enum. #9577
- … Exprs and LogicalPlans so much during Common Subexpression Elimination #9873
- … LogicalPlan during OptimizerPasses #9637
- … CommonSubexprEliminate faster by stop copying so many strings #10426
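Several of the tickets above are about copying less during planning, and one mentions Box and Arc in the Expr enum. As a rough illustration of why that choice matters for clone cost, here is a sketch with two hypothetical, heavily simplified expression enums (neither is DataFusion's `Expr`, and the helper functions are made up for this example). Cloning the `Box` version deep-copies the child, while cloning the `Arc` version only bumps a reference count and shares the child allocation.

```rust
use std::sync::Arc;

/// Hypothetical, heavily simplified expression tree using `Box` children.
#[derive(Clone, Debug, PartialEq)]
pub enum BoxExpr {
    Column(String),
    Not(Box<BoxExpr>),
}

/// The same shape, but with `Arc` children.
#[derive(Clone, Debug, PartialEq)]
pub enum ArcExpr {
    Column(String),
    Not(Arc<ArcExpr>),
}

/// Returns true if a clone of `e` points at the same child allocation as `e`.
/// For `Box` this is false: `Box::clone` allocates and deep-copies the child.
pub fn box_clone_shares_child(e: &BoxExpr) -> bool {
    let copy = e.clone();
    match (e, &copy) {
        (BoxExpr::Not(a), BoxExpr::Not(b)) => std::ptr::eq(&**a, &**b),
        _ => false,
    }
}

/// Returns true if a clone of `e` points at the same child allocation as `e`.
/// For `Arc` this is true: `Arc::clone` only increments a reference count.
pub fn arc_clone_shares_child(e: &ArcExpr) -> bool {
    let copy = e.clone();
    match (e, &copy) {
        (ArcExpr::Not(a), ArcExpr::Not(b)) => Arc::ptr_eq(a, b),
        _ => false,
    }
}
```

The flip side is that mutating a shared `Arc` child requires something like `Arc::make_mut` or rebuilding the node, so this is a trade-off rather than a free win.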