Bad Join Order for TPCH Q17 results in slow performance #7949
Labels
bug
Something isn't working
Comments
This was referenced Oct 27, 2023
If we only focus on Query 17, is the ideal execution plan for this SQL statement supposed to be like the following?
Describe the bug
The join order chosen for TPCH query 17 is bad, making DataFusion take 50% longer to execute the query.
To Reproduce
Step 1: Create data:
cd arrow-datafusion/benchmarks
./bench.sh data tpch10
Step 2: Run query with datafusion-cli:
This takes 7.52 seconds.
However, if we change the query slightly (swapping the table order), it is much faster (4.5 seconds).
Here is the difference:
Expected behavior
DataFusion should pick the correct join order
Analysis
What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated with output row counts (I used the output of EXPLAIN ANALYZE):
Background: DataFusion Joins (I will also add this as documentation to DataFusion)
Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work.
The HashJoin operator in DataFusion takes two inputs:
Execution proceeds in 2 stages:
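This asymmetry in behavior has an important consequence: the smaller side should be the one that is hashed, because the hashed input must be held in memory in its entirety while the other input is only streamed through.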
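To make the asymmetry concrete, here is a minimal sketch of a hash join in Python. It is not DataFusion's implementation. It assumes the usual split into a "build" input that is fully hashed and a "probe" input that is streamed. The function name `hash_join` and the sample rows are hypothetical; the column names are borrowed from the TPCH schema for illustration only.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: hash `build_rows`, then stream `probe_rows`."""
    # Stage 1 (build): read the entire build input into an in-memory hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Stage 2 (probe): stream the probe input one row at a time and look up
    # matches; the probe side is never buffered.
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            yield {**match, **row}


# Hypothetical sample data: a small filtered table and a larger one.
part = [{"p_partkey": 1, "p_brand": "Brand#23"}]
lineitem = [
    {"l_partkey": 1, "l_quantity": 5},
    {"l_partkey": 2, "l_quantity": 7},
    {"l_partkey": 1, "l_quantity": 9},
]

# The small input is hashed; the large input is streamed.
result = list(hash_join(part, lineitem, "p_partkey", "l_partkey"))
```

Memory use in this sketch is proportional to the size of `build_rows` only, which is why passing the larger input as the build side is the expensive mistake.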
This means that in a classic "Star Schema Query", where there is one large table and several smaller "dimension" tables with predicates, the optimal plan will be a "Right Deep Tree". The optimal DataFusion plan will put the large table as the probe side of the lowest join.
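The effect of the tree shape can be illustrated with a rough cost sketch that counts how many rows each plan has to hash. Everything here is an assumption made for illustration: the table sizes are invented, the helper `rows_hashed` is hypothetical, and the output of each join is crudely estimated as the larger of its two inputs (a foreign-key join with no filtering). It is not how DataFusion estimates cost.

```python
def rows_hashed(plan, sizes):
    """Return (estimated_output_rows, total_rows_hashed) for a plan.

    A plan is either a table name or a tuple (build_side, probe_side).
    """
    if isinstance(plan, str):
        return sizes[plan], 0
    build, probe = plan
    build_rows, build_hashed = rows_hashed(build, sizes)
    probe_rows, probe_hashed = rows_hashed(probe, sizes)
    # Crude assumption: the join output is as large as its larger input.
    out_rows = max(build_rows, probe_rows)
    # Every row of the build side is inserted into a hash table.
    return out_rows, build_hashed + probe_hashed + build_rows


# Invented row counts: one large table and three small dimension tables.
sizes = {"fact": 1_000_000, "dim1": 100, "dim2": 1_000, "dim3": 10_000}

# Right-deep: the large table is the probe side of the lowest join.
right_deep = ("dim1", ("dim2", ("dim3", "fact")))
# Left-deep with the large table on the build side of the lowest join.
left_deep = ((("fact", "dim3"), "dim2"), "dim1")

_, right_cost = rows_hashed(right_deep, sizes)
_, left_cost = rows_hashed(left_deep, sizes)
print(right_cost, left_cost)
```

Under these assumptions the right-deep plan hashes only the dimension tables, while the other plan hashes a result as large as the big table at every level.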
Additional context
This is likely one of the root causes of #5646