Bad Join Order for TPCH Q17 results in slow performance #7949

alamb · 2023-10-27T14:15:40Z

Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take 50% longer to execute the query.

To Reproduce

Step 1: Create data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Step 2: Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	sum(l_extendedprice) / 7.0 as avg_yearly	from	lineitem,	part	where	p_partkey = l_partkey	and p_brand = 'Brand#23'	and p_container = 'MED BOX'	and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem	where	l_partkey = p_partkey	);"

Takes 7.52 seconds

+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 7.525 seconds.

However, if we change the query slightly (swap the table order) it is much faster (4.5 seconds)

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select   sum(l_extendedprice) / 7.0 as avg_yearly        from    part,lineitem   where   p_partkey = l_partkey   and p_brand = 'Brand#23'        and p_container = 'MED BOX' and l_quantity < (      select  0.2 * avg(l_quantity)   from    lineitem        where   l_partkey = p_partkey   );"

DataFusion CLI v32.0.0
+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 4.560 seconds.

Here is the difference:

select
sum(l_extendedprice) / 7.0 as avg_yearly from
- lineitem, part
+ part, lineitem
where
  p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem where	l_partkey = p_partkey	);

Expected behavior

DataFusion should pick the correct join order

Analysis

What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated from output row counts (I used the output of EXPLAIN ANALYZE):

                                       DataFusion TPCH SF10 Q17 Plan                  
                                      annotated with output row counts                
                                                                                      
                                                                                      
                                                                                      
                                                            1 Row                     
                                                                                      
                                                ┌────────────────┐                    
                                                │   Aggregate    │                    
                                                │                │                    
This join is building a 60M row hash            │                │                    
 table and probing 2M rows, rather              └────────────────┘                    
 than building a 2M row hash table                       │                            
    and probing a 60M row table                          ▼        2M Rows             
                  │                             ┌────────────────┐                    
                                                │  HashJoin (1)  │                    
                  │                             │   p_partkey =  │                    
                                                │   l_partkey    │                    
                  │         2044 Rows           └───┬────────┬───┘                    
                                                    │        │            2M Rows     
                  │                ┌────────────────┘        └──────────────┐         
                                   │                                        │         
                  │                ▼                                        ▼         
                          ┌────────────────┐                       ┌────────────────┐ 
                  │       │  HashJoin (2)  │                       │   Aggregate    │ 
                   ─ ─ ─ ▶│   p_partkey =  │                       │                │ 
                          │   l_partkey    │                       │                │ 
                          └───┬────────┬───┘                       └────────────────┘ 
           60M Rows           │        │        2M Rows                     │         
                    ┌─────────┘        └──────────┐                         │         
                    │                             │                         │         
                    ▼                             ▼                         ▼         
          ┌──────────────────┐          ┌──────────────────┐      ┌──────────────────┐
          │Scan: lineitem    │          │Scan: part        │      │Scan: part        │
          │(no filters)      │          │p_brand = ..      │      │(no filters)      │
          │                  │          │p_container = ..  │      │                  │
          └──────────────────┘          └──────────────────┘      └──────────────────┘

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work.

The HashJoin operator in DataFusion takes two inputs:

         ┌───────────┐         
         │ HashJoin  │         
         │           │         
         └───────────┘         
             │   │             
       ┌─────┘   └─────┐       
       ▼               ▼       
┌────────────┐  ┌─────────────┐
│   Input    │  │    Input    │
│    [0]     │  │     [1]     │
└────────────┘  └─────────────┘
                               
 "build side"    "probe side"

Execution proceeds in 2 stages:

                ┌────────────────┐          ┌────────────────┐                        
                │ ┌─────────┐    │          │ ┌─────────┐    │                        
                │ │  Hash   │    │          │ │  Hash   │    │                        
                │ │  Table  │    │          │ │  Table  │    │                        
                │ │(keys are│    │          │ │(keys are│    │                        
                │ │equi join│    │          │ │equi join│    │  Stage 2: batches from 
 Stage 1: the   │ │columns) │    │          │ │columns) │    │    the probe side are  
*entire* build  │ │         │    │          │ │         │    │  streamed through, and 
 side is read   │ └─────────┘    │          │ └─────────┘    │   checked against the  
into the hash   │      ▲         │          │          ▲     │   contents of the hash 
    table       │       HashJoin │          │  HashJoin      │          table         
                └──────┼─────────┘          └──────────┼─────┘                        
            ─ ─ ─ ─ ─ ─                                 ─ ─ ─ ─ ─ ─ ─                 
           │                                                         │                
                                                                                      
           │                                                         │                
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
          ...                                                       ...               
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
                                                                                      
       build side                                                probe side

This asymmetry in behavior has the important consequence that it is very important that the smaller side is hashed

So this means in a classic "Star Schema Query", the optimal plan will be a "Right Deep Tree" , where there is one large table and several smaller "dimension" tables, with predicates. The optimal DataFusion will put this large table as the probe side on the lowest join:

            ┌───────────┐                                    
            │ HashJoin  │                                    
            │           │                                    
            └───────────┘                                    
                │   │                                        
        ┌───────┘   └──────────┐                             
        ▼                      ▼                             
┌───────────────┐        ┌───────────┐                       
│ small table 3 │        │ HashJoin  │                       
│  "dimension"  │        │           │                       
└───────────────┘        └───┬───┬───┘                       
                  ┌──────────┘   └───────┐                   
                  │                      │                   
                  ▼                      ▼                   
          ┌───────────────┐        ┌───────────┐             
          │ small table 2 │        │ HashJoin  │             
          │  "dimension"  │        │           │             
          └───────────────┘        └───┬───┬───┘             
                              ┌────────┘   └────────┐        
                              │                     │        
                              ▼                     ▼        
                      ┌───────────────┐     ┌───────────────┐
                      │ small table 1 │     │  large table  │
                      │  "dimension"  │     │    "fact"     │
                      └───────────────┘     └───────────────┘

Additional context

This is likely one of the root causes of #5646

The text was updated successfully, but these errors were encountered:

artorias1024 · 2023-11-06T04:06:53Z

If we only focus on Query 17, is the ideal execution plan for this SQL statement supposed to be like the following?

                ┌───────────────────────────────────┐                     
                │             Aggregate             │                     
                │                                   │                     
                │                                   │                     
                └──────────────────┬────────────────┘                     
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │  Filter: lineitem.l_quantity <   │                     
                 │((0.2 * (avg(lineitem.l_quantity) │                     
                 │           OVER (?))))            │                     
                 │                                  │                     
                 └──────────────────────────────────┘                     
                                   │                                      
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │         Window Aggregate         │                     
                 │ output: avg(lineitem.l_quantity) │                     
                 │  OVER (?), lineitem.l_quantity,  │                     
                 │    lineitem.l_extendedprice,     │                     
                 └──────────────────────────────────┘                     
                                   │                                      
                                   │                                      
                                   │                                      
                                   ▼                                      
                 ┌──────────────────────────────────┐                     
                 │            Hash Join             │                     
                 │                                  │                     
                 │                                  │                     
                 │                                  │                     
                 └─────────┬───────────────┬────────┘                     
                                                                          
                           │               │                              
                ┌ ─ ─ ─ ─ ─                 ─ ─ ─ ─ ─ ─ ─                 
                                                         │                
                │                                                         
                                                         │                
                ▼                                        ▼                
┌───────────────────────────────┐        ┌───────────────────────────────┐
│       Table Scan: part        │        │     Table Scan: lineitem      │
│ Filter: part.p_brand = .. AND │        │                               │
│     part.p_container = ..     │        │                               │
│                               │        │                               │
└───────────────────────────────┘        └───────────────────────────────┘

alamb · 2023-11-06T12:15:40Z

If we only focus on Query 17, is the ideal execution plan for this SQL statement supposed to be like the following?

Here is what I think a much better plan would be:

alamb added the bug Something isn't working label Oct 27, 2023

andygrove self-assigned this Nov 10, 2023

andygrove mentioned this issue Nov 10, 2023

Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics #8126

Merged

Dandandan closed this as completed in #8126 Nov 12, 2023

alamb mentioned this issue Dec 31, 2023

Implement statistics support for Substrait #8698

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad Join Order for TPCH Q17 results in slow performance #7949

Bad Join Order for TPCH Q17 results in slow performance #7949

alamb commented Oct 27, 2023 •

edited

Loading

artorias1024 commented Nov 6, 2023

alamb commented Nov 6, 2023

Bad Join Order for TPCH Q17 results in slow performance #7949

Bad Join Order for TPCH Q17 results in slow performance #7949

Comments

alamb commented Oct 27, 2023 • edited Loading

Describe the bug

To Reproduce

Step 1: Create data:

Step 2: Run query with datafusion-cli:

Expected behavior

Analysis

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Additional context

artorias1024 commented Nov 6, 2023

alamb commented Nov 6, 2023

alamb commented Oct 27, 2023 •

edited

Loading