Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid extra copies in CoalesceBatchesExec to improve performance #7957

Open
Tracked by #11710
alamb opened this issue Oct 27, 2023 · 10 comments
Open
Tracked by #11710

Avoid extra copies in CoalesceBatchesExec to improve performance #7957

alamb opened this issue Oct 27, 2023 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 27, 2023

Is your feature request related to a problem or challenge?

While looking at TPCH query performance for #6782 I noticed several times that CoalesceBatchesExec takes non trivial amounts of time (like 5% of the overall query time)

Here is a specific examples

Create Data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	o_year,	sum(case	when nation = 'BRAZIL' then volume	else 0	end) / sum(volume) as mkt_share	from	(	select	extract(year from o_orderdate) as o_year,	l_extendedprice * (1 - l_discount) as volume,	n2.n_name as nation	from	part,	supplier,	lineitem,	orders,	customer,	nation n1,	nation n2,	region	where	p_partkey = l_partkey	and s_suppkey = l_suppkey	and l_orderkey = o_orderkey	and o_custkey = c_custkey	and c_nationkey = n1.n_nationkey	and n1.n_regionkey = r_regionkey	and r_name = 'AMERICA'	and s_nationkey = n2.n_nationkey	and o_orderdate between date '1995-01-01' and date '1996-12-31'	and p_type = 'ECONOMY ANODIZED STEEL'	) as all_nations	group by	o_year	order by	o_year;"
select	o_year,
  sum(case	when nation = 'BRAZIL' then volume	else 0	end) / sum(volume) as mkt_share
from (
  select
    extract(year from o_orderdate) as o_year,
    l_extendedprice * (1 - l_discount) as volume,
    n2.n_name as nation
  from
    part,
    supplier,
    lineitem,
    orders,
    customer,
    nation n1,
    nation n2,
    region
  where
    p_partkey = l_partkey
    and s_suppkey = l_suppkey
    and l_orderkey = o_orderkey
    and o_custkey = c_custkey
    and c_nationkey = n1.n_nationkey
    and n1.n_regionkey = r_regionkey
    and r_name = 'AMERICA'
    and s_nationkey = n2.n_nationkey
    and o_orderdate between date '1995-01-01' and date '1996-12-31'	and p_type = 'ECONOMY ANODIZED STEEL'	) as all_nations
  group by
 	o_year
  order by
  	o_year;"

Here is the full EXPLAIN ANLAYZE output:
explan-analyze-q8.txt

A small subset shows there is a single CoalesceBatchesExec that takes 3 seconds (elapsed_compute=3.066514072s):

CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=59986052, elapsed_compute=3.066514072s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
  RepartitionExec: partitioning=Hash([l_partkey@1], 16), input_partitions=16, metrics=[fetch_time=16.321190026s, repart_time=10.382230411s, send_time=4.650058274s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
     ParquetExec: file_groups={16 groups: [[Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10

I profiled the query and confirmed that CoalesceBatchExec takes 5% of the overall time, as shown in this screen shot

Screenshot 2023-10-27 at 7 30 00 AM

In diagrams this looks like

┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          RepartitonExec copies the data      
                      creates output batches with copies                  *again* to form final large         
                      of  the matching rows (calls take()                 RecordBatches                       
                      to make a copy)                                                                         
                                                                                                              

Describe the solution you'd like

I think we can avoid this overhead by combining the behavior of CoalesceBatchesExec into the operators that make small batches (FilterExec, JoinExec, and RepartitionExec). Something like

┌────────────────────┐        Filter                                                                        
│                    │                    ┌────────────────────┐          Filter output                     
│                    │    ─ ─ ─ ─ ─ ─ ▶   │        mask        │                                            
│    RecordBatch     │                    │   (BooleanArray)   │─ ─ ─ ─ ─ ┐                                 
│  num_rows = 8000   │                    └────────────────────┘                                            
│                    │                                                    │                                 
│                    │                                                                ┌────────────────────┐
└────────────────────┘                                                    │           │                    │
┌────────────────────┐                    ┌────────────────────┐                      │                    │
│                    │        Filter      │                    │          │           │                    │
│                    │                    │        mask        │           ─ ─ ─ ─ ─ ▶│                    │
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   (BooleanArray)   │─ ─ ─ ─ ─ ┐           │                    │
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │
│                    │                    └────────────────────┘                      │                    │
└────────────────────┘                                                                │                    │
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │
          ...                   ...                                       │           │                    │
                                                                                      │                    │
┌────────────────────┐                                                    │           └────────────────────┘
│                    │                    ┌────────────────────┐                                            
│                    │       Filter       │                    │          │                                 
│    RecordBatch     │                    │        mask        │                                            
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   (BooleanArray)   │─ ─ ─ ─ ─ ┘                                 
│                    │                    │                    │                                            
│                    │                    └────────────────────┘                                            
└────────────────────┘                                                                                      
                                                                                                            
                      FilterExec et all                             The Exec then copies rows from          
                      internally buffers RecordBatches and          multiple input batches (based on the    
                      filter results until enough rows are          masks) into a single new output         
                      ready                                         RecordBatch avoiding the second data    
                                                                    copy                                    

The idea would be to take the core coalesce logic from CoalesceBatchesExec that calls concat_batches And instead of creating new small record batches in FilterExec, HashJoinExec, and RepartitionExec buffer the inputs until there are at least target_batch_size rows available, and then call interleave instead

Here is the code in CoalesceBatchesExec that could be adapted:
https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/coalesce_batches.rs#L215-L283

Here is where FilterExec makes the potentially small record batches
https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/filter.rs#L294-L306

The same think would be done in RepartitionExec: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/repartition/mod.rs#L193-L218

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 27, 2023
@tustvold
Copy link
Contributor

Is the majority of the concat_batches overhead string concatenation? Just wondering if this is something that StringView might help with?

@alamb
Copy link
Contributor Author

alamb commented Oct 27, 2023

From an internal conversation, @tustvold points out that interleave is fair bit slower than the filter kernel, especially for primitives

The current code uses filter then concat_batches

The proposal would use interleave

Thus we may have to potentially special case how the columns are handled depending on type (e.g. what kernels are called for what operations)

@Dandandan Dandandan changed the title Avoid extra copies in CoalesceBatchesExec to improve performace Avoid extra copies in CoalesceBatchesExec to improve performance Nov 18, 2023
@Dandandan
Copy link
Contributor

Dandandan commented Nov 30, 2023

I tried (https://github.com/Dandandan/arrow-datafusion/tree/buffer_batches) to convert the code to use interleave instead of concat (CoalesceBatches) (for hash repartitioning), but this leads to no improvement whatsoever (even a small slowdown):

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ buffer_batches ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  93.31ms │        90.88ms │     no change │
│ QQuery 2     │  26.18ms │        28.39ms │  1.08x slower │
│ QQuery 3     │  53.96ms │        57.86ms │  1.07x slower │
│ QQuery 4     │  52.33ms │        51.77ms │     no change │
│ QQuery 5     │ 114.93ms │       125.69ms │  1.09x slower │
│ QQuery 6     │   9.38ms │         9.20ms │     no change │
│ QQuery 7     │ 211.60ms │       227.30ms │  1.07x slower │
│ QQuery 8     │  61.54ms │        59.56ms │     no change │
│ QQuery 9     │  60.66ms │        63.80ms │  1.05x slower │
│ QQuery 10    │ 114.31ms │       114.95ms │     no change │
│ QQuery 11    │  20.12ms │        20.54ms │     no change │
│ QQuery 12    │  59.98ms │        59.26ms │     no change │
│ QQuery 13    │  53.37ms │        56.42ms │  1.06x slower │
│ QQuery 14    │  18.47ms │        18.63ms │     no change │
│ QQuery 15    │  58.99ms │        59.31ms │     no change │
│ QQuery 16    │  22.79ms │        22.27ms │     no change │
│ QQuery 17    │  54.41ms │        53.59ms │     no change │
│ QQuery 18    │ 158.61ms │       162.68ms │     no change │
│ QQuery 19    │  36.98ms │        34.78ms │ +1.06x faster │
│ QQuery 20    │  62.49ms │        62.04ms │     no change │
│ QQuery 21    │ 251.42ms │       256.99ms │     no change │
│ QQuery 22    │  14.35ms │        14.14ms │     no change │
└──────────────┴──────────┴────────────────┴───────────────┘

@Dandandan
Copy link
Contributor

Dandandan commented Nov 30, 2023

Another finding to add to the list is that changing RepartitionExec to do a filter (with hash repartitioning) is a bit slower than take or interleave on tpc-h benchmarks.

@alamb
Copy link
Contributor Author

alamb commented Nov 30, 2023

Thanks for the info @Dandandan -- those are some interesting results

@alamb
Copy link
Contributor Author

alamb commented Jul 23, 2024

I made some pictures for what I had in mind

This is what happens today (FilterExec + RepartitionExec):

┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          RepartitonExec copies the data      
                      creates output batches with copies                  *again* to form final large         
                      of  the matching rows (calls take()                 RecordBatches                       
                      to make a copy)                                                                         
                                                                                                              

Instead do it like this (only buffer the filter mask results and create the final output in one go):

┌────────────────────┐        Filter                                                                        
│                    │                    ┌────────────────────┐          Filter output                     
│                    │    ─ ─ ─ ─ ─ ─ ▶   │        mask        │                                            
│    RecordBatch     │                    │   (BooleanArray)   │─ ─ ─ ─ ─ ┐                                 
│  num_rows = 8000   │                    └────────────────────┘                                            
│                    │                                                    │                                 
│                    │                                                                ┌────────────────────┐
└────────────────────┘                                                    │           │                    │
┌────────────────────┐                    ┌────────────────────┐                      │                    │
│                    │        Filter      │                    │          │           │                    │
│                    │                    │        mask        │           ─ ─ ─ ─ ─ ▶│                    │
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   (BooleanArray)   │─ ─ ─ ─ ─ ┐           │                    │
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │
│                    │                    └────────────────────┘                      │                    │
└────────────────────┘                                                                │                    │
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │
          ...                   ...                                       │           │                    │
                                                                                      │                    │
┌────────────────────┐                                                    │           └────────────────────┘
│                    │                    ┌────────────────────┐                                            
│                    │       Filter       │                    │          │                                 
│    RecordBatch     │                    │        mask        │                                            
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   (BooleanArray)   │─ ─ ─ ─ ─ ┘                                 
│                    │                    │                    │                                            
│                    │                    └────────────────────┘                                            
└────────────────────┘                                                                                      
                                                                                                            
                      FilterExec et all                             The Exec then copies rows from          
                      internally buffers RecordBatches and          multiple input batches (based on the    
                      filter results until enough rows are          masks) into a single new output         
                      ready                                         RecordBatch avoiding the second data    
                                                                    copy                                    

@alamb
Copy link
Contributor Author

alamb commented Jul 23, 2024

Maybe we could add a filter_multi kernel along side https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html

The signature would be like

/// filters paris of arrays / predicates into a single output arrauy
pub fn filter_multi(
    input: impl IntoIterator<Item = (&dyn Array, &BooleanArray>)
) -> Result<Arc<dyn Array>, ArrowError>

And then the internal machinery could be the same

@2010YOUY01
Copy link
Contributor

TPCH-Q1's expensive CoalesceBatches might be easier to solve 🤔
Unlike TPCH-Q8 (looks like with more selective predicate), Q1's filter only throws away a small number of rows:

input_batch(4096 rows) -> [FILTER] -> output_batch(~4000 rows)

And the following CoalesceBatches condition will be triggered every time to copy large output batches

if self.buffered_rows >= self.target_batch_size {

But if it didn't do coalescing, the output batch still benefits from vectorization, so maybe this coalescing threshold can be better tuned to like if self.buffered_rows >= 0.6 * self.target_batch_size {

I remember I tried before, but the overall performance on Q1 is like 2% improvement, I think it's possible to set a better threshold for triggering coalescing.

@alamb
Copy link
Contributor Author

alamb commented Jul 25, 2024

Thanks @2010YOUY01 -- that sounds neat. Also thanks for the tip related to Q1 as a good candidate query this could improve.

In general I agree there are likely easier ways to improve specific queries. I am still bullish on the more general idea off trying to avoid the copy (I think it would be good for several percentage points across many queries), though it will be a fairly substantial taks

I am going to spend some more time prototyping and see how it goes

@alamb
Copy link
Contributor Author

alamb commented Jul 29, 2024

I am pretty happy with how this is headed.

I have a PR to start refactoring the code here #11610

I am also feeling good about my WIP prototype

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants