-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid extra copies in CoalesceBatchesExec
to improve performance
#7957
Comments
Is the majority of the concat_batches overhead string concatenation? Just wondering if this is something that StringView might help with? |
From an internal conversation, @tustvold points out that The current code uses The proposal would use Thus we may have to potentially special case how the columns are handled depending on type (e.g. what kernels are called for what operations) |
CoalesceBatchesExec
to improve performaceCoalesceBatchesExec
to improve performance
I tried (https://github.com/Dandandan/arrow-datafusion/tree/buffer_batches) to convert the code to use
|
Another finding to add to the list is that changing |
Thanks for the info @Dandandan -- those are some interesting results |
I made some pictures for what I had in mind This is what happens today (
Instead do it like this (only buffer the filter mask results and create the final output in one go):
|
Maybe we could add a The signature would be like /// filters paris of arrays / predicates into a single output arrauy
pub fn filter_multi(
input: impl IntoIterator<Item = (&dyn Array, &BooleanArray>)
) -> Result<Arc<dyn Array>, ArrowError> And then the internal machinery could be the same |
TPCH-Q1's expensive
And the following
But if it didn't do coalescing, the output batch still benefits from vectorization, so maybe this coalescing threshold can be better tuned to like if self.buffered_rows >= 0.6 * self.target_batch_size {
I remember I tried before, but the overall performance on Q1 is like 2% improvement, I think it's possible to set a better threshold for triggering coalescing. |
Thanks @2010YOUY01 -- that sounds neat. Also thanks for the tip related to Q1 as a good candidate query this could improve. In general I agree there are likely easier ways to improve specific queries. I am still bullish on the more general idea off trying to avoid the copy (I think it would be good for several percentage points across many queries), though it will be a fairly substantial taks I am going to spend some more time prototyping and see how it goes |
I am pretty happy with how this is headed. I have a PR to start refactoring the code here #11610 I am also feeling good about my WIP prototype |
Is your feature request related to a problem or challenge?
While looking at TPCH query performance for #6782 I noticed several times that
CoalesceBatchesExec
takes non trivial amounts of time (like 5% of the overall query time)Here is a specific examples
Create Data:
cd arrow-datafusion/benchmarks ./bench.sh data tpch10
Run query with datafusion-cli:
Here is the full
EXPLAIN ANLAYZE
output:explan-analyze-q8.txt
A small subset shows there is a single
CoalesceBatchesExec
that takes 3 seconds (elapsed_compute=3.066514072s
):I profiled the query and confirmed that
CoalesceBatchExec
takes 5% of the overall time, as shown in this screen shotIn diagrams this looks like
Describe the solution you'd like
I think we can avoid this overhead by combining the behavior of
CoalesceBatchesExec
into the operators that make small batches (FilterExec
,JoinExec
, andRepartitionExec
). Something likeThe idea would be to take the core coalesce logic from
CoalesceBatchesExec
that callsconcat_batches
And instead of creating new small record batches inFilterExec
,HashJoinExec
, andRepartitionExec
buffer the inputs until there are at leasttarget_batch_size
rows available, and then callinterleave
insteadHere is the code in
CoalesceBatchesExec
that could be adapted:https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/coalesce_batches.rs#L215-L283
Here is where FilterExec makes the potentially small record batches
https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/filter.rs#L294-L306
The same think would be done in RepartitionExec: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/repartition/mod.rs#L193-L218
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: