You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
That's a great point
The threshold of CoalesceBatches might be a related place for tuning: https://github.com/apache/arrow-datafusion/blob/52cf58b46133d448e067455baab0faf8a50e565a/datafusion/core/src/physical_plan/coalesce_batches.rs#L230
The current implementation I think will trigger coalesce when the input batch is < default batch size
For example, if two consecutive inputs of CoalesceBatchesExec have batch size 8000 (default 8192), then they will be concatenated, introducing unnecessary memcpy
This will happen if the query has some high selectivity predicates (e.g. TPCH Q1), I experimented setting the coalescing threshold to target_batch_size * 0.8, Q1 can run ~20% faster
Is your feature request related to a problem or challenge?
A small test to double the batch size (from 8192 to 16384) shows some performance improvements (~10%) on some queries:
This is nice for such a small change and aligns with #6287
Describe the solution you'd like
Configure a (more) optimal default
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: