Fuse grouped aggregate and filter operators for improved performance #5944

andygrove · 2023-04-10T14:16:18Z

Is your feature request related to a problem or challenge?

When we perform a grouped aggregate on a filtered input (such as with TPC-H q1), the filter operator performs two main tasks:

1. Evaluate the filter predicate (usually very fast)
2. Create new batches and copy over the filtered data (very slow if the filter is not very selective, as in q1)

I wonder if we would see a significant performance improvement if we could avoid creating the filtered batches in this case.

One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions.

Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate.
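As a rough illustration of this second idea, here is a minimal sketch in plain Rust (no Arrow types) of a grouped sum that consults the predicate mask while accumulating, rather than copying the surviving rows into a new batch first. The function name `fused_grouped_sum` and the slice-based inputs are hypothetical and only stand in for real Arrow arrays and DataFusion accumulators.

```rust
use std::collections::HashMap;

/// Hypothetical fused filter + grouped sum: rows where `mask[i]` is false
/// are skipped during accumulation instead of being filtered out up front.
fn fused_grouped_sum(keys: &[u32], values: &[i64], mask: &[bool]) -> HashMap<u32, i64> {
    assert_eq!(keys.len(), values.len());
    assert_eq!(keys.len(), mask.len());
    let mut sums: HashMap<u32, i64> = HashMap::new();
    for i in 0..keys.len() {
        if mask[i] {
            *sums.entry(keys[i]).or_insert(0) += values[i];
        }
    }
    sums
}

fn main() {
    let keys: [u32; 5] = [1, 2, 1, 2, 1];
    let values: [i64; 5] = [10, 20, 30, 40, 50];
    // Stand-in for the evaluated filter predicate.
    let mask: Vec<bool> = values.iter().map(|v| *v <= 40).collect();
    let sums = fused_grouped_sum(&keys, &values, &mask);
    println!("group 1 = {:?}, group 2 = {:?}", sums.get(&1), sums.get(&2));
}
```

No filtered copy of `keys` or `values` is ever created; the only extra allocation is the mask itself.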

Describe the solution you'd like

I am working on a small prototype of this, outside of DataFusion, that I will share once the code is less embarrassing.

Describe alternatives you've considered

It would be worth seeing how other engines handle this.

Additional context

No response

andygrove · 2023-04-10T14:16:47Z

@Dandandan @alamb wdyt?

Dandandan · 2023-04-10T14:31:43Z

One other source of inefficiency (which should be relatively easy to change) is that we currently output the entire RecordBatch, including the columns that are only needed to evaluate the filter, and then throw those columns away in the following Projection.

See #5436

Dandandan · 2023-04-10T14:36:13Z

In q1 this would remove l_shipdate from the list of 7 columns to be copied, so I would expect a smaller improvement here.
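To make the point concrete, here is a small sketch in plain Rust (not the DataFusion API) of a filter that copies only the columns the downstream operators need. The name `filter_projected` and the `Vec<Vec<i64>>` column layout are assumptions made for illustration.

```rust
/// Hypothetical filter that copies only the projected columns.
/// Columns used solely by the predicate are never copied.
fn filter_projected(columns: &[Vec<i64>], mask: &[bool], projection: &[usize]) -> Vec<Vec<i64>> {
    projection
        .iter()
        .map(|&c| {
            columns[c]
                .iter()
                .zip(mask)
                .filter(|(_, keep)| **keep)
                .map(|(v, _)| *v)
                .collect()
        })
        .collect()
}

fn main() {
    // Column 0 plays the role of the predicate-only column (like l_shipdate).
    let columns = vec![vec![1, 2, 3], vec![10, 20, 30]];
    let mask: Vec<bool> = columns[0].iter().map(|d| *d != 2).collect();
    // Only column 1 is needed after the filter, so only it is copied.
    let out = filter_projected(&columns, &mask, &[1]);
    println!("{:?}", out);
}
```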

Dandandan · 2023-04-10T14:38:09Z

I remember seeing some issues/papers about a similar approach to this before; maybe those were shared by @alamb?

alamb · 2023-04-10T17:20:35Z

> One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions.

I think this basic idea is called a "selection vector" in the literature -- and as you hint at, it is not quite the same as the null mask as it has different semantics.
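One way to see the difference in semantics is a sketch like the following, in plain Rust with `Option<i64>` standing in for a nullable Arrow array. The helper names are hypothetical. Hiding filtered rows in the validity mask keeps the row count unchanged, so an aggregate that ignores NULLs (such as SUM) still agrees with a real filter, while one that counts rows (such as COUNT(*)) does not.

```rust
/// Fold the filter into validity: filtered-out rows become NULL (None).
fn hide_in_validity(values: &[Option<i64>], keep: &[bool]) -> Vec<Option<i64>> {
    values.iter().zip(keep).map(|(v, k)| if *k { *v } else { None }).collect()
}

/// Apply a selection mask: filtered-out rows are dropped, surviving NULLs stay NULL.
fn apply_selection(values: &[Option<i64>], keep: &[bool]) -> Vec<Option<i64>> {
    values.iter().zip(keep).filter(|(_, k)| **k).map(|(v, _)| *v).collect()
}

/// SUM skips NULLs.
fn sum(values: &[Option<i64>]) -> i64 {
    values.iter().flatten().sum()
}

/// COUNT(*) counts every row, NULL or not.
fn count_star(values: &[Option<i64>]) -> usize {
    values.len()
}

fn main() {
    let values = [Some(1), None, Some(3), Some(4)];
    let keep = [true, true, false, true];
    let hidden = hide_in_validity(&values, &keep);
    let selected = apply_selection(&values, &keep);
    // SUM agrees between the two representations.
    println!("sum: {} vs {}", sum(&hidden), sum(&selected));
    // COUNT(*) does not: the hidden row is still counted.
    println!("count(*): {} vs {}", count_star(&hidden), count_star(&selected));
}
```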

One approach might be to add another enum type to ColumnarValue that had an additional selection mask

https://github.com/apache/arrow-datafusion/blob/bbc71692fcd8dd9f3a9686162e59d092b37031f2/datafusion/expr/src/columnar_value.rs#L33

After @tustvold 's recent work in Arrow, I think this would just be a https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html and should be straightforward to use.

To really take advantage of a selection vector, however, the underlying compute kernels need to be updated to know how to ignore the selection vectors (and likely only do so when they are sparse)
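A very rough sketch of what "selection-aware, but only when sparse" could look like for a sum kernel, in plain Rust. The function `sum_selected` and its `sparse_threshold` parameter are hypothetical and do not correspond to any existing arrow-rs kernel; the right threshold would have to be found by benchmarking.

```rust
/// Hypothetical selection-aware sum that picks a strategy by selectivity.
fn sum_selected(values: &[i64], mask: &[bool], sparse_threshold: f64) -> i64 {
    assert_eq!(values.len(), mask.len());
    let selected = mask.iter().filter(|m| **m).count();
    let selectivity = selected as f64 / mask.len().max(1) as f64;
    if selectivity < sparse_threshold {
        // Sparse selection: materialize the selected indices and gather.
        let indices: Vec<usize> = mask
            .iter()
            .enumerate()
            .filter(|(_, m)| **m)
            .map(|(i, _)| i)
            .collect();
        indices.iter().map(|&i| values[i]).sum()
    } else {
        // Dense selection: visit every row and zero out the unselected ones.
        values.iter().zip(mask).map(|(v, m)| v * (*m as i64)).sum()
    }
}

fn main() {
    let values = [5, 7, 11, 13];
    let mask = [false, true, false, true];
    // A threshold of 1.0 forces the sparse path here, 0.0 forces the dense path.
    println!("{}", sum_selected(&values, &mask, 1.0));
    println!("{}", sum_selected(&values, &mask, 0.0));
}
```

Both paths return the same result; only the access pattern differs.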

> Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate.

While not exactly the same, @yjshen has been working to add filtering to the aggregate input, which is similar: #5868

tustvold · 2023-04-10T21:40:46Z

apache/arrow-rs#3620 may be related

tustvold · 2023-06-14T12:12:52Z

> This would potentially change the semantics in some cases though so we can probably only do this under certain conditions

I'm curious about this, in what situation would the nullability or not of a non-selected value matter? It is just going to be discarded regardless? See apache/arrow-rs#3620

andygrove · 2024-07-31T15:28:22Z

I did some research / prototyping on this idea.

I used this logic to create a selection vector from a predicate bit mask:

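    use arrow::array::Int32Builder;

    // Assumes `predicate` is the BooleanArray produced by evaluating the filter.
    let num_true = predicate.true_count();
    let mut b = Int32Builder::with_capacity(num_true);
    for i in 0..predicate.len() {
        // Record the row index of every row that passes the filter.
        if predicate.value(i) {
            b.append_value(i as i32);
        }
    }
    let selection_vector = b.finish();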

The selection vector can then be used in a naive sum aggregate like this:

    let mut sum = 0;
    for i in 0..selection_vector.len() {
        sum += array.value(selection_vector.value(i) as usize);
    }

If we ignore the cost of creating the selection vector, then this approach is faster than filtering the batch first. However, the cost of creating the selection vector is similar to filtering a batch in some cases, such as when we are aggregating a single column.

I'm less excited about using selection vectors for aggregates at this point.

andygrove added the enhancement (New feature or request) and performance (Make DataFusion faster) labels Apr 10, 2023

This was referenced Apr 11, 2023

Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation apache/arrow-rs#3620

Open

[EPIC] A list of performance improvement tickets #5546

Open

yjshen mentioned this issue Apr 16, 2023

Arrow compute kernel regards selection vector apache/arrow-rs#4095

Closed

mingmwang mentioned this issue Apr 17, 2023

Row accumulator support update Scalar values #6003

Merged

andygrove self-assigned this Jun 13, 2024

andygrove mentioned this issue Jun 13, 2024

[EPIC] Improving Performance apache/datafusion-comet#566

Open

andygrove closed this as completed Jul 31, 2024

andygrove mentioned this issue Jul 31, 2024

[EPIC] Performance focus for 0.2.0 Release apache/datafusion-comet#717

Closed

alamb mentioned this issue Aug 1, 2024

Reuse hash #11708

Closed
