Support vectorized append and compare for multi group by #12996

Merged (63 commits) on Nov 6, 2024

Conversation

@Rachelint (Contributor) commented Oct 18, 2024

Which issue does this PR close?

Closes #.

Related to

Rationale for this change

Although GroupValuesColumn stores the multi group by values in a column-oriented way, it still uses a row-oriented approach to perform append and equal to.

The most obvious overhead is that we need to downcast the array when processing each row; the instructions for the downcast are actually not few, and even worse it introduces branches.
And I guess the row-oriented approach also increases random memory accesses, but I am not sure.

What changes are included in this PR?

This PR introduces vectorized append and vectorized equal to for GroupValuesColumn.

However, such a vectorized approach is not compatible with streaming aggregation, which depends on the order between input rows and their corresponding group indices.

So I define a new VectorizedGroupValuesColumn for optimizing the non-streaming aggregation cases, and keep the original GroupValuesColumn for the streaming aggregation cases.
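To make the trade-off concrete, here is a minimal, self-contained sketch in plain Rust (a hypothetical `AnyColumn` enum standing in for Arrow's `ArrayRef` and its downcast machinery, not DataFusion's actual types) of why a per-row downcast is costly and how a vectorized pass hoists it out of the loop:

```rust
// Hypothetical stand-in for Arrow's dynamically typed arrays.
enum AnyColumn {
    Int64(Vec<i64>),
    #[allow(dead_code)]
    Utf8(Vec<String>),
}

// Row-oriented: the enum match (the "downcast") runs once per row,
// adding instructions and a branch to every iteration.
fn append_rows_scalar(dst: &mut Vec<i64>, col: &AnyColumn, rows: &[usize]) {
    for &row in rows {
        if let AnyColumn::Int64(values) = col {
            dst.push(values[row]);
        }
    }
}

// Vectorized: the match runs once per batch, and the inner loop is a
// tight gather the compiler can optimize freely.
fn append_rows_vectorized(dst: &mut Vec<i64>, col: &AnyColumn, rows: &[usize]) {
    if let AnyColumn::Int64(values) = col {
        dst.extend(rows.iter().map(|&row| values[row]));
    }
}

fn main() {
    let col = AnyColumn::Int64(vec![10, 20, 30, 40]);
    let (mut a, mut b) = (Vec::new(), Vec::new());
    append_rows_scalar(&mut a, &col, &[0, 2]);
    append_rows_vectorized(&mut b, &col, &[0, 2]);
    assert_eq!(a, b);
    assert_eq!(b, vec![10, 30]);
}
```

Both functions produce the same result; the difference is purely in where the type dispatch happens.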

Are these changes tested?

Yes, I believe sufficient new unit tests have been added.

Are there any user-facing changes?

No.

@@ -128,6 +132,15 @@ impl<T: ArrowPrimitiveType, const NULLABLE: bool> GroupColumn
}
}

fn append_non_nullable_val(&mut self, array: &ArrayRef, row: usize) {
if NULLABLE {
self.nulls.append(false);
@Dandandan (Contributor) commented Oct 18, 2024

This could be optimized to append nulls for entire batch instead of per value
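The batch-vs-per-value idea can be sketched as follows, with a plain `Vec<bool>` standing in for Arrow's packed null-bit builder (the `NullMask` type and its methods are illustrative, not the real API):

```rust
// Sketch of appending nulls for an entire batch instead of per value.
struct NullMask {
    bits: Vec<bool>, // true = valid, false = null
}

impl NullMask {
    fn new() -> Self {
        NullMask { bits: Vec::new() }
    }

    // Per-value path: one call, and one capacity check, per row.
    fn append(&mut self, is_valid: bool) {
        self.bits.push(is_valid);
    }

    // Batched path: a single resize covers the whole run of nulls,
    // so the capacity check happens once.
    fn append_n_nulls(&mut self, n: usize) {
        let new_len = self.bits.len() + n;
        self.bits.resize(new_len, false);
    }
}

fn main() {
    let mut per_value = NullMask::new();
    for _ in 0..4 {
        per_value.append(false);
    }

    let mut batched = NullMask::new();
    batched.append_n_nulls(4);

    assert_eq!(per_value.bits, batched.bits);
}
```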

@Rachelint (Contributor, Author):

Yes, I plan to refactor the interface to support taking a `rows: &[usize]` input,
make all parts' appends vectorized, and then measure the performance again.

Contributor:

Cool :)

@Rachelint (Contributor, Author) commented Oct 19, 2024

I added the append_batch function to better support vectorized append.
But the improvement still seems not obvious. #12996 (comment)

🤔 I guess it is likely due to the newly introduced branch in equal_to:

    if *group_idx < group_values_len {
        for (i, group_val) in self.group_values.iter().enumerate() {
            if !check_row_equal(group_val.as_ref(), *group_idx, &cols[i], row) {
                return false;
            }
        }
    } else {
        let row_idx_offset = group_idx - group_values_len;
        let row_idx = self.append_rows_buffer[row_idx_offset];
        return is_rows_eq(cols, row, cols, row_idx).unwrap();
    }

@Rachelint (Contributor, Author) commented Oct 19, 2024

To eliminate this extra branch, I think we need to refactor the intern process mentioned in #12821 (comment)

I am trying it.

@Rachelint (Contributor, Author) commented Oct 19, 2024

The latest benchmark numbers:

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.67ms │                 0.68ms │     no change │
│ QQuery 1     │    67.01ms │                65.25ms │     no change │
│ QQuery 2     │   165.14ms │               157.75ms │     no change │
│ QQuery 3     │   181.43ms │               181.83ms │     no change │
│ QQuery 4     │  1566.65ms │              1574.95ms │     no change │
│ QQuery 5     │  1539.79ms │              1532.81ms │     no change │
│ QQuery 6     │    61.01ms │                57.01ms │ +1.07x faster │
│ QQuery 7     │    77.09ms │                73.02ms │ +1.06x faster │
│ QQuery 8     │  1971.64ms │              1762.88ms │ +1.12x faster │
│ QQuery 9     │  1921.59ms │              1903.47ms │     no change │
│ QQuery 10    │   516.35ms │               499.35ms │     no change │
│ QQuery 11    │   590.99ms │               556.80ms │ +1.06x faster │
│ QQuery 12    │  1814.14ms │              1816.26ms │     no change │
│ QQuery 13    │  2956.07ms │              2954.48ms │     no change │
│ QQuery 14    │  2054.42ms │              1940.82ms │ +1.06x faster │
│ QQuery 15    │  1899.87ms │              1873.73ms │     no change │
│ QQuery 16    │  4066.16ms │              3744.25ms │ +1.09x faster │
│ QQuery 17    │  3629.16ms │              3428.06ms │ +1.06x faster │
│ QQuery 18    │  8282.13ms │              7646.27ms │ +1.08x faster │
│ QQuery 19    │   144.20ms │               146.30ms │     no change │
│ QQuery 20    │  3222.65ms │              3224.85ms │     no change │
│ QQuery 21    │  3924.86ms │              3913.65ms │     no change │
│ QQuery 22    │  9144.86ms │              9022.44ms │     no change │
│ QQuery 23    │ 23875.41ms │             23664.41ms │     no change │
│ QQuery 24    │  1123.53ms │              1132.05ms │     no change │
│ QQuery 25    │  1011.03ms │              1002.87ms │     no change │
│ QQuery 26    │  1326.71ms │              1319.49ms │     no change │
│ QQuery 27    │  4666.49ms │              4662.07ms │     no change │
│ QQuery 28    │ 24069.75ms │             24145.85ms │     no change │
│ QQuery 29    │   902.07ms │               890.73ms │     no change │
│ QQuery 30    │  1813.79ms │              1722.40ms │ +1.05x faster │
│ QQuery 31    │  2008.03ms │              1977.28ms │     no change │
│ QQuery 32    │  7369.56ms │              7601.38ms │     no change │
│ QQuery 33    │  9752.79ms │              9742.50ms │     no change │
│ QQuery 34    │  9716.57ms │              9696.95ms │     no change │
│ QQuery 35    │  2760.71ms │              2244.23ms │ +1.23x faster │
│ QQuery 36    │   255.12ms │               241.01ms │ +1.06x faster │
│ QQuery 37    │   158.70ms │               154.80ms │     no change │
│ QQuery 38    │   155.15ms │               153.09ms │     no change │
│ QQuery 39    │   595.64ms │               587.48ms │     no change │
│ QQuery 40    │    57.09ms │                60.69ms │  1.06x slower │
│ QQuery 41    │    53.32ms │                52.81ms │     no change │
│ QQuery 42    │    65.53ms │                65.13ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘

@Rachelint changed the title from "POC: Vectorize append value" to "POC: Vectorized hashtable for aggregation" on Oct 20, 2024
struct AggregationHashTable<T: AggregationHashTableEntry> {
/// Raw table storing values in a `Vec`
raw_table: Vec<T>,
Contributor:

Based on some experiments in changing hash join algorithm, I think it's likely hashbrown performs much better than implementing a hashtable ourselves although I would like to be surprised 🙂

@Rachelint (Contributor, Author):

> Based on some experiments in changing hash join algorithm, I think it's likely hashbrown performs much better than implementing a hashtable ourselves although I would like to be surprised 🙂

🤔 Even if we can perform something like vectorized compare or vectorized append in our own hashtable?

I found that in the multi group by case, we perform the compare for each row, leading to the array downcasting again and again... and the downcast operation is actually compiled to many asm instructions...

And I found we can't eliminate that and perform the vectorized compare with hashbrown...

    fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {
        let array = array.as_byte_view::<B>();

@Dandandan (Contributor) commented Oct 20, 2024

We can still do "vectorized compare" by doing the lookup in the hashtable (based on hash value only) and the vectorized equality check separately. That way you still can use the fast hashtable, but move the equality check to a separate/vectorized step.
That's at least what is done in the vectorized hash join implementation :). I changed it before to use a Vec-based index like you did here, but that performed significantly worse.

Contributor:

The reason, I think, is that the lookup is incredibly well optimized using the swiss table design and you get fewer "false" candidates to check, while we can still use the vectorized/type-specialized equality check.
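The two-phase probe described above can be sketched like this, with std's `HashMap` standing in for hashbrown's raw table and a single `i64` key column standing in for the group columns (all names and the collision handling here are illustrative, not DataFusion's actual implementation):

```rust
use std::collections::HashMap;

// Phase 1: hash-only lookup. Phase 2: separate, type-specialized
// equality pass over all candidates at once ("vectorized equal to").
fn probe(
    map: &HashMap<u64, usize>, // hash -> candidate group index
    group_values: &[i64],      // stored group keys, column oriented
    hashes: &[u64],            // precomputed hashes of the input rows
    input: &[i64],             // input rows (single key column)
) -> Vec<Option<usize>> {
    // Phase 1: no key comparison, no per-row downcasting.
    let candidates: Vec<Option<usize>> =
        hashes.iter().map(|h| map.get(h).copied()).collect();

    // Phase 2: the column is matched once, then compared in a tight loop.
    candidates
        .iter()
        .zip(input.iter())
        .map(|(cand, v)| match cand {
            Some(g) if group_values[*g] == *v => Some(*g),
            _ => None, // no candidate, or a hash collision: a new group
        })
        .collect()
}

fn main() {
    let group_values = vec![10, 20];
    let mut map = HashMap::new();
    map.insert(100u64, 0usize); // pretend hash(10) == 100
    map.insert(200u64, 1usize); // pretend hash(20) == 200

    let found = probe(&map, &group_values, &[100, 200, 300], &[10, 20, 30]);
    assert_eq!(found, vec![Some(0), Some(1), None]);
}
```

The real code must additionally handle several candidates per hash and append the `None` rows as new groups; this sketch only shows the split between the hash lookup and the equality check.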

@Rachelint (Contributor, Author):

Makes sense, thank you!

@Rachelint (Contributor, Author):

The logic is a bit complex; I plan to finish it and benchmark it today.

@alamb (Contributor) left a comment:

I started going through this code -- I am finding it a really nice read. Nice work @Rachelint @jayzhan211 and @Dandandan

My only real high level concern here is that we have to retain the GroupValuesColumn as well -- not only does this now have more code to maintain, but the number of paths to test / verify is getting larger too

Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?

I plan to keep reviewing this over the weekend.

/// is used to store the rows that will be processed in the next round.
scalarized_indices: Vec<usize>,

/// The `vectorized_equal_to` row indices buffer
Contributor:

Maybe we can rename these to "buffer" or something to make it clear they are temp processing space to avoid re-allocations rather than part of the core state.

Something like

    buffer_equal_to_row_indices: Vec<usize>,

Or maybe we can even put all the scratch space into their own struct to make it clear

struct ScratchSpace {
    vectorized_equal_to_row_indices: Vec<usize>,

    /// The `vectorized_equal_to` group indices buffer
    vectorized_equal_to_group_indices: Vec<usize>,

    /// The `vectorized_equal_to` result buffer
    vectorized_equal_to_results: Vec<bool>,

    /// The `vectorized append` row indices buffer
    vectorized_append_row_indices: Vec<usize>,
}

Or something

@Rachelint (Contributor, Author):

Good idea for readability, I defined VectorizedOperationBuffers to hold such buffers.

groups.clear();
groups.resize(n_rows, usize::MAX);

let mut batch_hashes = mem::take(&mut self.hashes_buffer);
Contributor:

👍

@@ -143,8 +148,12 @@ pub fn new_group_values(schema: SchemaRef) -> Result<Box<dyn GroupValues>> {
}
}

if GroupValuesColumn::supported_schema(schema.as_ref()) {
Ok(Box::new(GroupValuesColumn::try_new(schema)?))
if column::supported_schema(schema.as_ref()) {
Contributor:

Can you explain here why GroupOrdering::None is required? Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?

If that is the case, it seems like maybe emit_n would never be called 🤔

@jayzhan211 (Contributor) commented Nov 1, 2024

> Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?

Yes, because we now process all the rows at once (not one by one like before), some rows are appended beforehand so they are not kept in order.

> If that is the case, it seems like maybe emit_n would never be called

emit_early_if_necessary may be called

@Rachelint (Contributor, Author):

> Can you explain here why GroupOrdering::None is required? Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?
>
> If that is the case, it seems like maybe emit_n would never be called 🤔

The situation is just as @jayzhan211 mentioned, and the detail about why GroupOrdering::None is needed can also be seen here:
https://github.com/Rachelint/arrow-datafusion/blob/406acb4983efe0c2072c5d7759674eec9db9404a/datafusion/physical-plan/src/aggregates/group_values/column.rs#L792-L834

@Rachelint (Contributor, Author) commented Nov 2, 2024

> Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?

🤔 I think it can be unified simply; VectorizedGroupValuesColumn::scalarized_intern is similar to GroupValuesColumn::intern.
But its logic is much more complex, and I am afraid of a performance regression in streaming aggregation.

The alternative is that we support a dedicated intern in VectorizedGroupValuesColumn which is exactly the same as GroupValuesColumn::intern.
It will not be so hard to do, because GroupValuesColumn::intern can be seen as a simpler version of VectorizedGroupValuesColumn::scalarized_intern.

🤔 I personally prefer the second one. What do you think about it @alamb ?

/// And we use [`GroupIndexView`] to represent such `group indices` in table.
///
///
map: RawTable<(u64, GroupIndexView)>,
Contributor:

Is it the case

  1. If group1 and group2 have exactly the same hash value, GroupIndexView will use chaining to resolve the collision
  2. If group1 and group2 have different hash values but map to the same slot in hash table, hashbrown will handle the collision for you with probing

@Rachelint (Contributor, Author):

Totally right.
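The two collision levels confirmed above can be sketched as follows (the layout is illustrative, not the exact DataFusion representation of `GroupIndexView`):

```rust
// 1. Groups with the *same hash value* are chained inside one table entry.
// 2. Different hashes landing on the same slot are left to the hash
//    table's own open-addressing probing (handled by hashbrown).
enum GroupIndexView {
    Single(usize),       // common case: one group for this hash
    Chained(Vec<usize>), // several groups share the exact hash value
}

// All groups that still need an equality check after a hash match.
fn candidate_groups(view: &GroupIndexView) -> &[usize] {
    match view {
        GroupIndexView::Single(g) => std::slice::from_ref(g),
        GroupIndexView::Chained(gs) => gs,
    }
}

fn main() {
    let solo = GroupIndexView::Single(7);
    let shared = GroupIndexView::Chained(vec![3, 9]);
    assert_eq!(candidate_groups(&solo), &[7]);
    assert_eq!(candidate_groups(&shared), &[3, 9]);
}
```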

@alamb (Contributor) commented Nov 3, 2024

> 🤔 I personally prefer the second one? What do you think about it @alamb ?

I think this makes sense -- thank you

@alamb (Contributor) commented Nov 3, 2024

BTW I think this code is fairly well covered by the aggregate fuzz tester (also added by @Rachelint :))

Also, @LeslieKid is adding additional data type coverage which is great: #13226

cargo test --test fuzz -- aggregate

@Rachelint (Contributor, Author):

> 🤔 I personally prefer the second one? What do you think about it @alamb ?
>
> I think this makes sense -- thank you

I have unified VectorizedGroupValuesColumn and GroupValuesColumn in the way mentioned in #12996 (comment)

@alamb (Contributor) commented Nov 4, 2024

This is top of my list to review tomorrow morning

@alamb (Contributor) commented Nov 4, 2024

> This is top of my list to review tomorrow morning

I am sorry -- I am just finding other PRs like #12978 and #13133 very subtle and take a long time to review (aka write tests for / help make sure they are still correct)

@jayzhan211 (Contributor) left a comment:

👍

@alamb (Contributor) commented Nov 5, 2024

I am giving this a final review now

@alamb (Contributor) commented Nov 5, 2024

Performance results:

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.30ms │                 2.33ms │     no change │
│ QQuery 1     │    40.32ms │                40.38ms │     no change │
│ QQuery 2     │    96.98ms │                97.06ms │     no change │
│ QQuery 3     │   106.77ms │               108.36ms │     no change │
│ QQuery 4     │   912.91ms │               923.46ms │     no change │
│ QQuery 5     │   957.04ms │               943.27ms │     no change │
│ QQuery 6     │    36.29ms │                35.65ms │     no change │
│ QQuery 7     │    44.35ms │                44.05ms │     no change │
│ QQuery 8     │  1374.70ms │              1026.74ms │ +1.34x faster │
│ QQuery 9     │  1349.37ms │              1354.92ms │     no change │
│ QQuery 10    │   308.40ms │               287.66ms │ +1.07x faster │
│ QQuery 11    │   358.62ms │               321.64ms │ +1.11x faster │
│ QQuery 12    │  1003.91ms │               981.15ms │     no change │
│ QQuery 13    │  1542.79ms │              1470.92ms │     no change │
│ QQuery 14    │  1076.66ms │               913.23ms │ +1.18x faster │
│ QQuery 15    │  1080.68ms │              1107.04ms │     no change │
│ QQuery 16    │  2434.58ms │              1986.02ms │ +1.23x faster │
│ QQuery 17    │  2243.82ms │              1854.36ms │ +1.21x faster │
│ QQuery 18    │  5145.29ms │              4294.07ms │ +1.20x faster │
│ QQuery 19    │    98.01ms │               100.58ms │     no change │
│ QQuery 20    │  1259.01ms │              1273.34ms │     no change │
│ QQuery 21    │  1524.57ms │              1495.15ms │     no change │
│ QQuery 22    │  2711.65ms │              2661.01ms │     no change │
│ QQuery 23    │  8991.12ms │              8565.66ms │     no change │
│ QQuery 24    │   521.71ms │               515.62ms │     no change │
│ QQuery 25    │   434.70ms │               423.71ms │     no change │
│ QQuery 26    │   594.60ms │               584.15ms │     no change │
│ QQuery 27    │  1884.39ms │              1857.91ms │     no change │
│ QQuery 28    │ 12978.56ms │             13103.89ms │     no change │
│ QQuery 29    │   530.69ms │               538.63ms │     no change │
│ QQuery 30    │  1023.13ms │               897.27ms │ +1.14x faster │
│ QQuery 31    │  1044.11ms │               956.21ms │ +1.09x faster │
│ QQuery 32    │  4300.17ms │              4064.21ms │ +1.06x faster │
│ QQuery 33    │  4063.10ms │              4043.20ms │     no change │
│ QQuery 34    │  4084.94ms │              4073.62ms │     no change │
│ QQuery 35    │  1926.27ms │              1355.58ms │ +1.42x faster │
│ QQuery 36    │   239.44ms │               231.08ms │     no change │
│ QQuery 37    │    96.47ms │                97.42ms │     no change │
│ QQuery 38    │   140.95ms │               142.52ms │     no change │
│ QQuery 39    │   513.28ms │               443.86ms │ +1.16x faster │
│ QQuery 40    │    57.47ms │                55.78ms │     no change │
│ QQuery 41    │    48.74ms │                50.94ms │     no change │
│ QQuery 42    │    62.29ms │                63.98ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 69245.17ms │
│ Total Time (vectorize-append-value)   │ 65387.65ms │
│ Average Time (main_base)              │  1610.35ms │
│ Average Time (vectorize-append-value) │  1520.64ms │
│ Queries Faster                        │         12 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │         31 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2783.88ms │              2821.00ms │     no change │
│ QQuery 1     │   690.58ms │               679.33ms │     no change │
│ QQuery 2     │  1435.85ms │              1364.87ms │     no change │
│ QQuery 3     │   781.53ms │               708.00ms │ +1.10x faster │
│ QQuery 4     │ 12395.12ms │             12441.79ms │     no change │
│ QQuery 5     │ 19443.67ms │             19077.87ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 37530.62ms │
│ Total Time (vectorize-append-value)   │ 37092.86ms │
│ Average Time (main_base)              │  6255.10ms │
│ Average Time (vectorize-append-value) │  6182.14ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  227.03ms │               223.21ms │     no change │
│ QQuery 2     │  117.37ms │               118.32ms │     no change │
│ QQuery 3     │  131.23ms │               112.98ms │ +1.16x faster │
│ QQuery 4     │   80.11ms │                82.46ms │     no change │
│ QQuery 5     │  161.69ms │               157.51ms │     no change │
│ QQuery 6     │   43.57ms │                43.28ms │     no change │
│ QQuery 7     │  208.38ms │               195.13ms │ +1.07x faster │
│ QQuery 8     │  168.85ms │               163.41ms │     no change │
│ QQuery 9     │  246.40ms │               245.05ms │     no change │
│ QQuery 10    │  203.72ms │               205.29ms │     no change │
│ QQuery 11    │   94.72ms │                92.35ms │     no change │
│ QQuery 12    │  100.17ms │               115.46ms │  1.15x slower │
│ QQuery 13    │  212.30ms │               208.76ms │     no change │
│ QQuery 14    │   83.73ms │                70.51ms │ +1.19x faster │
│ QQuery 15    │  104.72ms │               112.26ms │  1.07x slower │
│ QQuery 16    │   72.68ms │                69.40ms │     no change │
│ QQuery 17    │  202.88ms │               208.88ms │     no change │
│ QQuery 18    │  309.89ms │               322.50ms │     no change │
│ QQuery 19    │  121.13ms │               118.12ms │     no change │
│ QQuery 20    │  139.03ms │               122.02ms │ +1.14x faster │
│ QQuery 21    │  260.01ms │               253.21ms │     no change │
│ QQuery 22    │   67.71ms │                67.18ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘

🚀

@alamb (Contributor) left a comment:

👏 @Rachelint @jayzhan211 @2010YOUY01 and @Dandandan. What great teamwork

This PR is really nice in my opinion. It makes a super tricky and performance sensitive part of the code about as clear as I could imagine it to be.

I also ran some code coverage on this

nice cargo llvm-cov --html test --test fuzz -- aggregate
nice cargo llvm-cov --html test -p datafusion-physical-plan -- group_values

And verified that the new code was well covered

@@ -75,55 +148,653 @@ pub struct GroupValuesColumn {
random_state: RandomState,
}

impl GroupValuesColumn {
/// Buffers to store intermediate results in `vectorized_append`
Contributor:

👍

// ========================================================================
// Initialization functions
// ========================================================================

/// Create a new instance of GroupValuesColumn if supported for the specified schema
pub fn try_new(schema: SchemaRef) -> Result<Self> {
let map = RawTable::with_capacity(0);
Contributor:

This with_capacity can probably be improved (as a follow on PR) to avoid some smaller allocations

/// The `group indices` order may differ from their input order, which would lead to errors
/// in `streaming aggregation`.
///
fn scalarized_intern(
Contributor:

this is basically the same as GroupValuesColumn::intern was previously, which makes sense to me

@@ -56,14 +59,40 @@ pub trait GroupColumn: Send + Sync {
///
/// Note that this comparison returns true if both elements are NULL
fn equal_to(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool;

/// Appends the row at `row` in `array` to this builder
fn append_val(&mut self, array: &ArrayRef, row: usize);
Contributor:

Maybe as a follow on we can consider removing append_val and equal_to and simply change all codepaths to use the vectorized version

@Rachelint (Contributor, Author):

I am a bit worried that if we merge them, some extra if/else will be introduced.
That hurts performance a lot for row-level operations.

Contributor:

A good thing to benchmark (as a follow on PR) perhaps

/// it will record the `true` result at the corresponding
/// position in `equal_to_results`.
///
/// And if found nth result in `equal_to_results` is already
Contributor:

this is quite clever to pass in the existing "is equal to results"


(false, _) => {
for &row in rows {
self.group_values.push(arr.value(row));
Contributor:

So beautiful that, if possible, the inner loop just looks like this (a memcopy!)

@Rachelint (Contributor, Author):

😆 I think we can even do more, like checking if rows.len() == array.len(); if so, we just perform extend.

Contributor:

I think we already could use extend instead of push? extend on Vec is somewhat faster than push as the capacity check / allocation is done once instead of once per value.

Contributor:

I think there are several things that could be done to make the append even faster:

  1. extend_from_slice if rows.len() == array.len()
  2. use extend rather than push for values
  3. Speed up appending nulls (don't append bits one by one)
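Points 1 and 2 above can be sketched as a single hypothetical helper (not the PR's code); the whole-batch check is hedged to also verify the rows are in order, since equal lengths alone do not guarantee an identity mapping:

```rust
// Fast paths for appending gathered rows into a value buffer.
fn append_rows(dst: &mut Vec<i64>, src: &[i64], rows: &[usize]) {
    let whole_batch = rows.len() == src.len()
        && rows.iter().enumerate().all(|(i, &r)| i == r);
    if whole_batch {
        // Case 1: the batch covers every row in order -> a contiguous
        // copy, effectively a memcpy.
        dst.extend_from_slice(src);
    } else {
        // Case 2: one `extend` with an exact-size iterator, so the
        // capacity check and reallocation happen once instead of
        // once per `push`.
        dst.extend(rows.iter().map(|&r| src[r]));
    }
}

fn main() {
    let src = vec![1, 2, 3];

    let mut a = Vec::new();
    append_rows(&mut a, &src, &[0, 1, 2]); // whole-batch path
    assert_eq!(a, vec![1, 2, 3]);

    let mut b = vec![0];
    append_rows(&mut b, &src, &[2]); // gather path
    assert_eq!(b, vec![0, 3]);
}
```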

@Rachelint (Contributor, Author) commented Nov 6, 2024

> I think we already could use extend instead of push? extend on Vec is somewhat faster than push as the capacity check / allocation is done once instead of once per value.

Ok, I got it. I thought about it again and found it is indeed simple to do!

@Rachelint (Contributor, Author):

> I think there are several things that could be done to make the append even faster:
>
> 1. `extend_from_slice` `if rows.len() == array.len()`
> 2. use `extend` rather than `push` for values
> 3. Speed up appending nulls (don't append bits one by one)

I filed an issue to track the potential improvements for vectorized operations:
#13275

@@ -287,6 +469,63 @@ where
};
}

fn vectorized_equal_to(
Contributor:

What I have been dreaming about with @XiangpengHao is maybe something like adding take / filter to arrow array builders.

I took this opportunity to write up the idea (finally) for your amusement:

@alamb (Contributor) commented Nov 5, 2024

As my admittedly sparse help for this PR I have filed some additional tickets for follow on work after this PR is merged:

@alamb (Contributor) commented Nov 6, 2024

I don't think we need to wait on this PR anymore, let's merge it in and keep moving forward. Thank you everyone again!

@alamb alamb merged commit 345117b into apache:main Nov 6, 2024
25 checks passed
Labels: common (Related to common crate), core (Core DataFusion crate), physical-expr (Physical Expressions), proto (Related to proto crate), substrait
5 participants