Support vectorized append and compare for multi group by #12996

Merged (63 commits) on Nov 6, 2024

Conversation

@Rachelint (Contributor) commented Oct 18, 2024

Which issue does this PR close?

Closes #.

Related to

Rationale for this change

Although GroupValuesColumn stores the multi group by values in a column-oriented way, it still uses a row-oriented approach to perform append and equal to.

The most obvious overhead is that we need to downcast the array when processing each row; the instructions for the downcast are actually not few, and even worse it introduces branches.
And I guess the row-oriented approach also increases random memory accesses, but I am not sure.

What changes are included in this PR?

This PR introduces vectorized append and vectorized equal to for GroupValuesColumn.

However, such a vectorized approach is not compatible with streaming aggregation, which depends on the order between input rows and their corresponding group indices.

So I define a new VectorizedGroupValuesColumn for optimizing the non-streaming aggregation cases, and keep the original GroupValuesColumn for the streaming aggregation cases.
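To make the trade-off concrete, here is a minimal, self-contained sketch in plain Rust (a hypothetical `AnyColumn` enum standing in for Arrow's `ArrayRef` and its downcast machinery, not DataFusion's actual types) of why a per-row downcast is costly and how a vectorized pass hoists it out of the loop:

```rust
// Hypothetical stand-in for Arrow's dynamically typed arrays.
enum AnyColumn {
    Int64(Vec<i64>),
    #[allow(dead_code)]
    Utf8(Vec<String>),
}

// Row-oriented: the enum match (the "downcast") runs once per row,
// adding instructions and a branch to every iteration.
fn append_rows_scalar(dst: &mut Vec<i64>, col: &AnyColumn, rows: &[usize]) {
    for &row in rows {
        if let AnyColumn::Int64(values) = col {
            dst.push(values[row]);
        }
    }
}

// Vectorized: the match runs once per batch, and the inner loop is a
// tight gather the compiler can optimize freely.
fn append_rows_vectorized(dst: &mut Vec<i64>, col: &AnyColumn, rows: &[usize]) {
    if let AnyColumn::Int64(values) = col {
        dst.extend(rows.iter().map(|&row| values[row]));
    }
}

fn main() {
    let col = AnyColumn::Int64(vec![10, 20, 30, 40]);
    let (mut a, mut b) = (Vec::new(), Vec::new());
    append_rows_scalar(&mut a, &col, &[0, 2]);
    append_rows_vectorized(&mut b, &col, &[0, 2]);
    assert_eq!(a, b);
    assert_eq!(b, vec![10, 30]);
}
```

Both functions produce the same result; the difference is purely in where the type dispatch happens.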

Are these changes tested?

Yes, I believe sufficient new unit tests have been added.

Are there any user-facing changes?

No.

@@ -128,6 +132,15 @@ impl<T: ArrowPrimitiveType, const NULLABLE: bool> GroupColumn
}
}

fn append_non_nullable_val(&mut self, array: &ArrayRef, row: usize) {
if NULLABLE {
self.nulls.append(false);
@Dandandan (Contributor) commented Oct 18, 2024

This could be optimized to append nulls for entire batch instead of per value
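The batch-vs-per-value idea can be sketched as follows, with a plain `Vec<bool>` standing in for Arrow's packed null-bit builder (the `NullMask` type and its methods are illustrative, not the real API):

```rust
// Sketch of appending nulls for an entire batch instead of per value.
struct NullMask {
    bits: Vec<bool>, // true = valid, false = null
}

impl NullMask {
    fn new() -> Self {
        NullMask { bits: Vec::new() }
    }

    // Per-value path: one call, and one capacity check, per row.
    fn append(&mut self, is_valid: bool) {
        self.bits.push(is_valid);
    }

    // Batched path: a single resize covers the whole run of nulls,
    // so the capacity check happens once.
    fn append_n_nulls(&mut self, n: usize) {
        let new_len = self.bits.len() + n;
        self.bits.resize(new_len, false);
    }
}

fn main() {
    let mut per_value = NullMask::new();
    for _ in 0..4 {
        per_value.append(false);
    }

    let mut batched = NullMask::new();
    batched.append_n_nulls(4);

    assert_eq!(per_value.bits, batched.bits);
}
```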

@Rachelint (Contributor, Author):

Yes, I plan to refactor the interface to support taking a `rows: &[usize]` input,
make all parts' appends vectorized, and then measure the performance again.

Contributor:

Cool :)

@Rachelint (Contributor, Author) commented Oct 19, 2024

I added the append_batch function to better support vectorized append.
But the improvement still seems not obvious. #12996 (comment)

🤔 I guess it is likely due to the newly introduced branch in equal_to:

    if *group_idx < group_values_len {
        for (i, group_val) in self.group_values.iter().enumerate() {
            if !check_row_equal(group_val.as_ref(), *group_idx, &cols[i], row) {
                return false;
            }
        }
    } else {
        let row_idx_offset = group_idx - group_values_len;
        let row_idx = self.append_rows_buffer[row_idx_offset];
        return is_rows_eq(cols, row, cols, row_idx).unwrap();
    }

@Rachelint (Contributor, Author) commented Oct 19, 2024

To eliminate this extra branch, I think we need to refactor the intern process mentioned in #12821 (comment)

I am trying it.

@Rachelint (Contributor, Author) commented Oct 19, 2024

The latest benchmark numbers:

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.67ms │                 0.68ms │     no change │
│ QQuery 1     │    67.01ms │                65.25ms │     no change │
│ QQuery 2     │   165.14ms │               157.75ms │     no change │
│ QQuery 3     │   181.43ms │               181.83ms │     no change │
│ QQuery 4     │  1566.65ms │              1574.95ms │     no change │
│ QQuery 5     │  1539.79ms │              1532.81ms │     no change │
│ QQuery 6     │    61.01ms │                57.01ms │ +1.07x faster │
│ QQuery 7     │    77.09ms │                73.02ms │ +1.06x faster │
│ QQuery 8     │  1971.64ms │              1762.88ms │ +1.12x faster │
│ QQuery 9     │  1921.59ms │              1903.47ms │     no change │
│ QQuery 10    │   516.35ms │               499.35ms │     no change │
│ QQuery 11    │   590.99ms │               556.80ms │ +1.06x faster │
│ QQuery 12    │  1814.14ms │              1816.26ms │     no change │
│ QQuery 13    │  2956.07ms │              2954.48ms │     no change │
│ QQuery 14    │  2054.42ms │              1940.82ms │ +1.06x faster │
│ QQuery 15    │  1899.87ms │              1873.73ms │     no change │
│ QQuery 16    │  4066.16ms │              3744.25ms │ +1.09x faster │
│ QQuery 17    │  3629.16ms │              3428.06ms │ +1.06x faster │
│ QQuery 18    │  8282.13ms │              7646.27ms │ +1.08x faster │
│ QQuery 19    │   144.20ms │               146.30ms │     no change │
│ QQuery 20    │  3222.65ms │              3224.85ms │     no change │
│ QQuery 21    │  3924.86ms │              3913.65ms │     no change │
│ QQuery 22    │  9144.86ms │              9022.44ms │     no change │
│ QQuery 23    │ 23875.41ms │             23664.41ms │     no change │
│ QQuery 24    │  1123.53ms │              1132.05ms │     no change │
│ QQuery 25    │  1011.03ms │              1002.87ms │     no change │
│ QQuery 26    │  1326.71ms │              1319.49ms │     no change │
│ QQuery 27    │  4666.49ms │              4662.07ms │     no change │
│ QQuery 28    │ 24069.75ms │             24145.85ms │     no change │
│ QQuery 29    │   902.07ms │               890.73ms │     no change │
│ QQuery 30    │  1813.79ms │              1722.40ms │ +1.05x faster │
│ QQuery 31    │  2008.03ms │              1977.28ms │     no change │
│ QQuery 32    │  7369.56ms │              7601.38ms │     no change │
│ QQuery 33    │  9752.79ms │              9742.50ms │     no change │
│ QQuery 34    │  9716.57ms │              9696.95ms │     no change │
│ QQuery 35    │  2760.71ms │              2244.23ms │ +1.23x faster │
│ QQuery 36    │   255.12ms │               241.01ms │ +1.06x faster │
│ QQuery 37    │   158.70ms │               154.80ms │     no change │
│ QQuery 38    │   155.15ms │               153.09ms │     no change │
│ QQuery 39    │   595.64ms │               587.48ms │     no change │
│ QQuery 40    │    57.09ms │                60.69ms │  1.06x slower │
│ QQuery 41    │    53.32ms │                52.81ms │     no change │
│ QQuery 42    │    65.53ms │                65.13ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘

@Rachelint changed the title from "POC: Vectorize append value" to "POC: Vectorized hashtable for aggregation" on Oct 20, 2024
struct AggregationHashTable<T: AggregationHashTableEntry> {
/// Raw table storing values in a `Vec`
raw_table: Vec<T>,
Contributor:

Based on some experiments in changing hash join algorithm, I think it's likely hashbrown performs much better than implementing a hashtable ourselves although I would like to be surprised 🙂

@Rachelint (Contributor, Author):

> Based on some experiments in changing hash join algorithm, I think it's likely hashbrown performs much better than implementing a hashtable ourselves although I would like to be surprised 🙂

🤔 Even if we can perform something like vectorized compare or vectorized append in our own hashtable?

I found that in the multi group by case, we perform the compare for each row, leading to the array downcasting again and again... and the downcast operation is actually compiled to many asm instructions...

And I found we can't eliminate that and perform the vectorized compare with hashbrown...

    fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {
        let array = array.as_byte_view::<B>();

@Dandandan (Contributor) commented Oct 20, 2024

We can still do "vectorized compare" by doing the lookup in the hashtable (based on hash value only) and the vectorized equality check separately. That way you still can use the fast hashtable, but move the equality check to a separate/vectorized step.
That's at least what is done in the vectorized hash join implementation :). I changed it before to use a Vec-based index like you did here, but that performed significantly worse.

Contributor:

The reason, I think, is that the lookup is incredibly well optimized using the swiss table design and you get fewer "false" candidates to check, while we can still use the vectorized/type-specialized equality check.
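The two-phase probe described above can be sketched like this, with std's `HashMap` standing in for hashbrown's raw table and a single `i64` key column standing in for the group columns (all names and the collision handling here are illustrative, not DataFusion's actual implementation):

```rust
use std::collections::HashMap;

// Phase 1: hash-only lookup. Phase 2: separate, type-specialized
// equality pass over all candidates at once ("vectorized equal to").
fn probe(
    map: &HashMap<u64, usize>, // hash -> candidate group index
    group_values: &[i64],      // stored group keys, column oriented
    hashes: &[u64],            // precomputed hashes of the input rows
    input: &[i64],             // input rows (single key column)
) -> Vec<Option<usize>> {
    // Phase 1: no key comparison, no per-row downcasting.
    let candidates: Vec<Option<usize>> =
        hashes.iter().map(|h| map.get(h).copied()).collect();

    // Phase 2: the column is matched once, then compared in a tight loop.
    candidates
        .iter()
        .zip(input.iter())
        .map(|(cand, v)| match cand {
            Some(g) if group_values[*g] == *v => Some(*g),
            _ => None, // no candidate, or a hash collision: a new group
        })
        .collect()
}

fn main() {
    let group_values = vec![10, 20];
    let mut map = HashMap::new();
    map.insert(100u64, 0usize); // pretend hash(10) == 100
    map.insert(200u64, 1usize); // pretend hash(20) == 200

    let found = probe(&map, &group_values, &[100, 200, 300], &[10, 20, 30]);
    assert_eq!(found, vec![Some(0), Some(1), None]);
}
```

The real code must additionally handle several candidates per hash and append the `None` rows as new groups; this sketch only shows the split between the hash lookup and the equality check.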

@Rachelint (Contributor, Author):

Makes sense, thank you!

@Rachelint (Contributor, Author):

The logic is a bit complex; I plan to finish it and benchmark it today.

@alamb (Contributor) left a comment:

I started going through this code -- I am finding it a really nice read. Nice work @Rachelint @jayzhan211 and @Dandandan

My only real high level concern here is that we have to retain the GroupValuesColumn as well -- not only does this now have more code to maintain, but the number of paths to test / verify is getting larger too

Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?

I plan to keep reviewing this over the weekend.

/// is used to store the rows that will be processed in the next round.
scalarized_indices: Vec<usize>,

/// The `vectorized_equal_to` row indices buffer
Contributor:

Maybe we can rename these to "buffer" or something to make it clear they are temp processing space to avoid re-allocations rather than part of the core state.

Something like

    buffer_equal_to_row_indices: Vec<usize>,

Or maybe we can even put all the scratch space into their own struct to make it clear

struct ScratchSpace {
    vectorized_equal_to_row_indices: Vec<usize>,

    /// The `vectorized_equal_to` group indices buffer
    vectorized_equal_to_group_indices: Vec<usize>,

    /// The `vectorized_equal_to` result buffer
    vectorized_equal_to_results: Vec<bool>,

    /// The `vectorized append` row indices buffer
    vectorized_append_row_indices: Vec<usize>,
}

Or something

@Rachelint (Contributor, Author):

Good idea for readability, I defined VectorizedOperationBuffers to hold such buffers.

groups.clear();
groups.resize(n_rows, usize::MAX);

let mut batch_hashes = mem::take(&mut self.hashes_buffer);
Contributor:

👍

@@ -143,8 +148,12 @@ pub fn new_group_values(schema: SchemaRef) -> Result<Box<dyn GroupValues>> {
}
}

if GroupValuesColumn::supported_schema(schema.as_ref()) {
Ok(Box::new(GroupValuesColumn::try_new(schema)?))
if column::supported_schema(schema.as_ref()) {
Contributor:

Can you explain here why GroupOrdering::None is required? Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?

If that is the case, it seems like maybe emit_n would never be called 🤔

@jayzhan211 (Contributor) commented Nov 1, 2024

> Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?

Yes, because we now process all the rows at once (not one by one like before), some rows are appended beforehand so they are not kept in order.

> If that is the case, it seems like maybe emit_n would never be called

emit_early_if_necessary may be called

@Rachelint (Contributor, Author):

> Can you explain here why GroupOrdering::None is required? Is it because the VectorizedGroupValuesColumn doesn't keep the groups in order?
>
> If that is the case, it seems like maybe emit_n would never be called 🤔

The situation is just as @jayzhan211 mentioned, and the detail about why GroupOrdering::None is needed can also be seen here:
https://github.com/Rachelint/arrow-datafusion/blob/406acb4983efe0c2072c5d7759674eec9db9404a/datafusion/physical-plan/src/aggregates/group_values/column.rs#L792-L834

@Rachelint (Contributor, Author) commented Nov 2, 2024

> Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?

🤔 I think it can be unified simply; VectorizedGroupValuesColumn::scalarized_intern is similar to GroupValuesColumn::intern.
But its logic is much more complex, and I am afraid of a performance regression in streaming aggregation.

The alternative is that we support a dedicated intern in VectorizedGroupValuesColumn which is exactly the same as GroupValuesColumn::intern.
It will not be so hard to do, because GroupValuesColumn::intern can be seen as a simpler version of VectorizedGroupValuesColumn::scalarized_intern.

🤔 I personally prefer the second one. What do you think about it @alamb ?

/// And we use [`GroupIndexView`] to represent such `group indices` in table.
///
///
map: RawTable<(u64, GroupIndexView)>,
Contributor:

Is it the case

  1. If group1 and group2 have exactly the same hash value, GroupIndexView will use chaining to resolve the collision
  2. If group1 and group2 have different hash values but map to the same slot in hash table, hashbrown will handle the collision for you with probing

@Rachelint (Contributor, Author):

Totally right.
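The two collision levels confirmed above can be sketched as follows (the layout is illustrative, not the exact DataFusion representation of `GroupIndexView`):

```rust
// 1. Groups with the *same hash value* are chained inside one table entry.
// 2. Different hashes landing on the same slot are left to the hash
//    table's own open-addressing probing (handled by hashbrown).
enum GroupIndexView {
    Single(usize),       // common case: one group for this hash
    Chained(Vec<usize>), // several groups share the exact hash value
}

// All groups that still need an equality check after a hash match.
fn candidate_groups(view: &GroupIndexView) -> &[usize] {
    match view {
        GroupIndexView::Single(g) => std::slice::from_ref(g),
        GroupIndexView::Chained(gs) => gs,
    }
}

fn main() {
    let solo = GroupIndexView::Single(7);
    let shared = GroupIndexView::Chained(vec![3, 9]);
    assert_eq!(candidate_groups(&solo), &[7]);
    assert_eq!(candidate_groups(&shared), &[3, 9]);
}
```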

@alamb (Contributor) commented Nov 3, 2024

> 🤔 I personally prefer the second one? What do you think about it @alamb ?

I think this makes sense -- thank you

@alamb (Contributor) commented Nov 3, 2024

BTW I think this code is fairly well covered by the aggregate fuzz tester (also added by @Rachelint :))

Also, @LeslieKid is adding additional data type coverage which is great: #13226

cargo test --test fuzz -- aggregate

@Rachelint (Contributor, Author):

> 🤔 I personally prefer the second one? What do you think about it @alamb ?
>
> I think this makes sense -- thank you

I have unified VectorizedGroupValuesColumn and GroupValuesColumn in the way mentioned in #12996 (comment)

@alamb (Contributor) commented Nov 4, 2024

This is top of my list to review tomorrow morning

@alamb (Contributor) commented Nov 4, 2024

> This is top of my list to review tomorrow morning

I am sorry -- I am just finding other PRs like #12978 and #13133 very subtle and take a long time to review (aka write tests for / help make sure they are still correct)

@jayzhan211 (Contributor) left a comment:

👍

@alamb (Contributor) commented Nov 5, 2024

I am giving this a final review now

@alamb (Contributor) commented Nov 5, 2024

Performance results:

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.30ms │                 2.33ms │     no change │
│ QQuery 1     │    40.32ms │                40.38ms │     no change │
│ QQuery 2     │    96.98ms │                97.06ms │     no change │
│ QQuery 3     │   106.77ms │               108.36ms │     no change │
│ QQuery 4     │   912.91ms │               923.46ms │     no change │
│ QQuery 5     │   957.04ms │               943.27ms │     no change │
│ QQuery 6     │    36.29ms │                35.65ms │     no change │
│ QQuery 7     │    44.35ms │                44.05ms │     no change │
│ QQuery 8     │  1374.70ms │              1026.74ms │ +1.34x faster │
│ QQuery 9     │  1349.37ms │              1354.92ms │     no change │
│ QQuery 10    │   308.40ms │               287.66ms │ +1.07x faster │
│ QQuery 11    │   358.62ms │               321.64ms │ +1.11x faster │
│ QQuery 12    │  1003.91ms │               981.15ms │     no change │
│ QQuery 13    │  1542.79ms │              1470.92ms │     no change │
│ QQuery 14    │  1076.66ms │               913.23ms │ +1.18x faster │
│ QQuery 15    │  1080.68ms │              1107.04ms │     no change │
│ QQuery 16    │  2434.58ms │              1986.02ms │ +1.23x faster │
│ QQuery 17    │  2243.82ms │              1854.36ms │ +1.21x faster │
│ QQuery 18    │  5145.29ms │              4294.07ms │ +1.20x faster │
│ QQuery 19    │    98.01ms │               100.58ms │     no change │
│ QQuery 20    │  1259.01ms │              1273.34ms │     no change │
│ QQuery 21    │  1524.57ms │              1495.15ms │     no change │
│ QQuery 22    │  2711.65ms │              2661.01ms │     no change │
│ QQuery 23    │  8991.12ms │              8565.66ms │     no change │
│ QQuery 24    │   521.71ms │               515.62ms │     no change │
│ QQuery 25    │   434.70ms │               423.71ms │     no change │
│ QQuery 26    │   594.60ms │               584.15ms │     no change │
│ QQuery 27    │  1884.39ms │              1857.91ms │     no change │
│ QQuery 28    │ 12978.56ms │             13103.89ms │     no change │
│ QQuery 29    │   530.69ms │               538.63ms │     no change │
│ QQuery 30    │  1023.13ms │               897.27ms │ +1.14x faster │
│ QQuery 31    │  1044.11ms │               956.21ms │ +1.09x faster │
│ QQuery 32    │  4300.17ms │              4064.21ms │ +1.06x faster │
│ QQuery 33    │  4063.10ms │              4043.20ms │     no change │
│ QQuery 34    │  4084.94ms │              4073.62ms │     no change │
│ QQuery 35    │  1926.27ms │              1355.58ms │ +1.42x faster │
│ QQuery 36    │   239.44ms │               231.08ms │     no change │
│ QQuery 37    │    96.47ms │                97.42ms │     no change │
│ QQuery 38    │   140.95ms │               142.52ms │     no change │
│ QQuery 39    │   513.28ms │               443.86ms │ +1.16x faster │
│ QQuery 40    │    57.47ms │                55.78ms │     no change │
│ QQuery 41    │    48.74ms │                50.94ms │     no change │
│ QQuery 42    │    62.29ms │                63.98ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 69245.17ms │
│ Total Time (vectorize-append-value)   │ 65387.65ms │
│ Average Time (main_base)              │  1610.35ms │
│ Average Time (vectorize-append-value) │  1520.64ms │
│ Queries Faster                        │         12 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │         31 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2783.88ms │              2821.00ms │     no change │
│ QQuery 1     │   690.58ms │               679.33ms │     no change │
│ QQuery 2     │  1435.85ms │              1364.87ms │     no change │
│ QQuery 3     │   781.53ms │               708.00ms │ +1.10x faster │
│ QQuery 4     │ 12395.12ms │             12441.79ms │     no change │
│ QQuery 5     │ 19443.67ms │             19077.87ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 37530.62ms │
│ Total Time (vectorize-append-value)   │ 37092.86ms │
│ Average Time (main_base)              │  6255.10ms │
│ Average Time (vectorize-append-value) │  6182.14ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  227.03ms │               223.21ms │     no change │
│ QQuery 2     │  117.37ms │               118.32ms │     no change │
│ QQuery 3     │  131.23ms │               112.98ms │ +1.16x faster │
│ QQuery 4     │   80.11ms │                82.46ms │     no change │
│ QQuery 5     │  161.69ms │               157.51ms │     no change │
│ QQuery 6     │   43.57ms │                43.28ms │     no change │
│ QQuery 7     │  208.38ms │               195.13ms │ +1.07x faster │
│ QQuery 8     │  168.85ms │               163.41ms │     no change │
│ QQuery 9     │  246.40ms │               245.05ms │     no change │
│ QQuery 10    │  203.72ms │               205.29ms │     no change │
│ QQuery 11    │   94.72ms │                92.35ms │     no change │
│ QQuery 12    │  100.17ms │               115.46ms │  1.15x slower │
│ QQuery 13    │  212.30ms │               208.76ms │     no change │
│ QQuery 14    │   83.73ms │                70.51ms │ +1.19x faster │
│ QQuery 15    │  104.72ms │               112.26ms │  1.07x slower │
│ QQuery 16    │   72.68ms │                69.40ms │     no change │
│ QQuery 17    │  202.88ms │               208.88ms │     no change │
│ QQuery 18    │  309.89ms │               322.50ms │     no change │
│ QQuery 19    │  121.13ms │               118.12ms │     no change │
│ QQuery 20    │  139.03ms │               122.02ms │ +1.14x faster │
│ QQuery 21    │  260.01ms │               253.21ms │     no change │
│ QQuery 22    │   67.71ms │                67.18ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘

🚀

@alamb (Contributor) left a comment:

👏 @Rachelint @jayzhan211 @2010YOUY01 and @Dandandan. What great teamwork

This PR is really nice in my opinion. It makes a super tricky and performance sensitive part of the code about as clear as I could imagine it to be.

I also ran some code coverage on this

nice cargo llvm-cov --html test --test fuzz -- aggregate
nice cargo llvm-cov --html test -p datafusion-physical-plan -- group_values

And verified that the new code was well covered

@@ -75,55 +148,653 @@ pub struct GroupValuesColumn {
random_state: RandomState,
}

impl GroupValuesColumn {
/// Buffers to store intermediate results in `vectorized_append`
Contributor:

👍

// ========================================================================
// Initialization functions
// ========================================================================

/// Create a new instance of GroupValuesColumn if supported for the specified schema
pub fn try_new(schema: SchemaRef) -> Result<Self> {
let map = RawTable::with_capacity(0);
Contributor:

This with_capacity can probably be improved (as a follow on PR) to avoid some smaller allocations

/// The `group indices` order may differ from their input order, which would lead to errors
/// in `streaming aggregation`.
///
fn scalarized_intern(
Contributor:

this is basically the same as GroupValuesColumn::intern was previously, which makes sense to me

@@ -56,14 +59,40 @@ pub trait GroupColumn: Send + Sync {
///
/// Note that this comparison returns true if both elements are NULL
fn equal_to(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool;

/// Appends the row at `row` in `array` to this builder
fn append_val(&mut self, array: &ArrayRef, row: usize);
Contributor:

Maybe as a follow on we can consider removing append_val and equal_to and simply change all codepaths to use the vectorized version

@Rachelint (Contributor, Author):

I am a bit worried that if we merge them, some extra if/else will be introduced.
That hurts performance a lot for row-level operations.

Contributor:

A good thing to benchmark (as a follow on PR) perhaps

/// it will record the `true` result at the corresponding
/// position in `equal_to_results`.
///
/// And if found nth result in `equal_to_results` is already
Contributor:

this is quite clever to pass in the existing "is equal to results"


(false, _) => {
for &row in rows {
self.group_values.push(arr.value(row));
Contributor:

So beautiful that, if possible, the inner loop just looks like this (a memcopy!)

@Rachelint (Contributor, Author):

😆 I think we can even do more, like checking if rows.len() == array.len(); if so, we just perform extend.

Contributor:

I think we already could use extend instead of push? extend on Vec is somewhat faster than push as the capacity check / allocation is done once instead of once per value.

Contributor:

I think there are several things that could be done to make the append even faster:

  1. extend_from_slice if rows.len() == array.len()
  2. use extend rather than push for values
  3. Speed up appending nulls (don't append bits one by one)
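Points 1 and 2 above can be sketched as a single hypothetical helper (not the PR's code); the whole-batch check is hedged to also verify the rows are in order, since equal lengths alone do not guarantee an identity mapping:

```rust
// Fast paths for appending gathered rows into a value buffer.
fn append_rows(dst: &mut Vec<i64>, src: &[i64], rows: &[usize]) {
    let whole_batch = rows.len() == src.len()
        && rows.iter().enumerate().all(|(i, &r)| i == r);
    if whole_batch {
        // Case 1: the batch covers every row in order -> a contiguous
        // copy, effectively a memcpy.
        dst.extend_from_slice(src);
    } else {
        // Case 2: one `extend` with an exact-size iterator, so the
        // capacity check and reallocation happen once instead of
        // once per `push`.
        dst.extend(rows.iter().map(|&r| src[r]));
    }
}

fn main() {
    let src = vec![1, 2, 3];

    let mut a = Vec::new();
    append_rows(&mut a, &src, &[0, 1, 2]); // whole-batch path
    assert_eq!(a, vec![1, 2, 3]);

    let mut b = vec![0];
    append_rows(&mut b, &src, &[2]); // gather path
    assert_eq!(b, vec![0, 3]);
}
```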

@Rachelint (Contributor, Author) commented Nov 6, 2024

> I think we already could use extend instead of push? extend on Vec is somewhat faster than push as the capacity check / allocation is done once instead of once per value.

Ok, I got it. I thought about it again and found it is indeed simple to do!

@Rachelint (Contributor, Author):

> I think there are several things that could be done to make the append even faster:
>
> 1. `extend_from_slice` `if rows.len() == array.len()`
> 2. use `extend` rather than `push` for values
> 3. Speed up appending nulls (don't append bits one by one)

I filed an issue to track the potential improvements for vectorized operations:
#13275

@@ -287,6 +469,63 @@ where
};
}

fn vectorized_equal_to(
Contributor:

What I have been dreaming about with @XiangpengHao is maybe something like adding take / filter to arrow array builders.

I took this opportunity to write up the idea (finally) for your amusement:

@alamb (Contributor) commented Nov 5, 2024

As my admittedly sparse help for this PR I have filed some additional tickets for follow on work after this PR is merged:

@alamb (Contributor) commented Nov 6, 2024

I don't think we need to wait on this PR anymore, let's merge it in and keep moving forward. Thank you everyone again!

@alamb alamb merged commit 345117b into apache:main Nov 6, 2024
25 checks passed
Labels: common (Related to common crate), core (Core DataFusion crate), physical-expr (Physical Expressions), proto (Related to proto crate), substrait
5 participants