Implement `GroupColumn` support for `StringView` / `ByteView` (faster grouping performance) #12809

Rachelint · 2024-10-08T10:59:13Z

Which issue does this PR close?

Rationale for this change

The new column-based multi-column group-by values implementation has proved to be performant, but it does not yet support byte view columns.
This PR adds that support so that we get better performance when string view is enabled by default.

What changes are included in this PR?

Support the new column-based multi group values implementation for byte view columns.

Are these changes tested?

Yes, tested by new unit tests and e2e tests (most of them with help from @alamb).

Are there any user-facing changes?

No.

alamb

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

Rachelint · 2024-10-11T17:36:31Z

Thanks @alamb, I am now working on implementing the remaining main functions, take_n and build.
I have been a bit busy the last few days, but I will keep pushing this forward from today.

Rachelint · 2024-10-11T20:22:38Z

The remaining work is to add tests.
I will finish it today.

alamb · 2024-10-11T23:32:24Z

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

Rachelint · 2024-10-12T05:33:42Z

Thanks, it helps a lot!

Rachelint · 2024-10-12T17:31:45Z

take_n is actually complex; I fixed ByteViewGroupValueBuilder::take_n.

It is close to ready; let's add more unit test cases first.

alamb

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report:
coverage.zip

In general very nice job with coverage. There are a few items that appear to be untested:

I also think it would be great to add some additional testing in the form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow

alamb · 2024-10-13T12:04:40Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+        // The `n == len` case, we need to take all
+        if self.len() == n {
+            let new_builder = Self::new().with_max_block_size(self.max_block_size);


alamb · 2024-10-13T12:05:22Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        //   - Get the last related `buffer index`(let's name it `buffer index n`)
+        //     from last non-inlined `view`
+        //
+        //   - Take buffers, the key is that we need to know if we need to take


Thank you for these comments. Very nice 💯

alamb · 2024-10-13T12:10:52Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        //   6. Take non-inlined + while last buffer in ``in_progress`
+        //   7. Take all views at once
+
+        let mut builder =


alamb · 2024-10-13T12:38:29Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+        if let Some(view) = last_non_inlined_view {
+            let view = ByteView::from(*view);
+            let last_related_buffer_index = view.buffer_index as usize;


I think a name like last_remaining_buffer_index might be clearer about what this quantity represents

alamb · 2024-10-13T12:39:42Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+            // Build array and return
+            let views = ScalarBuffer::from(first_n_views);
+            Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer))


as above, I think we should use new_unchecked here as all the data is valid by construction (maybe we could keep the check in debug builds)
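
As a rough illustration of the "skip validation in release, keep it in debug" idea, here is a minimal sketch. It deliberately uses the standard library's `String` rather than `GenericByteViewArray` so that it stays self-contained; the helper name `string_from_trusted_bytes` is hypothetical and not part of this PR.

```rust
/// Hypothetical helper: build a `String` from bytes the caller already knows
/// are valid UTF-8, validating only in debug builds.
fn string_from_trusted_bytes(bytes: Vec<u8>) -> String {
    // Debug builds still catch a broken invariant; release builds skip the scan.
    debug_assert!(std::str::from_utf8(&bytes).is_ok());
    // SAFETY: the caller guarantees `bytes` is valid UTF-8 by construction.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```

The same shape would apply to an unchecked array constructor: the data is valid by construction, so the full validation only needs to run where assertions are enabled.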

alamb · 2024-10-13T12:40:51Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+            .rev()
+            .find(|view| ((**view) as u32) > 12);
+
+        if let Some(view) = last_non_inlined_view {


Stylistically, you could reduce the indenting in this function by using a let else, like

    let Some(view) = last_non_inlined_view else {
        let views = ScalarBuffer::from(first_n_views);
        return Arc::new(GenericByteViewArray::<B>::new(
            views,
            Vec::new(),
            null_buffer,
        ))
    };
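
For readers unfamiliar with why `(view as u32) > 12` identifies a non-inlined view, the sketch below models the Arrow view layout with plain `u128` values: the low 32 bits hold the length, and values of at most 12 bytes are stored inside the view itself. The helper names (`make_inline_view`, `make_buffer_view`, `last_non_inlined`) are hypothetical and only exist for this illustration; the PR itself works on Arrow's own types.

```rust
/// Values of at most this many bytes are stored inline in the view.
const MAX_INLINE_VIEW_LEN: u32 = 12;

/// The length lives in the low 32 bits of the 128-bit view.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// True if the value bytes are stored inside the view itself.
fn is_inlined(view: u128) -> bool {
    view_len(view) <= MAX_INLINE_VIEW_LEN
}

/// Hypothetical helper: build an inlined view (length + up to 12 data bytes).
fn make_inline_view(value: &[u8]) -> u128 {
    assert!(value.len() <= MAX_INLINE_VIEW_LEN as usize);
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    bytes[4..4 + value.len()].copy_from_slice(value);
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: build a non-inlined view
/// (length, 4-byte prefix, buffer index, offset).
fn make_buffer_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    assert!(value.len() > MAX_INLINE_VIEW_LEN as usize);
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    bytes[4..8].copy_from_slice(&value[0..4]);
    bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(bytes)
}

/// Buffer index of a non-inlined view (bits 64..96).
fn buffer_index(view: u128) -> u32 {
    (view >> 64) as u32
}

/// Find the last non-inlined view among the first `n` views,
/// mirroring the `.rev().find(..)` in the reviewed snippet.
fn last_non_inlined(views: &[u128], n: usize) -> Option<&u128> {
    views[..n].iter().rev().find(|view| !is_inlined(**view))
}
```

When `last_non_inlined` returns `None`, every taken value is inlined, so no data buffers need to be taken at all; that is the early-return case the `let else` suggestion handles.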

alamb · 2024-10-13T12:42:11Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+    fn take_buffers_with_partial_last(
+        &mut self,
+        last_related_buffer_index: usize,
+        take_len: usize,


maybe we could call this last_take_len or something to note it is the number of bytes being taken from the last buffer

Do not re-validate output is utf8

Rachelint · 2024-10-13T13:50:16Z

The comments about readability improvements have been addressed.

I am adding tests for better test coverage.

Rachelint · 2024-10-13T17:57:10Z

@alamb 👍 Thanks for the reminder about test coverage.

After checking the code again more carefully, I found that some test cases indeed did not cover the code paths I expected.

I have refined the tests for equal_to and take_n, and all related paths are now covered according to the report!

alamb · 2024-10-15T10:42:55Z

I plan to complete my review today (sorry I was out yesterday)

alamb

I went through this again @Rachelint -- this is really really neat and very cool (and fast 🚀 )

Thank you for your contributions to helping this project along. I can't wait to see how fast DataFusion 43.0.0 is on ClickBench

alamb · 2024-10-15T17:23:46Z

I merged up from main and will plan to merge this PR tomorrow unless there is anyone else who would like time to review

FYI @XiangpengHao and @Dandandan and @jayzhan211

Rachelint · 2024-10-15T17:44:48Z

Really exciting!

Dandandan · 2024-10-15T18:00:04Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        let arr = array.as_byte_view::<B>();
+
+        // Null row case, set and return
+        if arr.is_null(row) {


It would be nice in the future to avoid those null checks in GroupColumn (even if input is nullable field) for batches containing no nulls.

That really makes sense.

And I found that even for batches containing some nulls, we have actually already checked which rows are null in create_hashes.
Maybe we could reuse the result of that check?

It might make sense to pull the null / not null check into the caller of this function 🤔

🤔 I filed an issue about this, and I am trying the straightforward approach of using null_count: #12944
Let's see the performance improvement.

Yeah, we need to do the check on the batch (does the batch contain no nulls-> follow the fast path that omits the check), so I think indeed the calling side needs to perform this.

possibly interesting: one of the reasons special casing nulls/no-nulls can be helpful is that it permits better auto vectorization, as we are documenting here: apache/arrow-rs#6554
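
The idea in this thread (decide once per batch whether nulls exist, then run a loop with no per-row null check) can be sketched as follows. This is only an illustration under simplifying assumptions: `Column` is a hypothetical stand-in for an Arrow array, with a `Vec<bool>` validity mask instead of a real null buffer, and `append_batch` is not a function from this PR.

```rust
/// Hypothetical stand-in for an Arrow array: values plus an optional validity mask.
struct Column {
    values: Vec<u64>,
    /// `None` means the batch carries no null information at all;
    /// `Some(mask)` marks valid rows with `true`.
    validity: Option<Vec<bool>>,
}

impl Column {
    /// Number of null rows in this batch.
    fn null_count(&self) -> usize {
        match &self.validity {
            None => 0,
            Some(mask) => mask.iter().filter(|valid| !**valid).count(),
        }
    }
}

/// Append all rows into `out`, choosing the code path once per batch
/// instead of checking for null on every row.
fn append_batch(column: &Column, out: &mut Vec<Option<u64>>) {
    match &column.validity {
        // Slow path: the batch really contains nulls, so check each row.
        Some(mask) if column.null_count() > 0 => {
            for (value, valid) in column.values.iter().zip(mask) {
                out.push(if *valid { Some(*value) } else { None });
            }
        }
        // Fast path: no nulls in this batch, so the loop has no branch on validity.
        _ => out.extend(column.values.iter().map(|v| Some(*v))),
    }
}
```

Keeping the null-free loop branchless is also what makes it a better candidate for auto-vectorization, as mentioned above.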

XiangpengHao · 2024-10-15T18:52:15Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        }
+    }
+
+    fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {


Should we use GenericByteViewArray::compare_unchecked here?

Related usage: https://github.com/apache/arrow-rs/blob/master/arrow-ord/src/cmp.rs#L568

It seems we can't use it currently, because it only accepts two GenericByteViewArrays as input:

    pub unsafe fn compare_unchecked(
        left: &GenericByteViewArray<T>,
        left_idx: usize,
        right: &GenericByteViewArray<T>,
        right_idx: usize,
    ) -> std::cmp::Ordering {

But in the equal_to_inner function, one of the inputs is a ByteViewGroupValueBuilder, and only the other is a GenericByteViewArray.

Actually, I implemented equal_to_inner by copying and modifying the code from compare_unchecked.

🤔 Maybe in the future we can make compare_unchecked accept more than just GenericByteViewArray (perhaps by defining a new trait for the inputs), and reuse it rather than copying the code?

I see, makes sense 👍

I agree, probably in a follow up work
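
One possible shape for the follow-up discussed here is a small trait implemented by both the builder and the array, with a single generic comparison function on top. The sketch below is purely hypothetical: `ByteViewSource`, `Builder`, `Array`, and `compare_rows` are invented for illustration and do not exist in arrow-rs or DataFusion, and the real comparison would work on views and prefixes rather than full byte slices.

```rust
use std::cmp::Ordering;

/// Hypothetical trait: anything that can hand out the bytes of a given row.
trait ByteViewSource {
    fn value(&self, row: usize) -> &[u8];
}

/// Stand-in for the group values builder (one owned buffer per value).
struct Builder {
    values: Vec<Vec<u8>>,
}

/// Stand-in for an input array (one shared buffer plus offsets).
struct Array {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl ByteViewSource for Builder {
    fn value(&self, row: usize) -> &[u8] {
        &self.values[row]
    }
}

impl ByteViewSource for Array {
    fn value(&self, row: usize) -> &[u8] {
        &self.data[self.offsets[row]..self.offsets[row + 1]]
    }
}

/// One comparison routine shared by every combination of inputs.
fn compare_rows<L: ByteViewSource, R: ByteViewSource>(
    left: &L,
    left_row: usize,
    right: &R,
    right_row: usize,
) -> Ordering {
    left.value(left_row).cmp(right.value(right_row))
}
```

With such a trait, the builder-versus-array comparison needed by equal_to_inner and the array-versus-array comparison could share one code path instead of two copies.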

XiangpengHao · 2024-10-15T18:55:33Z

Nice work! looks good to me, left a minor comment

jayzhan211 · 2024-10-16T13:40:16Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        debug_assert!(value_len > 12);
+        let require_cap = self.in_progress.len() + value_len;
+
+        // If current block isn't big enough, flush it and create a new in progress block


Should this be "If current block is big enough"?

Maybe we can improve it like this: "If the current in_progress block does not have enough room to hold the new value"?
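
To make the intent of that code comment concrete, here is a minimal sketch of the "flush when the new value does not fit" logic. It assumes a simplified builder with plain `Vec<u8>` blocks; `BlockBuilder` and its `append` method are hypothetical, and the extra `is_empty` guard is a choice of this sketch rather than something taken from the PR.

```rust
/// Hypothetical simplified builder: values are packed into blocks of bounded size.
struct BlockBuilder {
    max_block_size: usize,
    in_progress: Vec<u8>,
    completed: Vec<Vec<u8>>,
}

impl BlockBuilder {
    fn new(max_block_size: usize) -> Self {
        Self {
            max_block_size,
            in_progress: Vec::with_capacity(max_block_size),
            completed: Vec::new(),
        }
    }

    /// Append `value` and return `(buffer_index, offset)` of where it was stored.
    fn append(&mut self, value: &[u8]) -> (usize, usize) {
        let require_cap = self.in_progress.len() + value.len();

        // If the in-progress block does not have enough room to hold the new
        // value, flush it and start a new in-progress block.
        // (Sketch-only guard: never flush an empty block.)
        if require_cap > self.max_block_size && !self.in_progress.is_empty() {
            let fresh = Vec::with_capacity(self.max_block_size);
            let flushed = std::mem::replace(&mut self.in_progress, fresh);
            self.completed.push(flushed);
        }

        // The in-progress block will become buffer number `completed.len()`.
        let offset = self.in_progress.len();
        self.in_progress.extend_from_slice(value);
        (self.completed.len(), offset)
    }
}
```

The returned pair corresponds to the buffer index and offset that a non-inlined view would record for the value.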

jayzhan211

Well done 👍

alamb · 2024-10-16T14:08:10Z

🚀

Rachelint added 2 commits October 8, 2024 14:35

define ByteGroupValueViewBuilder.

ca033e0

impl append.

ffcc1a2

github-actions bot added the physical-expr Physical Expressions label Oct 8, 2024

alamb mentioned this pull request Oct 9, 2024

Performance: Add "read strings as binary" option for parquet #12788

Closed

impl equal to.

4842965

Rachelint force-pushed the impl-byte-view-column branch from ac96b5d to 4842965 Compare October 9, 2024 17:49

Rachelint added 2 commits October 10, 2024 01:52

fix compile.

66bb7be

fix comments.

ef1efce

alamb mentioned this pull request Oct 10, 2024

Enable reading StringView by default from Parquet (schema_force_string_view) by default #11682

Closed

alamb reviewed Oct 11, 2024

Rachelint added 4 commits October 12, 2024 03:40

impl take_n.

152a8b1

impl build.

d61c3ec

impl rest functions in GroupColumn.

151377e

fix output when panic.

63e11cb

alamb mentioned this pull request Oct 11, 2024

Impl byte view column #12883

Closed

add e2e sql tests.

15d8349

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 12, 2024

Rachelint added 4 commits October 12, 2024 19:15

add unit tests.

d9ee724

switch to a really elegant style codes from alamb.

beffa35

fix take_n.

46822f9

improve comments.

3a93584

Rachelint added 2 commits October 13, 2024 01:40

fix compile.

f99f55c

fix clippy.

37b4816

Rachelint force-pushed the impl-byte-view-column branch from 5221185 to ab4c198 Compare October 13, 2024 09:08

define more testcases in test_byte_view_take_n.

d78c68d

alamb reviewed Oct 13, 2024

Rachelint and others added 3 commits October 13, 2024 21:11

Merge pull request #1 from alamb/alamb/tweak-group

f76c376

Do not re-validate output is utf8

switch to unchecked when building array.

1fd926f

improve naming.

34918cb

use let else to make the codes clearer.

8348024

Rachelint force-pushed the impl-byte-view-column branch from 5fed4eb to 8348024 Compare October 13, 2024 13:54

Rachelint added 2 commits October 13, 2024 21:59

fix typo.

023ed64

improve unit test coverage for ByteViewGroupValueBuilder.

c4d45c7

Rachelint force-pushed the impl-byte-view-column branch from 68b0eba to c4d45c7 Compare October 13, 2024 17:49

alamb mentioned this pull request Oct 14, 2024

Introduce binary_as_string parquet option, upgrade to arrow/parquet 53.2.0 #12816

Merged

alamb changed the title ~~Implement special GroupColumn support for byte view~~ Implement special GroupColumn support for StringView / ByteView Oct 15, 2024

alamb changed the title ~~Implement special GroupColumn support for StringView / ByteView~~ Implement GroupColumn support for StringView / ByteView (faster grouping performance) Oct 15, 2024

alamb approved these changes Oct 15, 2024

Merge remote-tracking branch 'apache/main' into impl-byte-view-column

a1f8d0c

Dandandan reviewed Oct 15, 2024

Rachelint mentioned this pull request Oct 15, 2024

Remove unnecessary null checks in GroupColumns #12944

Open

Dandandan approved these changes Oct 15, 2024

XiangpengHao reviewed Oct 15, 2024

jayzhan211 reviewed Oct 16, 2024

jayzhan211 approved these changes Oct 16, 2024

alamb merged commit c311cf5 into apache:main Oct 16, 2024
24 checks passed

Rachelint deleted the impl-byte-view-column branch October 16, 2024 16:26

alamb mentioned this pull request Nov 5, 2024

Implement Specialized GroupColumn for Date/Time/Timestamp types for multi-column GROUP BY #13263

Open
