Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GroupColumn support for StringView / ByteView (faster grouping performance) #12809

Merged
merged 27 commits into from
Oct 16, 2024

Conversation

Rachelint
Copy link
Contributor

@Rachelint Rachelint commented Oct 8, 2024

Which issue does this PR close?

Closes #12771

Rationale for this change

The new column based multi gourp by values impl is proved to be performant, but it is still not supported for byte view column now.
This pr will support this for getting better performance when we enable string view by default.

What changes are included in this PR?

Support new excellent column based multi group values for byte view column.

Are these changes tested?

Yes, test by new unit tests and e2e tests (most of them helped by @alamb )

Are there any user-facing changes?

No.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 11, 2024

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

Thanks @alamb , I am working on implementing the rest main function take_n and build now.
A bit busy recent few days... I will help continue to push this forward from today.

@Rachelint
Copy link
Contributor Author

The rest work is to add tests.
Will finish it today.

@alamb alamb mentioned this pull request Oct 11, 2024
@alamb
Copy link
Contributor

alamb commented Oct 11, 2024

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

@Rachelint
Copy link
Contributor Author

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

Thanks, it helps much!

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 12, 2024
@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 12, 2024

take_n is actually complex, I fixed ByteViewGroupValueBuilder::take_n

It is close to be ready, let's add more unit testcases before.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report:
coverage.zip

In general very nice job with coverage. There are a few items that appear to be untested:

Screenshot 2024-10-13 at 8 46 38 AM Screenshot 2024-10-13 at 8 47 09 AM

I also think it would be great to add some additional testing in fhe form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow


// The `n == len` case, we need to take all
if self.len() == n {
let new_builder = Self::new().with_max_block_size(self.max_block_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

// - Get the last related `buffer index`(let's name it `buffer index n`)
// from last non-inlined `view`
//
// - Take buffers, the key is that we need to know if we need to take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these comments. Very nice 💯

// 6. Take non-inlined + while last buffer in ``in_progress`
// 7. Take all views at once

let mut builder =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍


if let Some(view) = last_non_inlined_view {
let view = ByteView::from(*view);
let last_related_buffer_index = view.buffer_index as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a name like last_remaining_buffer_index might be clearer about what this quantity represents


// Build array and return
let views = ScalarBuffer::from(first_n_views);
Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, I think we should use new_unchecked here as all the data is valid by construction (maybe we could keep the check in debug builds)

.rev()
.find(|view| ((**view) as u32) > 12);

if let Some(view) = last_non_inlined_view {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically, you could reduce the indenting in this function by using a let else, like

        let Some(view) = last_non_inlined_view else {
            let views = ScalarBuffer::from(first_n_views);
            return Arc::new(GenericByteViewArray::<B>::new(
                views,
                Vec::new(),
                null_buffer,
            ))
        }

fn take_buffers_with_partial_last(
&mut self,
last_related_buffer_index: usize,
take_len: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could call this last_take_len or something to note it is the number of bytes being taken from the last buffer

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 13, 2024

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report: coverage.zip
...
I also think it would be great to add some additional testing in fhe form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow

Comments about readability improvement are fixed.

I am adding test for better test coverage.

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 13, 2024

@alamb 👍 Thanks for reminding about the test coverage.

After checking the codes again more carefully, I found some testcases indeed don't cover code paths as I expected.

I have refined the tests for equal_to and take_n, and all related paths are covered according to the report now!

@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

I plan to complete my review today (sorry I was out yesterday)

@alamb alamb changed the title Implement special GroupColumn support for byte view Implement special GroupColumn support for StringView / ByteView Oct 15, 2024
@alamb alamb changed the title Implement special GroupColumn support for StringView / ByteView Implement GroupColumn support for StringView / ByteView (faster grouping performance) Oct 15, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this again @Rachelint -- this is really really neat and very cool (and fast 🚀 )

Thank you for your contributions to helping this project along. I can't wait to see how fast DataFusion 43.0.0 is on ClickBench

@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

I merged up from main and will plan to merge this PR tomorrow unless there is anyone else who would like time to review

FYI @XiangpengHao and @Dandandan and @jayzhan211

@Rachelint
Copy link
Contributor Author

I went through this again @Rachelint -- this is really really neat and very cool (and fast 🚀 )

Thank you for your contributions to helping this project along. I can't wait to see how fast DataFusion 43.0.0 is on ClickBench

Really exciting!

let arr = array.as_byte_view::<B>();

// Null row case, set and return
if arr.is_null(row) {
Copy link
Contributor

@Dandandan Dandandan Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice in the future to avoid those null checks in GroupColumn (even if input is nullable field) for batches containing no nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems really make sense.

And I found even for the batches containing some nulls, actually we have checked which rows are really nulls in create_hashes .
Maybe it is possible that, we reuse this check result?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to pull the null / not null check into the caller of this function 🤔

Copy link
Contributor Author

@Rachelint Rachelint Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I filed an issue about this, and I am trying the straight way about using null_count. #12944
Let's see the performance improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to do the check on the batch (does the batch contain no nulls-> follow the fast path that omits the check), so I think indeed the calling side needs to perform this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly interesting: one of the reasons special casing nulls/no-nulls can be helpful is that it permits better auto vectorization, as we are documenting here: apache/arrow-rs#6554

}
}

fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@Rachelint Rachelint Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can't use it currently... Because it seems only can accept two GenericByteViewArrays as input.

    pub unsafe fn compare_unchecked(
        left: &GenericByteViewArray<T>,
        left_idx: usize,
        right: &GenericByteViewArray<T>,
        right_idx: usize,
    ) -> std::cmp::Ordering {

But in equal_to_inner function, one of the input is ByteViewGroupValueBuilder, and only another is GenericByteViewArray.

Actually I implement equal_to_inner by copying and modifying codes from compare_unchecked.

🤔Maybe we can make compre_unchecked able to accept not only ``GenericByteViewArray`(maybe can by defining a new trait for the inputs), and reuse it rather than copying codes in future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense 👍

🤔Maybe we can make compre_unchecked able to accept not only ``GenericByteViewArray`(maybe can by defining a new trait for the inputs)?
And reuse it rather than copying codes in future?

I agree, probably in a follow up work

@XiangpengHao
Copy link
Contributor

Nice work! looks good to me, left a minor comment

debug_assert!(value_len > 12);
let require_cap = self.in_progress.len() + value_len;

// If current block isn't big enough, flush it and create a new in progress block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be If current block is big enough?

Copy link
Contributor Author

@Rachelint Rachelint Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be If current block is big enough?

Maybe can improve like that If current in_progress block have no enough room to hold the new value?

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done 👍

@alamb
Copy link
Contributor

alamb commented Oct 16, 2024

🚀

@alamb alamb merged commit c311cf5 into apache:main Oct 16, 2024
24 checks passed
@Rachelint Rachelint deleted the impl-byte-view-column branch October 16, 2024 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement special Groups for StringViews
5 participants