Implement GroupColumn support for StringView / ByteView (faster grouping performance) #12809
This looks great -- I am going to try and hook it up and write a few tests
Thanks @Rachelint
Thanks @alamb, I am working on implementing the remaining main functions.
The remaining work is to add tests.
Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883 Maybe you can use / repurpose the tests. I'll try and find time to review this weekend, but I may not have as much time as normal
Thanks, it helps a lot!
It is close to being ready; let's add more unit test cases first.
Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.
cc @jayzhan211 your GroupColumn pattern is really working well
There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.
I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092
I also ran test coverage of this PR like this:
cargo llvm-cov --html -p datafusion-physical-plan --lib
Here is the report:
coverage.zip
In general very nice job with coverage. There are a few items that appear to be untested:
I also think it would be great to add some additional testing in the form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow.
// The `n == len` case, we need to take all
if self.len() == n {
    let new_builder = Self::new().with_max_block_size(self.max_block_size);
💯
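For readers following the `n == len` branch above, here is a minimal sketch of the "take everything" pattern: swap the whole builder for a fresh one that keeps the same block size, then return the old contents. `SketchBuilder`, its `values` field, and the default block size are hypothetical stand-ins, not the PR's actual types.

```rust
/// Hypothetical, heavily simplified builder used only for illustration.
struct SketchBuilder {
    max_block_size: usize,
    values: Vec<String>,
}

impl SketchBuilder {
    fn new() -> Self {
        // Assumed default; the real builder's default may differ.
        Self { max_block_size: 2 * 1024 * 1024, values: Vec::new() }
    }

    fn with_max_block_size(mut self, max_block_size: usize) -> Self {
        self.max_block_size = max_block_size;
        self
    }

    fn len(&self) -> usize {
        self.values.len()
    }

    /// Remove and return the first `n` values (panics if `n > len`).
    fn take_n(&mut self, n: usize) -> Vec<String> {
        // The `n == len` case: replace `self` with an empty builder that
        // keeps the configured block size, and hand back everything.
        if self.len() == n {
            let new_builder = Self::new().with_max_block_size(self.max_block_size);
            let current = std::mem::replace(self, new_builder);
            return current.values;
        }
        // Otherwise keep the tail and return the head.
        let rest = self.values.split_off(n);
        std::mem::replace(&mut self.values, rest)
    }
}
```

The swap avoids touching individual values in the common "emit all groups" case, which is presumably why the PR handles it separately.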
// - Get the last related `buffer index`(let's name it `buffer index n`)
//   from last non-inlined `view`
//
// - Take buffers, the key is that we need to know if we need to take
Thank you for these comments. Very nice 💯
// 6. Take non-inlined + while last buffer in `in_progress`
// 7. Take all views at once

let mut builder =
😍

if let Some(view) = last_non_inlined_view {
    let view = ByteView::from(*view);
    let last_related_buffer_index = view.buffer_index as usize;
I think a name like last_remaining_buffer_index might be clearer about what this quantity represents
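As background for the `buffer_index` lookup, here is a rough sketch of how a 16-byte view decodes from its `u128` form under the Arrow view layout (length in the low 32 bits, then prefix, buffer index, and offset). The names `DecodedView`, `is_inlined`, `decode_non_inlined`, and `make_view` are illustrative helpers, not arrow-rs APIs.

```rust
/// Fields of a non-inlined view, decoded from its `u128` representation.
#[derive(Debug, PartialEq)]
struct DecodedView {
    length: u32,
    prefix: u32,
    buffer_index: u32,
    offset: u32,
}

/// Values of at most 12 bytes are stored inline in the view itself.
const MAX_INLINE_LEN: u32 = 12;

/// The low 32 bits of a view always hold the value length.
fn is_inlined(view: u128) -> bool {
    (view as u32) <= MAX_INLINE_LEN
}

/// Decode a view that points into a data buffer; `None` for inlined views.
fn decode_non_inlined(view: u128) -> Option<DecodedView> {
    if is_inlined(view) {
        return None;
    }
    Some(DecodedView {
        length: view as u32,
        prefix: (view >> 32) as u32,
        buffer_index: (view >> 64) as u32,
        offset: (view >> 96) as u32,
    })
}

/// Helper for building test views (illustration only).
fn make_view(length: u32, prefix: u32, buffer_index: u32, offset: u32) -> u128 {
    (length as u128)
        | ((prefix as u128) << 32)
        | ((buffer_index as u128) << 64)
        | ((offset as u128) << 96)
}
```

Only non-inlined views carry a meaningful buffer index, which is why the PR searches for the last non-inlined view before deciding which buffers to take.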

// Build array and return
let views = ScalarBuffer::from(first_n_views);
Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer))
as above, I think we should use new_unchecked here as all the data is valid by construction (maybe we could keep the check in debug builds)
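The "skip validation in release, keep it in debug" idea can be illustrated with a standard-library analog. This sketch uses `String::from_utf8_unchecked` rather than the arrow-rs constructor, and `bytes_to_string_trusted` is a hypothetical name; it only shows the shape of the pattern, not the PR's code.

```rust
/// Convert bytes to a `String` without re-validating UTF-8 in release builds.
///
/// The caller must guarantee `bytes` is valid UTF-8, for example because it
/// was copied from values that were already validated on input.
fn bytes_to_string_trusted(bytes: Vec<u8>) -> String {
    // Debug builds still pay for the check, so mistakes surface in tests.
    debug_assert!(
        std::str::from_utf8(&bytes).is_ok(),
        "bytes must be valid UTF-8 by construction"
    );
    // SAFETY: validity is guaranteed by the caller (see the doc comment).
    unsafe { String::from_utf8_unchecked(bytes) }
}
```

The trade-off is the usual one for unchecked constructors: the invariant moves from a runtime check to a documented contract, so the debug assertion is the safety net.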
.rev()
.find(|view| ((**view) as u32) > 12);

if let Some(view) = last_non_inlined_view {
Stylistically, you could reduce the indentation in this function by using a let else, like
    let Some(view) = last_non_inlined_view else {
        let views = ScalarBuffer::from(first_n_views);
        return Arc::new(GenericByteViewArray::<B>::new(
            views,
            Vec::new(),
            null_buffer,
        ));
    };
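A self-contained sketch of the same `let else` shape (available since Rust 1.65), combined with the reverse search for the last non-inlined view. The function name `buffers_to_keep` and its return value are assumptions made for illustration; the real function builds an array rather than returning a count.

```rust
/// Return how many data buffers the first `n` views can reference.
///
/// Simplified assumption: buffers are referenced in increasing order, so the
/// last non-inlined view determines the highest buffer index in use.
fn buffers_to_keep(first_n_views: &[u128]) -> usize {
    // Scan from the back for a view whose length (low 32 bits) exceeds 12.
    let last_non_inlined_view = first_n_views
        .iter()
        .rev()
        .find(|view| ((**view) as u32) > 12);

    // Early return keeps the main path unindented.
    let Some(view) = last_non_inlined_view else {
        // Every view is inlined, so no data buffer is referenced at all.
        return 0;
    };

    // Bits 64..96 of a non-inlined view hold the buffer index.
    let buffer_index = (*view >> 64) as u32 as usize;
    buffer_index + 1
}
```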
fn take_buffers_with_partial_last(
    &mut self,
    last_related_buffer_index: usize,
    take_len: usize,
maybe we could call this last_take_len or something to note it is the number of bytes being taken from the last buffer
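To make the two parameters concrete, here is a hedged sketch of what "take buffers with a partial last buffer" could look like over plain `Vec<u8>` buffers. It assumes all buffers live in one vector and that the partially taken buffer is copied while the original stays behind for the remaining rows; the PR's actual handling of `completed` versus `in_progress` buffers is more involved.

```rust
/// Take buffers `0..last_remaining_buffer_index` whole, plus the first
/// `last_take_len` bytes of the buffer at `last_remaining_buffer_index`.
///
/// Assumes `last_remaining_buffer_index < buffers.len()` and
/// `last_take_len` does not exceed that buffer's length.
fn take_buffers_with_partial_last(
    buffers: &mut Vec<Vec<u8>>,
    last_remaining_buffer_index: usize,
    last_take_len: usize,
) -> Vec<Vec<u8>> {
    // Fully consumed buffers are moved out without copying their bytes.
    let mut taken: Vec<Vec<u8>> =
        buffers.drain(..last_remaining_buffer_index).collect();

    // After the drain, the partially consumed buffer sits at index 0.
    // Copy only the prefix that the taken rows reference; the original
    // buffer stays in place for the rows that remain.
    let partial = buffers[0][..last_take_len].to_vec();
    taken.push(partial);
    taken
}
```

Because earlier buffers are removed, any views that remain would need their buffer indices shifted down by `last_remaining_buffer_index`; that bookkeeping is omitted here.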
Do not re-validate output is utf8
The readability comments have been addressed. I am adding tests for better coverage.
@alamb 👍 Thanks for the reminder about test coverage. After checking the code again more carefully, I found that some test cases indeed did not cover the code paths I expected. I have refined the tests for
I plan to complete my review today (sorry I was out yesterday)
I went through this again @Rachelint -- this is really really neat and very cool (and fast 🚀 )
Thank you for your contributions to helping this project along. I can't wait to see how fast DataFusion 43.0.0 is on ClickBench
I merged up from main and will plan to merge this PR tomorrow unless there is anyone else who would like time to review FYI @XiangpengHao and @Dandandan and @jayzhan211
Really exciting!
let arr = array.as_byte_view::<B>();

// Null row case, set and return
if arr.is_null(row) {
It would be nice in the future to avoid those null checks in GroupColumn (even if the input is a nullable field) for batches containing no nulls.
That really makes sense.
I also found that, even for batches containing some nulls, we have already checked which rows are null in create_hashes.
Maybe we could reuse the result of that check?
It might make sense to pull the null / not null check into the caller of this function 🤔
🤔 I filed an issue about this (#12944), and I am trying the straightforward approach of using null_count.
Let's see the performance improvement.
Yeah, we need to do the check on the batch (if the batch contains no nulls, follow the fast path that omits the per-row check), so I think the calling side indeed needs to perform this.
possibly interesting: one of the reasons special casing nulls/no-nulls can be helpful is that it permits better auto vectorization, as we are documenting here: apache/arrow-rs#6554
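The batch-level check discussed in this thread can be sketched roughly as follows. `Column`, its `validity` field, and `append_all` are hypothetical simplifications (Arrow uses a packed null bitmap, not `Vec<bool>`); the point is only that the caller inspects the null count once and then runs a loop with no per-row branch.

```rust
/// Hypothetical, simplified column: values plus an optional validity mask.
struct Column {
    values: Vec<u64>,
    /// `None` means "no nulls"; otherwise `validity[i] == false` marks a null.
    validity: Option<Vec<bool>>,
}

impl Column {
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map_or(0, |v| v.iter().filter(|valid| !**valid).count())
    }
}

/// Append every row of `col` to `dst`, checking for nulls once per batch.
fn append_all(dst: &mut Vec<Option<u64>>, col: &Column) {
    match &col.validity {
        // Slow path: at least one null, so each row must be checked.
        Some(validity) if col.null_count() > 0 => {
            for (value, valid) in col.values.iter().zip(validity) {
                dst.push(if *valid { Some(*value) } else { None });
            }
        }
        // Fast path: no nulls in this batch, so the loop has no branch
        // on validity and is easier for the compiler to vectorize.
        _ => dst.extend(col.values.iter().copied().map(Some)),
    }
}
```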
    }
}

fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {
Should we use GenericByteViewArray::compare_unchecked here?
Related usage: https://github.com/apache/arrow-rs/blob/master/arrow-ord/src/cmp.rs#L568
It seems we can't use it currently, because it only accepts two GenericByteViewArrays as input:
    pub unsafe fn compare_unchecked(
        left: &GenericByteViewArray<T>,
        left_idx: usize,
        right: &GenericByteViewArray<T>,
        right_idx: usize,
    ) -> std::cmp::Ordering {
But in the equal_to_inner function, one of the inputs is a ByteViewGroupValueBuilder, and only the other is a GenericByteViewArray.
I actually implemented equal_to_inner by copying and modifying the code from compare_unchecked.
🤔 Maybe we can make compare_unchecked accept more than just GenericByteViewArray (perhaps by defining a new trait for the inputs), and reuse it rather than copying code in the future?
I see, makes sense 👍
🤔 Maybe we can make compare_unchecked accept more than just GenericByteViewArray (perhaps by defining a new trait for the inputs), and reuse it rather than copying code in the future?
I agree, probably in follow-up work
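One possible shape for the trait idea floated above, purely as a sketch: a small trait that exposes a value's bytes by index, implemented for both an array-like type and a builder-like type, so a single comparison routine serves both. `ByteValues`, `ArrayLike`, `BuilderLike`, and `values_equal` are all hypothetical names; nothing here is an existing arrow-rs or DataFusion API.

```rust
/// Hypothetical abstraction over "something that stores byte values by index".
trait ByteValues {
    fn value_bytes(&self, idx: usize) -> &[u8];
}

/// Stand-in for an immutable array of values.
struct ArrayLike {
    values: Vec<Vec<u8>>,
}

/// Stand-in for a builder whose values are split across two stores.
struct BuilderLike {
    completed: Vec<Vec<u8>>,
    in_progress: Vec<Vec<u8>>,
}

impl ByteValues for ArrayLike {
    fn value_bytes(&self, idx: usize) -> &[u8] {
        self.values[idx].as_slice()
    }
}

impl ByteValues for BuilderLike {
    fn value_bytes(&self, idx: usize) -> &[u8] {
        if idx < self.completed.len() {
            self.completed[idx].as_slice()
        } else {
            self.in_progress[idx - self.completed.len()].as_slice()
        }
    }
}

/// Equality check written once against the trait.
fn values_equal<L: ByteValues, R: ByteValues>(
    left: &L,
    left_idx: usize,
    right: &R,
    right_idx: usize,
) -> bool {
    let (l, r) = (left.value_bytes(left_idx), right.value_bytes(right_idx));
    // Cheap checks first, mirroring how views compare: length, then a
    // prefix of up to 4 bytes, then the remaining bytes.
    if l.len() != r.len() {
        return false;
    }
    let prefix_len = l.len().min(4);
    l[..prefix_len] == r[..prefix_len] && l[prefix_len..] == r[prefix_len..]
}
```

A real version would likely work on views directly so that the length and prefix checks never touch the data buffers.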
Nice work! Looks good to me, left a minor comment
debug_assert!(value_len > 12);
let require_cap = self.in_progress.len() + value_len;

// If current block isn't big enough, flush it and create a new in progress block
Should this be "If current block is big enough"?
Should this be "If current block is big enough"?
Maybe it can be improved to something like "If the current in_progress block does not have enough room to hold the new value"?
Well done 👍
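The flush condition being discussed can be sketched as below: when the in-progress block cannot hold the new value, it is moved to the completed list and a fresh block is started. `BlockBuilder` and its fields are hypothetical, and the extra `is_empty` guard (to avoid flushing an empty block for an oversized value) is an assumption of this sketch, not something the PR is known to do.

```rust
/// Hypothetical block-based byte store, for illustration only.
struct BlockBuilder {
    max_block_size: usize,
    completed: Vec<Vec<u8>>,
    in_progress: Vec<u8>,
}

impl BlockBuilder {
    fn new(max_block_size: usize) -> Self {
        Self {
            max_block_size,
            completed: Vec::new(),
            in_progress: Vec::with_capacity(max_block_size),
        }
    }

    /// Append `value` and return its `(buffer_index, offset)` location.
    fn append(&mut self, value: &[u8]) -> (usize, usize) {
        let require_cap = self.in_progress.len() + value.len();

        // If the in-progress block has no room for the new value, flush it
        // and start a new in-progress block.
        if require_cap > self.max_block_size && !self.in_progress.is_empty() {
            let flushed = std::mem::replace(
                &mut self.in_progress,
                Vec::with_capacity(self.max_block_size),
            );
            self.completed.push(flushed);
        }

        // The in-progress block's index is one past the completed blocks.
        let buffer_index = self.completed.len();
        let offset = self.in_progress.len();
        self.in_progress.extend_from_slice(value);
        (buffer_index, offset)
    }
}
```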
🚀
Which issue does this PR close?
Closes #12771
Rationale for this change
The new column-based multi group-by values implementation has proved to be performant, but it does not yet support byte view columns.
This PR adds that support so we get better performance when string view is enabled by default.
What changes are included in this PR?
Support for byte view columns in the new column-based multi group values implementation.
Are these changes tested?
Yes, tested by new unit tests and e2e tests (most of them with help from @alamb)
Are there any user-facing changes?
No.