-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33484: [C++][Compute] Implement Grouper::Reset
#41352
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
Grouper::Reset
cc @westonpace |
|
rows_minibatch_.Clean(); | ||
map_.cleanup(); | ||
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool())); | ||
return Status::OK(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should enlarge the size of the temp_stack_
? Or the stack's internal states like top_
may no enough for the next bunch of operations after grouper reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all the temp vectors allocated by the temp_stack_
would be released at the end the of the call to each individual grouper member function (IMO that's why they are designed to be "temp vector"). Meaning by the time of Reset
, the stack is empty already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, i see, thanks.
TempVectorStack
will hold a resizable buffer allocated in TempVectorStack::Init
function until the end of the grouper object.
But the TempVectorHolder
will release the used space (already allocated in TempVectorStack ) at the call to each of individual grouper member function as you said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a DCHECK that the temp_stack_
is empty, then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's probably reasonable. But considering current temp stack doesn't have a public method for size/empty check, and I have other open PRs for temp stack restructure, I'd add the necessary methods after other PRs are done and get this one rebased.
Hi @pitrou , would you please help to take a look? Thanks. |
cpp/src/arrow/compute/row/grouper.cc
Outdated
@@ -223,12 +223,11 @@ struct AnyKeysSegmenter : public BaseRowSegmenter { | |||
|
|||
AnyKeysSegmenter(const std::vector<TypeHolder>& key_types, ExecContext* ctx) | |||
: BaseRowSegmenter(key_types), | |||
ctx_(ctx), | |||
grouper_(nullptr), | |||
grouper_(Grouper::Make(key_types, ctx).ValueOrDie()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reasonable possibility that this might fail, for example due to input types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without extra context, there is. But the factory method (cleverly) avoids this by a dummy instantiation of Grouper
:
arrow/cpp/src/arrow/compute/row/grouper.cc
Line 220 in a715ea0
ARROW_RETURN_NOT_OK(Grouper::Make(key_types, ctx)); // check types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But isn't it pointless to construct the Grouper
twice? Can't the constructor simply take the Grouper
that was constructed in the factory method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you are saying. Will update. Thanks for pointing this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
map_.clear(); | ||
offsets_.clear(); | ||
key_bytes_.clear(); | ||
num_groups_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about encoders_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I saw in https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/row_encoder_internal.h, none of the encoders holds states that would change during the lifespan of the grouper.
rows_minibatch_.Clean(); | ||
map_.cleanup(); | ||
RETURN_NOT_OK(map_.init(encode_ctx_.hardware_flags, ctx_->memory_pool())); | ||
return Status::OK(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a DCHECK that the temp_stack_
is empty, then?
// Assert next is the last (empty) segment. | ||
ASSERT_OK_AND_ASSIGN(auto segment, segmenter->GetNextSegment(batch, offset)); | ||
ASSERT_TRUE(segment.offset >= batch.length); | ||
ASSERT_TRUE(segment.is_open); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should something be checked about segment.length
here? Or is undetermined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last segment.length
is supposed to be 0
. Updated.
Co-authored-by: Antoine Pitrou <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zanmato1984 . I'll declare this good to merge, though I'd love if @westonpace could double-check this.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ada965f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change Recently I've been working on some improvement for `Grouper` and I found adding `Reset` function could be beneficial. Then I trace down to apache#33484 from a TODO in code. Here comes this PR. ### What changes are included in this PR? Add `Reset` function for all the concrete `Grouper` implementations, and eliminate the recreation of `Grouper` in `AnyKeysSegmenter`. Also add more `RowSegmenter` cases covering `AnyKeysSegmenter`. ### Are these changes tested? Yes. Legacy UTs should cover it well. Also added some new UTs. ### Are there any user-facing changes? None. * GitHub Issue: apache#33484 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change Recently I've been working on some improvement for `Grouper` and I found adding `Reset` function could be beneficial. Then I trace down to apache#33484 from a TODO in code. Here comes this PR. ### What changes are included in this PR? Add `Reset` function for all the concrete `Grouper` implementations, and eliminate the recreation of `Grouper` in `AnyKeysSegmenter`. Also add more `RowSegmenter` cases covering `AnyKeysSegmenter`. ### Are these changes tested? Yes. Legacy UTs should cover it well. Also added some new UTs. ### Are there any user-facing changes? None. * GitHub Issue: apache#33484 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Recently I've been working on some improvement for
Grouper
and I found addingReset
function could be beneficial. Then I trace down to #33484 from a TODO in code. Here comes this PR.What changes are included in this PR?
Add
Reset
function for all the concreteGrouper
implementations, and eliminate the recreation ofGrouper
inAnyKeysSegmenter
.Also add more
RowSegmenter
cases coveringAnyKeysSegmenter
.Are these changes tested?
Yes. Legacy UTs should cover it well. Also added some new UTs.
Are there any user-facing changes?
None.
Grouper::Reset
#33484