GH-32884: [C++] Add ordered aggregation #34311

Merged: 30 commits, Mar 10, 2023

Conversation

rtpsw
Contributor

@rtpsw rtpsw commented Feb 23, 2023

This PR adds "Segmented Aggregation" to the existing aggregation node to improve aggregation on ordered data.

A segment group is defined as "a contiguous chunk of data that has the same segment key value". For example, if the input data looks like

[0, 0, 0, 1, 2, 2] 

then there are three segment groups: [0, 0, 0], [1], and [2, 2].

(Note that the "group" in "segment group" is added to differentiate it from "segment", which is defined as "a contiguous chunk of data within an ExecBatch".)

Segmented aggregation can replace the existing hash aggregation when the data are ordered. The benefits are:
(1) We can output an aggregation result earlier (as soon as a segment group is fully consumed).
(2) We only need to hold the partial aggregation for one segment group, which reduces memory usage.
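
As an illustration of the definition above, here is a minimal sketch that splits a key column into segment groups. `SegmentGroups` is a hypothetical helper written for this example, not a function from Arrow, and it assumes plain integer keys held in a `std::vector`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): returns the [begin, end) row ranges
// of maximal runs of equal consecutive keys, i.e. the segment groups.
std::vector<std::pair<std::size_t, std::size_t>> SegmentGroups(
    const std::vector<int>& keys) {
  std::vector<std::pair<std::size_t, std::size_t>> groups;
  std::size_t begin = 0;
  for (std::size_t i = 1; i <= keys.size(); ++i) {
    // Close the current run at end of input or when the key changes.
    if (i == keys.size() || keys[i] != keys[begin]) {
      groups.emplace_back(begin, i);
      begin = i;
    }
  }
  return groups;
}
```

For the input [0, 0, 0, 1, 2, 2] this yields the ranges [0, 3), [3, 4), and [4, 6), matching the three segment groups listed above.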

See https://issues.apache.org/jira/browse/ARROW-17642

Replaces #14352

Follow ups

@rtpsw rtpsw requested a review from westonpace as a code owner February 23, 2023 10:21
@github-actions

⚠️ GitHub issue #32884 has been automatically assigned in GitHub to PR creator.

@rtpsw
Contributor Author

rtpsw commented Feb 23, 2023

@icexelloss, please review the design in this commit.

@westonpace, in conflict resolution, I see cpp/src/arrow/compute/exec/aggregate.{h,cc} have been deleted. What was the story around that? I'm trying to figure out how best to resolve.

@rtpsw
Contributor Author

rtpsw commented Feb 23, 2023

@westonpace, in conflict resolution, I see cpp/src/arrow/compute/exec/aggregate.{h,cc} have been deleted. What was the story around that? I'm trying to figure out how best to resolve.

I found the commit you made which removed these files and I generally figured it out. I pushed a resolution of the conflicts.

@rtpsw
Contributor Author

rtpsw commented Feb 23, 2023

See segmented aggregation as a generalization of ordered aggregation (in the PR replaced by this one).

@icexelloss
Contributor

icexelloss commented Feb 23, 2023

See segmented aggregation as a generalization of ordered aggregation (in the PR replaced by this one).

@rtpsw Can you add some code comments explaining the concepts, data structures, and algorithm? That way the code is self-documenting and the reader doesn't need to jump through review/PR links to understand it.

Member

@westonpace westonpace left a comment

Let's remove the mutex stuff for now since we expect this to be single threaded. We can put it back in later.

Let's remove the paths handling chunked arrays as they should no longer be needed.

I think we can even make this simpler by removing all use of ExecSpan in favor of ExecBatch but we can do that in a follow-up (especially since you didn't introduce this).

I'll take another pass once that's done but I went through it pretty thoroughly today. I think I finally understand everything and it looks like a pretty good approach.

Once we move to a multithreaded approach, I think the only thing we will need to serialize on is figuring out the segment boundaries. The way this is set up today, that responsibility lies with the segmenter, so I think this should be pretty straightforward. We can worry / talk more about that later.

@westonpace
Member

Also, can we add at least one or two basic end-to-end tests in plan_test.cc (or you could create an aggregate_node_test.cc). Partly as examples for future readers as much as anything.

@rtpsw
Contributor Author

rtpsw commented Feb 24, 2023

Also, can we add at least one or two basic end-to-end tests in plan_test.cc (or you could create an aggregate_node_test.cc). Partly as examples for future readers as much as anything.

My recent commit should have the requested fixes except the above one for adding tests (TBD).

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Mar 8, 2023
Comment on lines 49 to 51
// values [A, B, A] at row-indices [0, 1, 2]. A regular group-by aggregation with keys [X]
// yields a row-index partitioning [[0, 2], [1]] whereas a segmented-group-by aggregation
// with segment-keys [X] yields [[0], [1], [2]].
Member

Minor nit: This example could be slightly improved I think if you used [A, A, B, A] so that readers could see that the segmented group by still does segment.
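
To make the suggested [A, A, B, A] example concrete, the sketch below computes both row-index partitionings. `GroupByPartition` and `SegmentedPartition` are hypothetical helpers written for this illustration and assume single-character keys; they are not the Arrow implementation.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Partition = std::vector<std::vector<std::size_t>>;

// Regular group-by: rows with equal keys share a group regardless of position.
// Groups are listed in order of first appearance.
Partition GroupByPartition(const std::vector<char>& keys) {
  std::map<char, std::size_t> group_of_key;
  Partition out;
  for (std::size_t row = 0; row < keys.size(); ++row) {
    auto inserted = group_of_key.emplace(keys[row], out.size());
    if (inserted.second) out.emplace_back();  // first time this key is seen
    out[inserted.first->second].push_back(row);
  }
  return out;
}

// Segmented group-by: a new group starts whenever the key differs from the
// key of the previous row.
Partition SegmentedPartition(const std::vector<char>& keys) {
  Partition out;
  for (std::size_t row = 0; row < keys.size(); ++row) {
    if (row == 0 || keys[row] != keys[row - 1]) out.emplace_back();
    out.back().push_back(row);
  }
  return out;
}
```

With keys [A, A, B, A], the regular partitioning is [[0, 1, 3], [2]] while the segmented one is [[0, 1], [2], [3]], so the first two rows stay together but the trailing A forms its own segment.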

Comment on lines 188 to 190
// Handle the input batch
// If a segment is closed by this batch, then we output the aggregation for the segment
// If a segment is not closed by this batch, then we add the batch to the segment
Member

Suggested change
- // Handle the input batch
- // If a segment is closed by this batch, then we output the aggregation for the segment
- // If a segment is not closed by this batch, then we add the batch to the segment
+ // Extract segments from a batch and run the given handler on them. Note that the
+ // handler may be called on open segments which are not yet finished. Typically a
+ // handler should accumulate those open segments until a closed segment is reached.

// If a segment is closed by this batch, then we output the aggregation for the segment
// If a segment is not closed by this batch, then we add the batch to the segment
template <typename BatchHandler>
Status HandleSegments(std::unique_ptr<RowSegmenter>& segmenter, const ExecBatch& batch,
Member

Suggested change
- Status HandleSegments(std::unique_ptr<RowSegmenter>& segmenter, const ExecBatch& batch,
+ Status HandleSegments(RowSegmenter* segmenter, const ExecBatch& batch,

Prefer pointer over mutable reference.
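
The accumulate-until-closed pattern that the handler comment describes can be sketched as follows. This is a simplified, hypothetical model, not the code under review: `Batch` stands in for an ExecBatch with one integer key column and one value column, and `SegmentedSum` is an invented class that keeps partial state for a single segment group.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ExecBatch: parallel key and value columns.
struct Batch {
  std::vector<int> keys;
  std::vector<std::int64_t> values;
};

using Result = std::vector<std::pair<int, std::int64_t>>;

// Hypothetical segmented sum; holds partial state for one segment group only.
class SegmentedSum {
 public:
  // Consume one batch, appending a (key, sum) pair to `out` for every
  // segment group that this batch closes.
  void Consume(const Batch& batch, Result* out) {
    for (std::size_t i = 0; i < batch.keys.size(); ++i) {
      if (open_ && batch.keys[i] != key_) Flush(out);  // key changed: group closed
      if (!open_) {
        key_ = batch.keys[i];
        sum_ = 0;
        open_ = true;
      }
      sum_ += batch.values[i];
    }
    // The trailing segment stays open because the next batch may extend it.
  }

  // Call once at end of input to emit the last open segment group.
  void Finish(Result* out) {
    if (open_) Flush(out);
  }

 private:
  void Flush(Result* out) {
    out->emplace_back(key_, sum_);
    open_ = false;
  }

  bool open_ = false;
  int key_ = 0;
  std::int64_t sum_ = 0;
};
```

In this model a result is emitted as soon as a later row shows a different key, and only one running sum is held at a time, which mirrors the two benefits listed in the PR description.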

Contributor

Looks like some of the comments in here are not addressed. I will create a follow up to track.

Contributor

nvm - looks like github diff issue

Comment on lines 218 to 219
///
/// See also doc in `aggregate_node.cc`
Member

Suggested change
- ///
- /// See also doc in `aggregate_node.cc`

This documentation is for users. I'm not sure we should be directing users to aggregate_node.cc. Also, it's not clear what doc this is referring to. I think this is fine as it is without the "see also".

Contributor

@icexelloss icexelloss Mar 10, 2023

@rtpsw Looks like you missed out on this comment (minor issue)

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label on Mar 8, 2023
@rtpsw rtpsw requested a review from westonpace March 8, 2023 17:29
@rtpsw
Contributor Author

rtpsw commented Mar 8, 2023

@westonpace, is this good to go?

@westonpace
Member

From my perspective, yes. We can merge once @icexelloss approves.

@icexelloss
Contributor

@rtpsw There are many follow-up items from this PR. Can you include the list of follow-ups in the PR title so we have at least some way to track them? If you have a GH issue, please list the follow-ups here as well.

@icexelloss
Contributor

@rtpsw I think this is getting close, but there are still a number of unresolved threads. Please check that those are resolved and ping me when it is ready for me to take another look.

@rtpsw
Contributor Author

rtpsw commented Mar 8, 2023

@rtpsw I think this is getting close, but there are still a number of unresolved threads. Please check that those are resolved and ping me when it is ready for me to take another look.

@icexelloss, I went over the unresolved discussions and commented; I think they can now be resolved. The deferred issues I found are in #34475.

@icexelloss
Contributor

Thanks @rtpsw, I will take a look later today.

Can you gather all the follow-up issues and put them as a list in the PR description and the original GH issue as well?

@rtpsw
Contributor Author

rtpsw commented Mar 9, 2023

Can you gather all the follow-up issues and put them as a list in the PR description and the original GH issue as well?

Linked in both PR and original GH issue to #34475 which has the list.

@icexelloss
Contributor

Linked in both PR and original GH issue to #34475 which has the list.

Can you put the follow-up GH issue link in the list?

Contributor

@icexelloss icexelloss left a comment

LGTM approved

@icexelloss
Contributor

I also edited the PR description to make the purpose of this PR a bit more clear.

@ursabot

ursabot commented Mar 10, 2023

Benchmark runs are scheduled for baseline = 4c05a3b and contender = 9baefea. 9baefea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.39% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.77% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.69% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9baefea1 ec2-t3-xlarge-us-east-2
[Finished] 9baefea1 test-mac-arm
[Finished] 9baefea1 ursa-i9-9960x
[Finished] 9baefea1 ursa-thinkcentre-m75q
[Finished] 4c05a3b4 ec2-t3-xlarge-us-east-2
[Finished] 4c05a3b4 test-mac-arm
[Finished] 4c05a3b4 ursa-i9-9960x
[Finished] 4c05a3b4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Development

Successfully merging this pull request may close these issues.

[C++] Add ordered aggregation
4 participants