[C++] Add ordered aggregation #32884

Closed
asfimport opened this issue Sep 7, 2022 · 1 comment · Fixed by #34311

Comments

@asfimport
Collaborator

Ordered aggregation is similar to grouped aggregation, except that one column in the grouping key is known to be ordered. Both kinds of aggregation produce the same result, but the ordered column makes optimizations possible, such as emitting results earlier and holding less intermediate state.

Reporter: Yaron Gvili / @rtpsw
Assignee: Yaron Gvili / @rtpsw

Note: This issue was originally created as ARROW-17642. Please see the migration documentation for further details.

@asfimport asfimport added this to the 11.0.0 milestone Jan 11, 2023
@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
rtpsw added a commit to rtpsw/arrow that referenced this issue Feb 23, 2023
rtpsw added a commit to rtpsw/arrow that referenced this issue Feb 23, 2023
rtpsw added a commit to rtpsw/arrow that referenced this issue Mar 3, 2023
rtpsw added a commit to rtpsw/arrow that referenced this issue Mar 7, 2023
@rtpsw
Contributor

rtpsw commented Mar 9, 2023

Follow-ups listed in #34475

icexelloss added a commit that referenced this issue Mar 10, 2023
This PR adds "Segmented Aggregation" to the existing aggregation
node to improve aggregation on ordered data.

A segment group is defined as "a contiguous chunk of data that has the
same segment key value". For example, if the input data looks like

```
[0, 0, 0, 1, 2, 2] 
```

then there are three segment groups: `[0, 0, 0]`, `[1]`, and `[2, 2]`.

(Note: the "group" in "segment group" is added to differentiate it from
"segment", which is defined as "a contiguous chunk of data within an
ExecBatch".)
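
To make the segment group definition concrete, here is a minimal
standalone sketch that splits a key column into runs of equal adjacent
values. `SplitIntoSegmentGroups` is a hypothetical helper written for
illustration; it is not an Arrow API and not the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): returns one [begin, end) index
// range per run of equal adjacent key values.
std::vector<std::pair<std::size_t, std::size_t>> SplitIntoSegmentGroups(
    const std::vector<int64_t>& keys) {
  std::vector<std::pair<std::size_t, std::size_t>> groups;
  std::size_t begin = 0;
  for (std::size_t i = 1; i <= keys.size(); ++i) {
    // Close the current run at the end of the input or when the key changes.
    if (i == keys.size() || keys[i] != keys[begin]) {
      groups.emplace_back(begin, i);
      begin = i;
    }
  }
  return groups;
}
```

For the input `[0, 0, 0, 1, 2, 2]` this yields the ranges `[0, 3)`,
`[3, 4)` and `[4, 6)`, matching the three segment groups above.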

Segmented aggregation can replace the existing hash aggregation when
the data are ordered. The benefits are:
(1) Aggregation results can be output earlier (as soon as a segment
group is fully consumed).
(2) Only the partial aggregation for one segment group needs to be held
at a time, which reduces memory usage.
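
The two benefits can be illustrated with a small streaming sketch. It
assumes a single integer segment key and a sum aggregate; `Row` and
`SegmentedSum` are hypothetical names chosen for this example and do not
reflect how the aggregation node is actually implemented.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical row type for illustration: one segment key, one value.
struct Row {
  int64_t key;
  int64_t value;
};

// Hypothetical streaming sum over segment groups (not Arrow's implementation).
class SegmentedSum {
 public:
  using Result = std::pair<int64_t, int64_t>;  // (segment key, sum)

  // Consumes one batch and returns a result for every segment group that
  // this batch shows to be complete, i.e. whose key was followed by a
  // different key.
  std::vector<Result> Consume(const std::vector<Row>& batch) {
    std::vector<Result> finished;
    for (const Row& row : batch) {
      if (has_current_ && row.key != current_.first) {
        finished.push_back(current_);  // benefit (1): emit early
        has_current_ = false;
      }
      if (!has_current_) {
        current_ = Result(row.key, 0);
        has_current_ = true;
      }
      current_.second += row.value;
    }
    return finished;
  }

  // Call once at end of input to flush the last, still open segment group.
  std::vector<Result> Finish() {
    std::vector<Result> finished;
    if (has_current_) finished.push_back(current_);
    has_current_ = false;
    return finished;
  }

 private:
  // Benefit (2): state for only one segment group is held at a time.
  Result current_{0, 0};
  bool has_current_ = false;
};
```

A segment group may span several batches: the sketch keeps its partial
sum across `Consume` calls and emits it only once a different key (or
the end of input, via `Finish`) is seen.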

See https://issues.apache.org/jira/browse/ARROW-17642

Replaces #14352
* Closes: #32884

Follow ups
=======
* #34475 
* #34529

---------

Co-authored-by: Li Jin <[email protected]>