Support StreamAggregation / streaming group by #5133

xiaoyong-z · 2023-01-31T15:53:07Z

If the input is already sorted on the group by columns before the aggregation, we can use stream aggregation, which is more efficient than HashAggregation.
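To illustrate the idea, here is a minimal sketch of a streaming SUM over input that is already sorted by the group key. It is not DataFusion code: the function name `stream_sum` and the `(key, value)` row layout are assumptions made for this example. Because equal keys arrive as one contiguous run, only a single accumulator is alive at any time and no hash table is needed.

```rust
/// Streaming SUM grouped by key (hypothetical helper, not a DataFusion API).
/// Assumes `rows` is already sorted by key, so every group is one contiguous run.
fn stream_sum(rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out = Vec::new();
    // The single group currently being accumulated: (key, running sum).
    let mut current: Option<(i64, i64)> = None;

    for &(key, value) in rows {
        if let Some((k, sum)) = current.as_mut() {
            if *k == key {
                // Same group as the previous row: keep accumulating.
                *sum += value;
                continue;
            }
        }
        // The key changed (or this is the first row): the previous group is
        // complete and can be emitted immediately, then a new group starts.
        if let Some(done) = current.take() {
            out.push(done);
        }
        current = Some((key, value));
    }

    // Emit the last group once the input is exhausted.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

A finished group can be emitted as soon as the key changes, which is why this style of aggregation does not have to consume all of its input before producing output.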

xiaoyong-z · 2023-01-31T15:56:34Z

@alamb hello, it seems that DataFusion currently doesn't have StreamAggregation. If no one is working on this, I would like to implement it.

alamb · 2023-01-31T16:49:31Z

Hi @xiaoyong-z -- that is great news! -- I believe that @metesynnada @ozankabak mentioned they wanted to work on this feature. Let's use this ticket to collaborate on a design.

I believe #1570 is also related, as streaming grouping is often used to merge the spilled groups. @milenkovicm and I had some discussion about this in a comment on #1570, but I never followed through on a writeup.

ozankabak · 2023-01-31T19:41:47Z

This was on our roadmap and we would love to help out on this. @alamb, if you can share with us the papers/resources you mentioned on this we can digest them and share our thinking on the design. @xiaoyong-z, do you have a particular design in mind yet?

alamb · 2023-01-31T22:02:48Z

I will begin a google doc for us to collaborate on

alamb · 2023-01-31T23:25:34Z

Here is a google doc with some ideas https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing

I have it in "comment" mode for everyone on the internet, but please feel free to request edit access and I will grant it

ozankabak · 2023-02-01T03:06:15Z

Thank you for putting this together; I had an initial look. I expect us to do a deeper dive, take a look at the mentioned papers, and give meaningful comments in the next several days.

Ted-Jiang · 2023-02-01T06:27:12Z

Our team is also looking forward to this feature and the memory limited aggregation 👍

xiaoyong-z · 2023-02-01T12:45:20Z

Thank you all. I'm still at a very early stage, and I plan to look into some papers and how other systems implement it over the next few days. @alamb thanks for sharing the google doc, I will add my design plan to it later.

xiaoyong-z · 2023-02-07T16:28:32Z

I have added some plans for implementing the stream aggregation to https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing, PTAL. A detailed design for fully streaming aggregation will follow in the next few days.

alamb · 2023-02-08T17:08:10Z

> PTAL. Detail design for fully stream aggregation will be given in the next following days.

Thank you @xiaoyong-z -- I read your addition and left some comments. Overall I think it is a great idea.

Here is one possible approach to implementation (perhaps what you had in mind):

1. Implement StreamAggregate that handles pre-sorted data (where the data is already sorted according to the grouping keys / partition keys).
2. Remove AggregateStream https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/no_grouping.rs and replace its use in the optimizer with the new StreamAggregate operator.
3. Update the optimizer to recognize when the input to a GroupByHash is sorted appropriately and switch to using the new StreamAggregate operator.

I think that would get us pretty far.
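The optimizer check in the last step could look roughly like the sketch below. Everything in it is hypothetical (`AggStrategy`, `choose_strategy`, and column names as plain strings) and it is not how DataFusion represents orderings. It assumes one simple rule: streaming is safe when the group by columns, in any order, are exactly the leading columns of the input sort order, since rows of the same group are then contiguous.

```rust
use std::collections::HashSet;

/// Hypothetical strategy tag for this sketch; not a DataFusion type.
#[derive(Debug, PartialEq)]
enum AggStrategy {
    Streaming,
    Hash,
}

/// Pick a strategy from the input's sort columns and the group by columns.
/// Assumption: streaming is valid when the set of group by columns equals
/// the set formed by the first `n` sort columns (n = number of distinct keys).
fn choose_strategy(input_ordering: &[&str], group_by: &[&str]) -> AggStrategy {
    let keys: HashSet<&str> = group_by.iter().copied().collect();
    let n = keys.len();

    // Not enough sort columns to cover every grouping key.
    if input_ordering.len() < n {
        return AggStrategy::Hash;
    }

    // Compare as sets: GROUP BY b, a is still contiguous under ORDER BY a, b.
    let prefix: HashSet<&str> = input_ordering[..n].iter().copied().collect();
    if prefix == keys {
        AggStrategy::Streaming
    } else {
        AggStrategy::Hash
    }
}
```

With no group by columns the prefix check passes trivially, so this sketch also covers the case that the existing no-grouping operator handles today.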

I am not sure about the idea of "sort the data first and then run the stream aggregator" -- as I mentioned in the document I think it is unlikely that approach will be better in terms of overall memory usage or performance.

When we want to support spilling group by (external group by) that is when sort might be beneficial.

ozankabak · 2023-02-08T18:15:39Z

@mustafasrepo, can you take a detailed look at @xiaoyong-z's design? Thanks.

mustafasrepo · 2023-02-09T08:24:54Z

Thanks @xiaoyong-z for the design. I asked some questions to understand it better and left some comments. Overall, I think your roadmap is well thought out and well planned.

mustafasrepo · 2023-03-08T12:21:20Z

Hi @xiaoyong-z, I can take on some of the tasks from the document, if you are not already working on them. Specifically, I would like to start with this one:

1. Implement StreamAggregate that handles pre-sorted data (where the data is already sorted according to the grouping keys / partition keys).

I guess this corresponds to the 2nd step in the document.

xiaoyong-z · 2023-03-11T04:40:18Z

@mustafasrepo Sorry, I currently don't have time to push this work forward.
If you have time on your side, feel free to work on any part of this feature.

xiaoyong-z added the enhancement New feature or request label Jan 31, 2023

alamb changed the title ~~Support StreamAggregation~~ Support StreamAggregation / streaming group by Mar 3, 2023

alamb mentioned this issue Mar 3, 2023

Improve the performance of Aggregator, grouping, aggregation #4973

Closed

4 tasks

mustafasrepo mentioned this issue Apr 17, 2023

Implement Streaming Aggregation: Do not break pipeline in aggregation if group by columns are ordered #6034

Closed

alamb mentioned this issue Apr 26, 2023

Implement Streaming Aggregation: Do not break pipeline in aggregation if group by columns are ordered (V2) #6124

Merged

mustafasrepo closed this as completed in #6124 Apr 27, 2023
