feat: support `grouping` aggregate function #10208

JasonLi-cn · 2024-04-24T07:02:30Z

Which issue does this PR close?

Rationale for this change

Currently, DataFusion does not implement the `grouping` function:

datafusion/datafusion/physical-expr/src/aggregate/grouping.rs

Lines 80 to 84 in 4edbdd7

    
    fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
        not_impl_err!(
            "physical plan is not yet implemented for GROUPING aggregate function"
        )
    }

What changes are included in this PR?

Complete the grouping function.

https://www.postgresql.org/docs/9.5/functions-aggregate.html
https://learn.microsoft.com/en-us/sql/t-sql/functions/grouping-transact-sql?view=sql-server-ver15
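
For reference, the semantics described in the PostgreSQL documentation linked above can be sketched in plain Rust. This is only an illustration and not the code in this PR: `grouping_mask` and its parameters are hypothetical names, and expressions are represented as strings for simplicity. Per the PostgreSQL docs, the rightmost argument maps to the least-significant bit, and a bit is 1 when the corresponding expression is not part of the grouping set that produced the row.

```rust
/// Hypothetical helper (not part of DataFusion): computes the value that
/// `GROUPING(args...)` would return for rows produced by one grouping set,
/// following the PostgreSQL definition.
///
/// `args` are the GROUPING arguments in call order; `grouping_set` lists the
/// GROUP BY expressions that are actually grouped on for the current row.
fn grouping_mask(args: &[&str], grouping_set: &[&str]) -> u32 {
    args.iter().fold(0u32, |mask, arg| {
        // 0 if the expression is part of the grouping set,
        // 1 if it was aggregated away for this row.
        let bit = if grouping_set.contains(arg) { 0 } else { 1 };
        // Shift first so that the rightmost argument ends up in the lowest bit.
        (mask << 1) | bit
    })
}
```

Under these assumptions, `GROUP BY ROLLUP(a, b)` with `GROUPING(a, b)` yields 0 for the `(a, b)` rows, 1 for the `(a)` subtotal rows, and 3 for the grand-total row.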

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. We should probably add the grouping function to the aggregate functions documentation:
https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html

JasonLi-cn · 2024-04-24T07:04:04Z

Related work:
#2477
#2486

waynexia

Thanks @JasonLi-cn for this 👍

I noticed this from PG's document:

> The arguments to the GROUPING operation are not actually evaluated, but they must match exactly expressions given in the GROUP BY clause of the associated query level.

Do we need some verification to make sure the arguments of GROUPING match the GROUP BY expressions? Also, I see that the implementation of GroupingGroupsAccumulator assumes the input expr is a column, but GROUP BY has no such constraint.
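
A minimal sketch of the kind of check being asked about, assuming expressions can be compared for exact equality. Here they are plain strings; in DataFusion the comparison would presumably be between `Expr` values. The function name `validate_grouping_args` is hypothetical and does not exist in this PR.

```rust
/// Hypothetical validation (not part of this PR): every GROUPING argument must
/// match one of the GROUP BY expressions exactly. Expressions are compared as
/// strings here, so arbitrary expressions such as `a + 1` are handled the same
/// way as plain columns.
fn validate_grouping_args(args: &[&str], group_by: &[&str]) -> Result<(), String> {
    for arg in args {
        if !group_by.contains(arg) {
            return Err(format!(
                "GROUPING argument `{arg}` does not appear in the GROUP BY clause"
            ));
        }
    }
    Ok(())
}
```

Because the check compares whole expressions rather than column names, it does not depend on the GROUP BY entries being columns.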

waynexia · 2024-04-28T06:46:34Z

datafusion/expr/src/aggregate_function.rs

+            AggregateFunction::Count | AggregateFunction::Grouping => {
+                Signature::variadic_any(Volatility::Immutable)
+            }


To my understanding, this is the key user-facing change in this PR: supporting grouping() over multiple columns.

waynexia · 2024-04-28T06:57:26Z

datafusion/physical-expr/src/aggregate/grouping.rs

+    mask
+}
+
+impl GroupsAccumulator for GroupingGroupsAccumulator {


From the documentation of GroupsAccumulator (https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html#notes-on-implementing-groupaccumulator):

> All aggregates must first implement the simpler Accumulator trait, which handles state for a single group. Implementing GroupsAccumulator is optional and is harder to implement than Accumulator, but can be much faster for queries with many group values.

I suppose the grouping accumulator should follow this and implement Accumulator as well.

I agree with you. I also wanted to implement Accumulator. However, I found that Accumulator cannot be implemented with its current definition: when update_batch is called, we need to know the current grouping set, so we would have to add a parameter to update_batch. This is a big change, so I need the community's advice.

I see. Accumulator::update_batch is not passed group_indices: &[usize]. It does not make sense to me to maintain that information some other way just so that this special grouping expr can implement Accumulator. Do you have any insight, @alamb?

I agree that a special case just for the grouping aggregate that forces changes on all other aggregates isn't ideal.

> When calling update_batch, we need to know the information of the current grouping set, so we need to add a parameter to update_batch

After reading https://www.postgresql.org/docs/9.5/functions-aggregate.html I see that grouping is basically a special case that only makes sense in the context of grouping sets (it provides some context about the grouping set).

Given it is so special, I wonder if we could special case it somehow 🤔

One thing maybe we could do is to add another signature?

    trait GroupsAccumulator {
        ...
        /// Called with the information about which grouping set this batch belongs to.
        /// The default implementation calls `Self::update_batch` and ignores the grouping_set
        fn update_grouping_batch(
            &mut self,
            _values: &[ArrayRef],
            group_indices: &[usize],
            opt_filter: Option<&arrow_array::BooleanArray>,
            total_num_groups: usize,
            grouping_set: &[bool],
        ) -> Result<()> {
            self.update_batch(_values, group_indices, opt_filter, total_num_groups)
        }
        ...

And then we could make it clear in the documentation that the aggregator calls update_grouping_batch, but that most implementations can just implement update_batch.
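
To illustrate how such a default method would leave ordinary aggregates untouched, here is a dependency-free sketch. It is not DataFusion's real trait: `&[i64]` stands in for `&[ArrayRef]`, the filter argument and `Result` are dropped, and `SumAccumulator` and `GroupingAccumulator` are hypothetical types. It also assumes that `grouping_set[i] == true` means the i-th GROUP BY expression is nulled out (aggregated away) in the current grouping set, and that the GROUPING arguments are exactly the GROUP BY expressions in order.

```rust
/// Simplified stand-in for the real `GroupsAccumulator` trait.
trait GroupsAccumulator {
    /// Required: update per-group state from one batch of values.
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize);

    /// Proposed hook: the default ignores `grouping_set` and delegates,
    /// so existing accumulators need no changes.
    fn update_grouping_batch(
        &mut self,
        values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
        _grouping_set: &[bool],
    ) {
        self.update_batch(values, group_indices, total_num_groups)
    }

    fn evaluate(&self) -> Vec<i64>;
}

/// An ordinary aggregate: only implements `update_batch`.
struct SumAccumulator {
    sums: Vec<i64>,
}

impl GroupsAccumulator for SumAccumulator {
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize) {
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0);
        }
        for (v, &g) in values.iter().zip(group_indices) {
            self.sums[g] += *v;
        }
    }

    fn evaluate(&self) -> Vec<i64> {
        self.sums.clone()
    }
}

/// The special case: overrides the hook to record a bitmask per group.
struct GroupingAccumulator {
    masks: Vec<i64>,
}

impl GroupsAccumulator for GroupingAccumulator {
    fn update_batch(&mut self, _values: &[i64], _group_indices: &[usize], total_num_groups: usize) {
        // Without grouping-set information, new groups simply get mask 0.
        if self.masks.len() < total_num_groups {
            self.masks.resize(total_num_groups, 0);
        }
    }

    fn update_grouping_batch(
        &mut self,
        _values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
        grouping_set: &[bool],
    ) {
        if self.masks.len() < total_num_groups {
            self.masks.resize(total_num_groups, 0);
        }
        // Leftmost expression ends up in the most significant bit.
        let mask = grouping_set
            .iter()
            .fold(0i64, |m, &is_null| (m << 1) | i64::from(is_null));
        for &g in group_indices {
            self.masks[g] = mask;
        }
    }

    fn evaluate(&self) -> Vec<i64> {
        self.masks.clone()
    }
}
```

The caller would always invoke `update_grouping_batch`; only the grouping accumulator looks at the extra argument.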

waynexia · 2024-04-28T06:58:20Z

datafusion/physical-expr/src/aggregate/grouping.rs

+    fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
+        not_impl_err!(
+            "physical plan is not yet implemented for GROUPING aggregate function"
+        )
+    }


Can we implement Accumulator for GroupingGroupsAccumulator and then implement this method?

JasonLi-cn · 2024-04-28T11:50:20Z

> Thanks @JasonLi-cn for this 👍
>
> I noticed this from PG's document:
>
> > The arguments to the GROUPING operation are not actually evaluated, but they must match exactly expressions given in the GROUP BY clause of the associated query level.
>
> Do we need some verification to make sure the arguments of GROUPING match the GROUP BY expressions? Also, I see that the implementation of GroupingGroupsAccumulator assumes the input expr is a column, but GROUP BY has no such constraint.

Thanks @waynexia for your suggestion. I agree with you and I will make improvements according to your suggestions.

alamb

Thanks for this PR @JasonLi-cn and the review @waynexia

I was not familiar with the grouping function before. Fascinating. I left some thoughts

I apologize for the delay in review. I have been very busy lately.

alamb · 2024-05-21T20:18:20Z


github-actions · 2024-07-21T01:54:16Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

feat: support grouping aggregate function

c716130

github-actions bot added the logical-expr, physical-expr, core, and sqllogictest labels Apr 24, 2024

JasonLi-cn changed the title ~~feat: support grouping aggregate function~~ feat: support `grouping` aggregate function Apr 24, 2024

chore: pass grouping.slt test

3ecf361

alamb mentioned this pull request Apr 25, 2024

DataFusion weekly project plan (Andrew Lamb) - April 22, 2024 #10172

Closed

7 tasks

waynexia reviewed Apr 28, 2024


add more tests

272be90

alamb mentioned this pull request Apr 29, 2024

DataFusion weekly project plan (Andrew Lamb) - April 29, 2024 #10283

Closed

8 tasks

alamb reviewed May 21, 2024


github-actions bot added the Stale label Jul 21, 2024

JasonLi-cn marked this pull request as draft July 21, 2024 06:34

github-actions bot removed the Stale label Jul 22, 2024

alamb mentioned this pull request Sep 19, 2024

Support Grouping functions with Group By CUBE/ROLLUP/GROUPING SETS #5647

Closed

eejbyfeldt mentioned this pull request Oct 7, 2024

feat: Implement grouping function using grouping id #12704

Merged

alamb closed this in #12704 Oct 16, 2024
