Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array agg groups accumulator #10149

Closed
wants to merge 5 commits into from

Conversation

lkt
Copy link

@lkt lkt commented Apr 20, 2024

Which issue does this PR close?

Closes #10145.

Rationale for this change

See the issue.

What changes are included in this PR?

GroupsAccumulator for array_agg aggregation function for:

  • Primitive types
  • String type

Not included:

  • Accumulating arrays of any level of nesting.

Are these changes tested?

Extended tests in array_agg.rs

Are there any user-facing changes?

No, just a performance improvement.

@github-actions github-actions bot added the physical-expr Physical Expressions label Apr 20, 2024
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know of a benchmark we can run to test these changes? 👀

@@ -96,6 +109,139 @@ impl AggregateExpr for ArrayAgg {
fn name(&self) -> &str {
&self.name
}

fn groups_accumulator_supported(&self) -> bool {
self.input_data_type.is_primitive() || self.input_data_type == DataType::Utf8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to support LargeUtf8 and Binary as well? Given how similar they are to Utf8?

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

Thank you for this PR @lkt -- I will take a look at this one carefully shortly

opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(new_values.len(), 1, "single argument to update_batch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return an internal_err!("") with more user friendly message

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. I would keep the assert_eq! here to keep it consistent with other implementations of GroupsAccumulator. Eg. see here among others:
https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/aggregate/groups_accumulator/prim_op.rs?plain=1#L93

Also this assertion is meant for developer to ensure the correct number of elements in the array between aggregation stages, not the end user. Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to get rid of application crashes, returning error instead of panic will help. But thanks for pointing out we may want to revisit assertions and return errors wherever possible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10232

opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(new_values.len(), 1, "single argument to update_batch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(values.len(), 1, "single argument to merge_batch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@comphead
Copy link
Contributor

Thanks @lkt for your contribution, does this PR stand for array_agg performance improvement? If so it is good to see a bench mark results

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 25, 2024
@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

Seems to be superceded by #11096

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jun 29, 2024
@alamb alamb marked this pull request as draft July 1, 2024 11:48
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 31, 2024
@github-actions github-actions bot closed this Sep 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement GroupsAccumulator for array_agg aggregation function
4 participants