Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(udf): add support for builtin aggregate UDFs #7134

Merged
merged 6 commits into from
Sep 14, 2023

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 12, 2023

Adds support for builtin user-defined aggregate functions. This came up as a use case during a blog post I am authoring on the expanded array functionality for BigQuery that is landing in ibis 7.0.0. The blog post no longer uses this feature, but it's a useful one regardless.

Depends on #7122.

@cpcloud cpcloud added this to the 7.0 milestone Sep 12, 2023
@cpcloud cpcloud added feature Features or general enhancements udf Issues related to user-defined functions experimental Experimental features labels Sep 12, 2023
@cpcloud cpcloud changed the title builtin udafs feat(udf): add support for builtin aggregate UDFs Sep 12, 2023
@cpcloud cpcloud added the sql Backends that generate SQL label Sep 12, 2023
@cpcloud cpcloud force-pushed the builtin-udafs branch 2 times, most recently from cdda71e to 1398acd Compare September 13, 2023 10:44
@cpcloud cpcloud marked this pull request as ready for review September 13, 2023 10:46
@cpcloud cpcloud force-pushed the builtin-udafs branch 9 times, most recently from 79d4668 to 4158ccc Compare September 13, 2023 17:31
@cpcloud cpcloud requested a review from jcrist September 14, 2023 11:59
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this seems pretty good.

One thing I thought about while reviewing this is that maybe grouping functionality by UDF implementation (builtin, pyarrow, ...) rather than kind (scalar, agg, ...) would make more sense? So it'd be udf.builtin.scalar instead of udf.scalar.builtin?

Most/all backends should be able to support udf.builtin for exposing builtin functionality, but the other UDF types would be rarer to support. If a user wants to expose a builtin function directing them to udf.builtin.* rather than udf.*.builtin might make things clearer since that functionality is all grouped together?

No need to deal with this now/here if we want to make this change, but it came up while reviewing it so I thought I'd mention it.

ibis/expr/operations/udf.py Outdated Show resolved Hide resolved
ibis/expr/operations/udf.py Show resolved Hide resolved
docs/how-to/extending/builtin.qmd Outdated Show resolved Hide resolved
docs/how-to/extending/builtin.qmd Show resolved Hide resolved
docs/how-to/extending/builtin.qmd Outdated Show resolved Hide resolved
@cpcloud
Copy link
Member Author

cpcloud commented Sep 14, 2023

One thing I thought about while reviewing this is that maybe grouping functionality by UDF implementation (builtin, pyarrow, ...) rather than kind (scalar, agg, ...) would make more sense? So it'd be udf.builtin.scalar instead of udf.scalar.builtin?

Yeah ...

I went back and forth on this a bit when building the API.

I don't have a super strong opinion, but my thinking is based on how people talk about the separate kinds of UDFs in other systems.

The first point of departure in my experience has always been the shape of the output: scalar, aggregate, or tabular, and so that was the first level of namespacing I chose.

I think but am not entirely sure, that this makes implementation a bit cleaner because there's more overlap in behavior in the scalar category than the builtin category. That's not really based on thorough analysis of anything though, just my intuition.

I think when I am reading UDF signatures the first thing I want to know is where I can use it, and scalar/agg/tabular is a stronger determinant of where I can call it than whether the thing is builtin or not.

@cpcloud cpcloud force-pushed the builtin-udafs branch 2 times, most recently from d95bfb6 to 6f3f4af Compare September 14, 2023 17:26
@jcrist
Copy link
Member

jcrist commented Sep 14, 2023

I think when I am reading UDF signatures the first thing I want to know is where I can use it, and scalar/agg/tabular is a stronger determinant of where I can call it than whether the thing is builtin or not.

That's a good point - I think for reading existing udfs scalar/agg/tabular is more critical, but for writing whether it's builtin/pyarrow/pandas feels more critical. I also don't have strong thoughts on this, what we have now seems fine, we can always reorg later if it a different layout becomes clearly better.

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, excited to use this new functionality

]
)
)
expr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice example

@cpcloud cpcloud enabled auto-merge (rebase) September 14, 2023 18:18
@cpcloud cpcloud merged commit c383f62 into ibis-project:master Sep 14, 2023
@cpcloud cpcloud deleted the builtin-udafs branch September 14, 2023 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental Experimental features feature Features or general enhancements sql Backends that generate SQL udf Issues related to user-defined functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants