Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove special casting of Min / Max built in AggregateFunctions #11151

Closed
2 tasks done
alamb opened this issue Jun 27, 2024 · 6 comments · Fixed by #12296
Closed
2 tasks done

Remove special casting of Min / Max built in AggregateFunctions #11151

alamb opened this issue Jun 27, 2024 · 6 comments · Fixed by #12296
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 27, 2024

Is your feature request related to a problem or challenge?

Part of #10943

While trying to port min and max to UDFs in #11013, @edmondop found several places where Min and Max (the existing built in aggregate functions) are special cased

Here is Max:

pub struct Max {
name: String,
data_type: DataType,
nullable: bool,
expr: Arc<dyn PhysicalExpr>,
}

The problem with relying on Min and Max directly is that it is hard/impossible to switch Min/Max to be a user defined aggregates as seen in #11013

Also, relying on Min/Max directly means that there is certain behavior that is not available to UDAFs compared to build in aggregate functions, which isn't ideal

Describe the solution you'd like

Remove all explicit references to Min / Max

Specifically, code that looks like this should be removed

       expr.as_any().downcast_ref::<Max>()

and

       expr.as_any().downcast_ref::<Min>()

Describe alternatives you've considered

Ideally, the alternative is to use a function on AggregateExpr which we can then add/implement for UDAFs when we port min/max to be a UDAF

Task List

Additional context

No response

@edmondop
Copy link
Contributor

Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?

@alamb
Copy link
Contributor Author

alamb commented Jun 30, 2024

Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?

It might make sense to special case is_min and is_max

What we did in the case of ScalarUDFs was to go with functions that described the property in question was rather than what the function was. So for example, for coalesce and some array functions which needed special case type coercion rules, @jayzhan211 added a way to provide user defined coercion rules, rather than a is_coalesce function

I was thinking we would do something similar in this case. However, it if turns out that all such properties are only relevant for min/max maybe having a is_min / is_max API would be better than enumerating all the properties

@edmondop
Copy link
Contributor

@alamb I think this has been solved?

@jayzhan211
Copy link
Contributor

This is not, we need to address the specialization check in aggregate_statistic

@alamb
Copy link
Contributor Author

alamb commented Sep 2, 2024

Specifically I think these checks need to be removed:

fn is_non_distinct_count(agg_expr: &AggregateFunctionExpr) -> bool {
if agg_expr.fun().name() == "count" && !agg_expr.is_distinct() {
return true;
}
false
}
// TODO: Move this check into AggregateUDFImpl
// https://github.com/apache/datafusion/issues/11153
fn is_min(agg_expr: &AggregateFunctionExpr) -> bool {
if agg_expr.fun().name().to_lowercase() == "min" {
return true;
}
false
}
// TODO: Move this check into AggregateUDFImpl
// https://github.com/apache/datafusion/issues/11153
fn is_max(agg_expr: &AggregateFunctionExpr) -> bool {
if agg_expr.fun().name().to_lowercase() == "max" {
return true;
}
false
}

@alamb
Copy link
Contributor Author

alamb commented Sep 6, 2024

Here is one specific suggestion: #12296 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants