-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce reverse_expr
for UDAF
#10214
Conversation
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
/// Typically the "reverse" expression is itself (e.g. SUM, COUNT). | ||
/// For aggregates that do not support calculation in reverse, | ||
/// returns None (which is the default value). | ||
fn reverse_expr(&self) -> ReversedExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to take the owned type, but
function arguments must have a statically known size, borrowed types always have a known size: `&`
)?; | ||
|
||
// TODO: We don't have a nice way to test the change without introducing many other things | ||
// We check with the output string. `ignore nulls` is expeceted to be false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can even remove fn test_reverse_udaf
if first/last are moved to UDAF based.
Thanks @jayzhan211 -- I'll check this out tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thnaks @jayzhan211 -- looking neat. I had some questions about the actual API though
/// The expression is the same as the original expression, like SUM, COUNT | ||
Identical, | ||
/// The expression does not support reverse calculation, like ArrayAgg | ||
NotSupported, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time trying to understand this -- is the reason that ArrayAgg doesn't support reversed calculations that the array elements would be in a different order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because ArrayAgg has no ordering info.
You can also take a look at this comment #9972 (comment)
Although I found that the counter-example might not be correct after a few days, ARRAY_AGG(b ORDER BY c DESC)
is OrderSensitiveArrayAgg, so it is possible to produce reverse expr, but it still makes sense to me that order insensitive ArrayAgg (which is ArrayAgg) does not support reverse_expr
. cc @mustafasrepo
/// The expression does not support reverse calculation, like ArrayAgg | ||
NotSupported, | ||
/// The expression is different from the original expression | ||
Reversed(AggregateFunction), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about returning an AggregateFunction
here -- this method is part of AggregateUDFImpl
trait, but AggregateFunction
has arguments and other fields that I don't think will be accessable to an instance of AggregateUDFImpl
.
datafusion/datafusion/expr/src/expr.rs
Lines 578 to 591 in 4edbdd7
#[derive(Clone, PartialEq, Eq, Hash, Debug)] | |
pub struct AggregateFunction { | |
/// Name of the function | |
pub func_def: AggregateFunctionDefinition, | |
/// List of expressions to feed to the functions as arguments | |
pub args: Vec<Expr>, | |
/// Whether this is a DISTINCT aggregation or not | |
pub distinct: bool, | |
/// Optional filter | |
pub filter: Option<Box<Expr>>, | |
/// Optional ordering | |
pub order_by: Option<Vec<Expr>>, | |
pub null_treatment: Option<NullTreatment>, | |
} |
Would it make more sense to return an AggregateUDF
here:
datafusion/datafusion/expr/src/udaf.rs
Lines 33 to 67 in a325825
/// Logical representation of a user-defined [aggregate function] (UDAF). | |
/// | |
/// An aggregate function combines the values from multiple input rows | |
/// into a single output "aggregate" (summary) row. It is different | |
/// from a scalar function because it is stateful across batches. User | |
/// defined aggregate functions can be used as normal SQL aggregate | |
/// functions (`GROUP BY` clause) as well as window functions (`OVER` | |
/// clause). | |
/// | |
/// `AggregateUDF` provides DataFusion the information needed to plan and call | |
/// aggregate functions, including name, type information, and a factory | |
/// function to create an [`Accumulator`] instance, to perform the actual | |
/// aggregation. | |
/// | |
/// For more information, please see [the examples]: | |
/// | |
/// 1. For simple use cases, use [`create_udaf`] (examples in [`simple_udaf.rs`]). | |
/// | |
/// 2. For advanced use cases, use [`AggregateUDFImpl`] which provides full API | |
/// access (examples in [`advanced_udaf.rs`]). | |
/// | |
/// # API Note | |
/// This is a separate struct from `AggregateUDFImpl` to maintain backwards | |
/// compatibility with the older API. | |
/// | |
/// [the examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples#single-process | |
/// [aggregate function]: https://en.wikipedia.org/wiki/Aggregate_function | |
/// [`Accumulator`]: crate::Accumulator | |
/// [`create_udaf`]: crate::expr_fn::create_udaf | |
/// [`simple_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udaf.rs | |
/// [`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs | |
#[derive(Debug, Clone)] | |
pub struct AggregateUDF { | |
inner: Arc<dyn AggregateUDFImpl>, | |
} |
Maybe it would help guide the API design to implement this API for first_value/last_value udafs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the reverse expr in OrderSensitiveArrayAgg
fn reverse_expr(&self) -> Option<Arc<dyn AggregateExpr>> {
Some(Arc::new(Self {
name: self.name.to_string(),
input_data_type: self.input_data_type.clone(),
expr: self.expr.clone(),
nullable: self.nullable,
order_by_data_types: self.order_by_data_types.clone(),
// Reverse requirement:
ordering_req: reverse_order_bys(&self.ordering_req),
reverse: !self.reverse,
}))
}
We need the ordering info, it is not contained in either AggregateUDF
or AggregateUDFImpl
function. Therefore, I return AggregateFunction
instead. Then, we can create AggregateExec with create_physical_expr
to get the physical-expr from these logical exprs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I wonder why does the AggregateUDFImpl
need to actually reverse the ordering? Couldn't the code that calls reverse_expr()
do the actual call to reverse_order_by
?
As in maybe AggregateFunctionExpr::reverse_expr
could call reverse_order_by
and AggregateUdfImpl::reverse_expr
would only return an AggregateUDF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me take a look at the code that uses reverse_expr
, I guess some kind of rewrite may help.
@@ -39,3 +39,4 @@ path = "src/lib.rs" | |||
arrow = { workspace = true } | |||
datafusion-common = { workspace = true, default-features = true } | |||
datafusion-expr = { workspace = true } | |||
sqlparser = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally the physical expr shouldn't be depending on sql parser (for people who are not using SQL)
Though now I see that perhaps it is because AggregateFunction has a null treatment flag on it 🤔 That might be a nice dependency to avoid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is easy to avoid this dependency, we can check this token in parser, then pass boolean all the way down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me file an issue for this
Which issue does this PR close?
Closes #.
Part of #10091
Rationale for this change
reverse_expr
has a different meaning between returning clone and None. Together with reversing, there are three cases.I think the proposed Enum makes the user understand what they can do in
reverse_expr
.What changes are included in this PR?
Are these changes tested?
I added a simple UDAF for testing. I want to avoid adding too many other things just for test only because most of them can be covered after moving UDAF from built-in.
Are there any user-facing changes?