-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF #10917
Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF #10917
Conversation
// TODO: 'logical_exprs' is not supported for UDAF yet. | ||
// approx_percentile_cont and approx_percentile_cont_weight are not supported for UDAF from protobuf yet. | ||
let logical_exprs = &[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find an easy way to get the logical plan from protobuf. I think maybe we need to convert the physical plan back to the logical plan. Otherwise, approx_percentile_cont
and approx_percentile_cont_weighted
can't be supported for protobuf.
@jayzhan211, what do you think? Are there any existing functions or tools that can do this?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the error if we leave it empty? I run the test in logical plan that includes approx_percentile_cont
and it seems correct 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will be correct for the logical plan. However, I think it will cause some issues when processing the physical plan. Curiously, should we expect the physical roundtrip to work? I tried to add a roundtrip test for it indatafusion/proto/tests/cases/roundtrip_physical_plan.rs
like:
#[test]
fn roundtrip_approx_percentile_cont_udaf() -> Result<()> {
let field_a = Field::new("a", DataType::Int64, false);
let field_b = Field::new("b", DataType::Float64, false);
let schema = Arc::new(Schema::new(vec![field_a, field_b]));
let udaf = (*approx_percentile_cont_udaf()).clone();
let ctx = SessionContext::new();
ctx.register_udaf(udaf.clone());
let groups: Vec<(Arc<dyn PhysicalExpr>, String)> =
vec![(col("a", &schema)?, "unused".to_string())];
let aggregates: Vec<Arc<dyn AggregateExpr>> = vec![udaf::create_aggregate_expr(
&udaf,
&[col("b", &schema)?, lit(ScalarValue::Float64(Some(0.5)))],
&[datafusion_expr::col("b"), datafusion_expr::lit(ScalarValue::Float64(Some(0.5)))],
&[],
&[],
&schema,
"approx_percentile_cont_agg",
false,
false,
)?];
roundtrip_test_with_context(
Arc::new(AggregateExec::try_new(
AggregateMode::Final,
PhysicalGroupBy::new_single(groups.clone()),
aggregates.clone(),
vec![None],
Arc::new(EmptyExec::new(schema.clone())),
schema,
)?),
&ctx,
)
}
It will lose the logical_plan when converting back to the physical plan from protobuf.
Is this test necessary? I only see an physical-plan roudtrip testing of udaf called roundtrip_aggregate_udaf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to care about logical plan conversion in physical plan roundtrip, but I'm not pretty sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I thought Flight SQL might be a use case for from_proto
. However, I tried to query approx_percentile_cont
through Flight SQL, and it worked well. Maybe we can just ignore it and leave this TODO comment until we find any from_proto
use case.
It's really weird. I can't reproduce the CI failed in my local environment.
The error message thrown in my local is difference from the CI environment... :( |
I usually just remove the error message, although it is not an idea solution, but I don't know why the CI message is different from local too. |
} | ||
} | ||
Self { | ||
signature: Signature::one_of(variants, Volatility::Immutable), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to have Signature::one_of(vec![Numeric::(2), Numeric::(3)])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found we can't use Signature::numeric
. I tried to use it, but I discovered that all the input arguments will have the same type as the first argument. For example, in a SQL query like:
select approx_percentile_cont(float64_col, 0.5, 100) from t
the input arguments will be vec![Float64, Float64, Float64]
. However, the last argument should be int64
. This will cause a failure.
I'm not sure if it's related, but I found something weird. Why is AccumulatorArgs#input_type
not a vec
but a single datatype? If the number of arguments is greater than 1, how do we handle the other arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is single type because we have not yet meet any function that takes multiple input to build accumulator. If there is, then we might need to extend it to Vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the input arguments will be vec![Float64, Float64, Float64]
Oh, yes. It will coerce types to the unify one.
&self, | ||
args: AccumulatorArgs, | ||
) -> datafusion_common::Result<ApproxPercentileAccumulator> { | ||
let percentile = validate_input_percentile_expr(&args.args[1])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we rename args.args
to args.input_exprs
? I think it would be more clear and consistent with the name of args.input_type
&self, | ||
acc_args: AccumulatorArgs, | ||
) -> datafusion_common::Result<Box<dyn Accumulator>> { | ||
Ok(Box::new(self.create_accumulator(acc_args)?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can inline the code of create_accumulator
here
}) | ||
pub fn new() -> Self { | ||
Self { | ||
signature: Signature::one_of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use Signature::Numeric(3)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as #10917 (comment)
"approx_percentile_cont_with_weight requires numeric weight input types" | ||
); | ||
} | ||
if !arg_types[2].is_floating() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with signature::numeric(3), we can just check if the third arg is f64
&& self.column_expr.eq(&x.column_expr) | ||
&& self.weight_expr.eq(&x.weight_expr) | ||
&& self.percentile_expr.eq(&x.percentile_expr) | ||
self.signature == x.signature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this, we can compare them with the function name
// TODO: 'logical_exprs' is not supported for UDAF yet. | ||
// approx_percentile_cont and approx_percentile_cont_weight are not supported for UDAF from protobuf yet. | ||
let logical_exprs = &[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to care about logical plan conversion in physical plan roundtrip, but I'm not pretty sure.
I modified the expected pattern of error message to match the message thrown in CI flow. The full message is
I noticed that the CI flow is missing the tail part of the error message: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @jayzhan211 approved it. : ) |
Sure! |
Thanks again @jayzhan211 :) |
…AF (apache#10917) * pass logical expr of arguments for udaf * implement approx_percentile_cont udaf * register udaf * remove ApproxPercentileCont * convert with_wegiht to udaf and remove original * fix conflict * fix compile check * fix doc and testing * evaluate args through physical plan * public use Literal * fix tests * rollback the experimental tests * remove unused import * rename args and inline code * remove unnecessary partial eq trait * fix error message
… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
* chore: update datafusion deps * feat: impl ExecutionPlan::static_name() for DatasetExec This required trait method was added upstream [0] and recommends to simply forward to `static_name`. [0]: apache/datafusion#10266 * feat: update first_value and last_value wrappers. Upstream signatures were changed for the new new `AggregateBuilder` api [0]. This simply gets the code to work. We should better incorporate that API into `datafusion-python`. [0] apache/datafusion#10560 * migrate count to UDAF Builtin Count was removed upstream. TBD whether we want to re-implement `count_star` with new API. Ref: apache/datafusion#10893 * migrate approx_percentile_cont, approx_distinct, and approx_median to UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917 * migrate avg to UDAF Ref: apache/datafusion#10964 * migrage corr to UDAF Ref: apache/datafusion#10884 * migrate grouping to UDAF Ref: apache/datafusion#10906 * add alias `mean` for UDAF `avg` * migrate stddev to UDAF Ref: apache/datafusion#10827 * remove rust alias for stddev The python wrapper now provides stddev_samp alias. * migrage var_pop to UDAF Ref: apache/datafusion#10836 * migrate regr_* functions to UDAF Ref: apache/datafusion#10898 * migrate bitwise functions to UDAF The functions now take a single expression instead of a Vec<_>. Ref: apache/datafusion#10930 * add missing variants for ScalarValue with todo * fix typo in approx_percentile_cont * add distinct arg to count * comment out failing test `approx_percentile_cont` is now returning a DoubleArray instead of an IntArray. This may be a bug upstream; it requires further investigation. * update tests to expect lowercase `sum` in query plans This was changed upstream. Ref: apache/datafusion#10831 * update ScalarType data_type map * add docs dependency pickleshare * re-implement count_star * lint: ruff python lint * lint: rust cargo fmt * include name of window function in error for find_window_fn * refactor `find_window_fn` for debug clarity * search default aggregate functions by both name and aliases The alias list no longer includes the name of the function. Ref: apache/datafusion#10658 * fix markdown in find_window_fn docs * parameterize test_window_functions `first_value` and `last_value` are currently failing and marked as xfail. * add test ids to test_simple_select tests marked xfail * update find_window_fn to search built-ins first The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior. This allowed me to remove `marks=pytest.xfail` from the window tests. * improve first_call and last_call use of the builder API * remove trailing todos * fix examples/substrait.py * chore: remove explicit aliases from functions.rs Ref: #779 * remove `array_fn!` aliases * remove alias rules for `expr_fn_vec!` * remove alias rules from `expr_fn!` macro * remove unnecessary pyo3 var-arg signatures in functions.rs * remove pyo3 signatures that provided defaults for first_value and last_value * parametrize test_string_functions * test regr_ function wrappers Closes #778
Which issue does this PR close?
Closes #10870 and convert ApproxPercentileContWithWeight to udaf.
Rationale for this change
What changes are included in this PR?
args
toAccumulatorArgs
.AccumulatorArgs
, useargs.len()
instead.ApproxPercentileCont
andApproxPercentileContWithWeight
to udafAre these changes tested?
yes
Are there any user-facing changes?
no