Add `approx_median()` aggregate function #1729

realno · 2022-02-02T07:46:10Z

Which issue does this PR close?

Closes #1486 .

Rationale for this change

Add the median operator. This should close #1486 and unblock the perf test work.

The current implementation uses approx_percentile_cont under the hood. We can look into implementing the "exact" version later. More discussion of this can be found in the issue.

What changes are included in this PR?

Are there any user-facing changes?

No change to existing APIs. New median operator added.

realno · 2022-02-02T07:46:50Z

@matthewmturner This should close #1486

matthewmturner · 2022-02-02T17:34:36Z

@realno this is great, thank you. I'm excited to be able to refresh the db-benchmarks results with this and your other work. Just a couple of questions.

I believe all the tests currently have the array data sorted already. Do you think we should have some where it is not sorted? I'm assuming that the Median function and approx_percentile_cont don't require the input data to be sorted (sorry if that's a wrong assumption, I haven't had a chance to look into the implementation of approx_percentile_cont).

Can you also provide more color on the intended handling of nulls? It would help me to understand the below test.

#[test]
fn median_i32_with_nulls() -> Result<()> {
    let a: ArrayRef = Arc::new(Int32Array::from(vec![
        Some(1),
        None,
        Some(3),
        Some(4),
        Some(5),
    ]));
    generic_test_op!(
        a,
        DataType::Int32,
        Median,
        ScalarValue::from(3),
        DataType::Int32
    )
}

alamb

Thank you @realno

I wonder if we should call this function approx_median rather than median to make it clear it is not an exact calculation?

I left some suggestions about how we might be able to reuse some more of the ApproxQuantile code to avoid repetition

Overall very nice work and tests. Thank you

cc @domodwyer (approx percentile is already being used 👍 )

datafusion/src/physical_plan/expressions/median.rs

datafusion/tests/sql/aggregates.rs

realno · 2022-02-03T07:07:53Z

> I left some suggestions about how we might be able to reuse some more of the ApproxQuantile code to avoid repetition

@alamb Am I missing something by any chance?

realno · 2022-02-03T07:21:12Z

@matthewmturner Great questions. approx_percentile_cont is based on t-digest, which computes the statistics of the input online, so it does not require the data set to be sorted. The need to sort is also the reason an exact median is tricky to calculate efficiently.

To help clarify, I added a new test case that hopefully makes this clearer:

#[test]
fn approx_median_i32_with_nulls_2() -> Result<()> {
    let a: ArrayRef = Arc::new(Int32Array::from(vec![
        Some(5),
        Some(1),
        None,
        None,
        Some(3),
        Some(4),
    ]));
    generic_test_op!(
        a,
        DataType::Int32,
        ApproxMedian,
        ScalarValue::from(2),
        DataType::Int32
    )
}

It also demonstrates how null values are handled: they are included in the result. That is, the median is the value (possibly interpolated) positioned at the center of the sorted inputs.
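For contrast with the online t-digest approach described above, here is a minimal sketch of an exact median. It shows why the exact version needs every value in memory and a sort. The function name `exact_median` is hypothetical and not part of this PR, and skipping nulls is an assumption of the sketch; it is not meant to reproduce the expected values in the tests above.

```rust
/// Exact median of the non-null values (hypothetical helper, not from the PR).
/// Returns None when there are no non-null values.
fn exact_median(values: &[Option<i32>]) -> Option<f64> {
    // Assumption of this sketch: nulls are dropped before computing the median.
    let mut v: Vec<i32> = values.iter().flatten().copied().collect();
    if v.is_empty() {
        return None;
    }
    // The sort is the expensive step: O(n log n) and all values held in memory.
    v.sort_unstable();
    let mid = v.len() / 2;
    if v.len() % 2 == 1 {
        // Odd count: the single central value.
        Some(f64::from(v[mid]))
    } else {
        // Even count: interpolate between the two central values.
        Some((f64::from(v[mid - 1]) + f64::from(v[mid])) / 2.0)
    }
}
```

The input order does not affect the result because the function sorts its own copy; a streaming sketch such as t-digest avoids that sort at the cost of an approximate answer.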

alamb · 2022-02-03T11:54:01Z

> @alamb Am I missing something by any chance?

I had a thought this morning: what if we didn't introduce an Aggregate function at all, and instead simply rewrote queries

so a query that has

select approx_median(x) from foo;

could be rewritten to

select approx_percentile_cont(x, 0.5) as "approx_median(x)" from foo;

Similar to how we transform SELECT count(distinct x) from foo to select count(*) from (select x from foo group by x) in https://github.com/realno/arrow-datafusion/blob/add-median-operator/datafusion/src/optimizer/single_distinct_to_groupby.rs#L44

this may be a silly idea, but I wanted to write it down
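A minimal sketch of the rewrite idea above, using a toy expression type rather than DataFusion's real `Expr`. The enum, `display_name`, and `rewrite_approx_median` are all hypothetical names chosen for illustration. The sketch only rewrites a top-level aggregate and wraps the result in an alias so the output column keeps its original name.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(f64),
    /// Aggregate call: function name plus arguments.
    Aggregate { fun: String, args: Vec<Expr> },
    /// `expr AS name`
    Alias(Box<Expr>, String),
}

/// Render an expression roughly the way a SQL column name would look.
fn display_name(e: &Expr) -> String {
    match e {
        Expr::Column(c) => c.clone(),
        Expr::Literal(v) => format!("{}", v),
        Expr::Aggregate { fun, args } => {
            let rendered: Vec<String> = args.iter().map(display_name).collect();
            format!("{}({})", fun, rendered.join(", "))
        }
        Expr::Alias(_, name) => name.clone(),
    }
}

/// Rewrite `approx_median(x)` into `approx_percentile_cont(x, 0.5)`,
/// aliased back to the original name. Other expressions are returned unchanged.
fn rewrite_approx_median(e: &Expr) -> Expr {
    match e {
        Expr::Aggregate { fun, args } if fun == "approx_median" => {
            let mut new_args = args.clone();
            // The median is the 0.5 quantile.
            new_args.push(Expr::Literal(0.5));
            let rewritten = Expr::Aggregate {
                fun: "approx_percentile_cont".to_string(),
                args: new_args,
            };
            // Keep the user-visible column name stable.
            Expr::Alias(Box::new(rewritten), display_name(e))
        }
        other => other.clone(),
    }
}
```

Rewriting `approx_median(x)` with this sketch yields an alias named `approx_median(x)` whose inner expression displays as `approx_percentile_cont(x, 0.5)`.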

realno · 2022-02-03T16:37:11Z

> I had a thought this morning: what if we didn't introduce an Aggregate function at all, and instead simply rewrote queries

@alamb I think this is actually a good idea. Let me look at the code you shared.

I was thinking about something like this when I started looking at median (the exact version). Official support for rewriting queries and logical plans would help. I considered implementing a CombinedOperator that can be expanded during the planning phase, but didn't have enough time to explore it in depth - it does not appear to be trivial. For example,

PROJECT
    MEDIAN c1
...

can be rewritten to something like

PROJECT 
    FINDN 
        SORT c1
...

Since rewriting the query text could introduce security risks, I think rewriting the logical plan is the better option.
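The plan-level expansion sketched above could look roughly like the following. This is a toy plan tree, not DataFusion's `LogicalPlan`; the `Plan` enum, the `FindN` node, and `expand_median` are hypothetical names taken from or invented for the example in this comment.

```rust
/// Toy logical plan (hypothetical; not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    /// Combined operator to be expanded during planning.
    Median { column: String, input: Box<Plan> },
    Sort { column: String, input: Box<Plan> },
    /// Hypothetical operator: pick the central row of its sorted input.
    FindN { input: Box<Plan> },
    Project { input: Box<Plan> },
}

/// Recursively replace every `Median` node with `FindN` over `Sort`.
fn expand_median(plan: Plan) -> Plan {
    match plan {
        Plan::Median { column, input } => Plan::FindN {
            input: Box::new(Plan::Sort {
                column,
                input: Box::new(expand_median(*input)),
            }),
        },
        Plan::Sort { column, input } => Plan::Sort {
            column,
            input: Box::new(expand_median(*input)),
        },
        Plan::FindN { input } => Plan::FindN {
            input: Box::new(expand_median(*input)),
        },
        Plan::Project { input } => Plan::Project {
            input: Box::new(expand_median(*input)),
        },
        // Leaf node: nothing to expand.
        scan @ Plan::Scan(_) => scan,
    }
}
```

Because the rewrite operates on typed plan nodes rather than on SQL text, no user-supplied string is re-parsed, which is the property the comment above prefers.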

matthewmturner · 2022-02-03T16:56:55Z

@realno thank you for the explanation and added test - it makes sense.

realno · 2022-02-05T07:40:40Z

@alamb I added another version using the optimizer, and I think it works too. It is a little cleaner; it does introduce a bit of complexity but should also run a little faster. Please take another look and we can decide which version to merge.

Dandandan · 2022-02-05T16:30:54Z

@realno Thanks for this PR!

I agree that rewriting the query to use the percentile function is conceptually a bit easier.

So a +1 on the optimizer rule approach.

alamb

I also like the plan rewriting approach (though of course I am biased). Thank you @realno

In terms of the planning overhead, I agree it isn't ideal, though I think we can improve things over time by consolidating several of the optimizer passes together

If we need more drastic performance improvements I have wanted to make a LogicalPlanRewriter (like ExprRewriter) for a while now that would avoid so many copies -- it is a fair amount of work but all pretty mechanical.

domodwyer

I enjoyed reading this to learn how plan rewriting is implemented - it's very elegant, thanks @realno!

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs

datafusion/src/physical_plan/expressions/approx_median_old.rs

matthewmturner · 2022-02-07T17:41:06Z

@alamb @realno do you think that this could be finished in time for inclusion in the 7.0 release? I was hoping it would so we could use it in the Python bindings for refreshing benchmarks.

realno · 2022-02-07T20:02:58Z

Sounds good - will proceed with the optimizer route.

@alamb

> In terms of the planning overhead, I agree it isn't ideal, though I think we can improve things over time by consolidating several of the optimizer passes together

There are a few things I noticed that are not ideal. I will update the PR later so we can discuss them in more detail. I think it is a good idea to have a plan to improve over time - I will create the issues after the discussion. Here are the things I have noticed so far:

At least for aggregate functions, some traits/structs from crate::physical_plan::aggregates leak into the logical planning phase, e.g. fun in this code block

 match expr {
        Expr::AggregateFunction {
            fun,
            args,
            distinct,
        } => {
            let mut new_args = args.clone();
            let mut new_func = fun.clone();
            if fun == &aggregates::AggregateFunction::ApproxMedian {
                new_args.push(Expr::Literal(ScalarValue::Float64(Some(0.5_f64))));
                new_func = aggregates::AggregateFunction::ApproxPercentileCont;
            }

Changing functions also requires rewriting Projections and Aliases. This is very tedious, especially since some util functions are not public, and it replicates part of the work of building the plan. It may also conflict with other optimizer rules if we are not very careful. Ideally these kinds of rewrites would happen before building the logical plan; maybe we can introduce a pre-build phase? I am working on the part that handles Projection and Alias, and it will be clearer when looking at the code - I am using string replacement.

> If we need more drastic performance improvements I have wanted to make a LogicalPlanRewriter (like ExprRewriter) for a while now that would avoid so many copies -- it is a fair amount of work but all pretty mechanical.

I like this idea, it may also help with the issues I mentioned above.

realno · 2022-02-07T20:12:15Z

> @alamb @realno do you think that this could be finished in time for inclusion in the 7.0 release? I was hoping it would so we could use in the python bindings for refreshing benchmarks.

Functionality-wise I think I can make it happen. The part that deals with Projection and Alias (I will push the code soon) may be a bit controversial; we'll see if we are comfortable merging it as-is.

Or, if we really want to make it happen, another option is to merge the wrapper implementation first while continuing to work on the optimizer rule.

Let me push the code first then we can discuss again with @alamb .

alamb · 2022-02-07T21:15:41Z

🤔 I am not sure how the DF 7.0.0 release is going to go down (as in whether we should wait for specific things or just cut the release "when we are ready") 🤔

realno · 2022-02-08T07:40:46Z

The PR is ready for final review before merge.

FYI, I decided not to add the changes for handling Projection and Alias because handling all the corner cases became quite cumbersome: the change needs to traverse the plan twice in order to properly register expression names and schemas, and it also makes a lot of copies along the way. With the existing approach, the only caveat is that ToApproxPerc must be applied last. Considering that it is simple and relatively efficient, I decided to go with this approach.

alamb

Thank you @realno

> With the existing approach, the only caveat is ToApproxPerc must be applied last. Considering it is simple and relatively efficient I decided to go with this approach.

I think this makes sense. The test coverage will ensure this continues to work 👍

I wonder if datafusion/src/physical_plan/expressions/approx_median.rs is still needed. Otherwise I think this PR is ready to go.

Thanks again for sticking with it!

datafusion/src/optimizer/to_approx_perc.rs

datafusion/src/physical_plan/aggregates.rs

alamb · 2022-02-08T11:51:19Z

datafusion/src/physical_plan/expressions/approx_median.rs

@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one


Is this file needed anymore? I think it can be removed.

realno · 2022-02-09

It is still needed as a wrapper for the expression.

datafusion/src/physical_plan/expressions/approx_percentile_cont.rs

Co-authored-by: Andrew Lamb <[email protected]>

alamb

Thank you for sticking with this @realno

realno added 2 commits February 1, 2022 23:41

add median operator

bab7b46

update doc

9b9abe3

github-actions bot added the ballista, datafusion, and documentation labels Feb 2, 2022

alamb reviewed Feb 2, 2022

datafusion/src/physical_plan/expressions/median.rs (outdated)

datafusion/tests/sql/aggregates.rs (outdated)

realno added 2 commits February 2, 2022 23:01

rename median to approx_median

07f5819

rename median to approx_median

170b8e6

realno force-pushed the add-median-operator branch from 4e02875 to 170b8e6 Compare February 3, 2022 07:05

add doc

03b64df

realno added 2 commits February 5, 2022 02:05

test optimizer

56b1ad3

try rewriting logical plan

9e63810

realno changed the title ~~Add median operator~~ Add approx-median operator Feb 5, 2022

realno added 3 commits February 6, 2022 12:46

move rewrite rule to earlier stages

89fcd51

fix lint

1984854

move the rule after projection push down

b9425d8

alamb reviewed Feb 7, 2022

domodwyer reviewed Feb 7, 2022

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs (outdated)

datafusion/src/physical_plan/expressions/approx_median_old.rs (outdated)

realno added 2 commits February 7, 2022 23:24

get ready to merge

7ddf81b

Merge branch 'master' into add-median-operator

138bc06

remove unused function

5c514fb

alamb reviewed Feb 8, 2022

realno and others added 2 commits February 8, 2022 17:09

remove commented out code

edf3495

Update datafusion/src/optimizer/to_approx_perc.rs

f48db26

Co-authored-by: Andrew Lamb <[email protected]>

alamb approved these changes Feb 9, 2022

alamb merged commit 6e02d2d into apache:master Feb 9, 2022

realno deleted the add-median-operator branch February 9, 2022 20:09

alamb added the enhancement label and removed the documentation label Feb 10, 2022

alamb changed the title ~~Add approx-median operator~~ Add approx_median() aggregate function Feb 10, 2022
