Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add approx_median() aggregate function #1729

Merged
merged 15 commits into from
Feb 9, 2022
Merged

Conversation

realno
Copy link
Contributor

@realno realno commented Feb 2, 2022

Which issue does this PR close?

Closes #1486 .

Rationale for this change

Add the median operator, this should close #1486 and unblock the perf test work.

The current implementation uses approx_percentile_cont under the hood. We can look into implement the "exact" version later. More discussions around this can be found in the issue.

What changes are included in this PR?

Are there any user-facing changes?

No change to existing APIs. New median operator added.

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate documentation Improvements or additions to documentation labels Feb 2, 2022
@realno
Copy link
Contributor Author

realno commented Feb 2, 2022

@matthewmturner This should close #1486

@matthewmturner
Copy link
Contributor

@realno this is great, thank you. im excited to be able to refresh the db-benchmarks results with this and your other work. just a couple questions.

I believe all the tests currently have the array data sorted already. Do you think we should have some where it is not sorted? i'm assuming that the Median function and approx_percentile_cont dont require input data to be sorted in order to use (sry if thats wrong assumption, i havent had chance to look into implementation of approx_percentile_cont).

Can you also provide more color on the intended handling of nulls? It would help me to understand the below test.

#[test]
    fn median_i32_with_nulls() -> Result<()> {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(1),
            None,
            Some(3),
            Some(4),
            Some(5),
        ]));
        generic_test_op!(
            a,
            DataType::Int32,
            Median,
            ScalarValue::from(3),
            DataType::Int32
        )
    }

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @realno

I wonder if we should call this function approx_median rather than median to make it clear it is not an exact calculation?

I left some suggestions about how we might be able to reuse some more of the ApproxQuantile code to avoid repetition

Overall very nice work and tests. Thank you

cc @domodwyer (approx percentile is already being used 👍 )

datafusion/src/physical_plan/expressions/median.rs Outdated Show resolved Hide resolved
datafusion/tests/sql/aggregates.rs Outdated Show resolved Hide resolved
@realno
Copy link
Contributor Author

realno commented Feb 3, 2022

I left some suggestions about how we might be able to reuse some more of the ApproxQuantile code to avoid repetition

@alamb Am I missing something by any chance?

@realno
Copy link
Contributor Author

realno commented Feb 3, 2022

@matthewmturner Great questions. approx_percentile_cont is based on tdigest to calculate the statistics of the input online, so it does not require the data set to be sorted. This (sort) is also the reason that exact median is a bit tricky to be calculated efficiently.

To help clarify I added a new test case hopefully can make it more clear

 fn approx_median_i32_with_nulls_2() -> Result<()> {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(5),
            Some(1),
            None,
            None,
            Some(3),
            Some(4),
        ]));
        generic_test_op!(
            a,
            DataType::Int32,
            ApproxMedian,
            ScalarValue::from(2),
            DataType::Int32
        )
    }

It also demonstrates how null values are handled. They are included in the result, that is, the median is the value (may be Interpolation) positioned in the center of sorted inputs.

@alamb
Copy link
Contributor

alamb commented Feb 3, 2022

@alamb Am I missing something by any chance?

I had a thought this morning: what if we didn't introduce an Aggregate function at all, and instead simply rewrote queries

so a query that has

select approx_median(x) from foo;

could be rewritten to

select approx_percentile(x, 5.0) as "approx_median(x)" from foo;

Similar to how we transform SELECT count(distinct x) from foo to select count(*) from (select x from foo group by x) in https://github.com/realno/arrow-datafusion/blob/add-median-operator/datafusion/src/optimizer/single_distinct_to_groupby.rs#L44

this may be a silly idea, but I wanted to write it down

@realno
Copy link
Contributor Author

realno commented Feb 3, 2022

I had a thought this morning: what if we didn't introduce an Aggregate function at all, and instead simply rewrote queries

@alamb I think this is a actually a good idea. Let me look at the code you shared.

I was thinking about something like this when started looking at median (the exact version). If we have some official support for rewrite query and logical plan that'll help. I was thinking to implement a CombinedOperator that can be expanded during planning phase but didn't have enough time to explore too much - it does not appear to be trivial. For example,

PROJECT
    MEDIAN c1
...

can be rewrite to something like

PROJECT 
    FINDN 
        SORT c1
...

Though rewriting query will introduce potential security risks I think rewriting logical plan is a better option.

@matthewmturner
Copy link
Contributor

@realno thank you for the explanation and added test - it makes sense.

@realno
Copy link
Contributor Author

realno commented Feb 5, 2022

@alamb I add another version using optimizer, I think it works too. It is a little cleaner, it does introduced a bit complexity but should run a little faster too. Please take another look and we can decide which version to merge.

@Dandandan
Copy link
Contributor

Dandandan commented Feb 5, 2022

@realno Thanks for this PR!

I agree that rewriting the query to use the percentile function is conceptually a bit easier.

So a +1 on the optimizer rule approach.

@realno realno changed the title Add median operator Add approx-median operator Feb 5, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the plan rewriting approach (though of course I am biased). Thank you @realno

In terms of the planning overhead, I agree it isn't ideal, though I think we can improve things over time by consolidating several of the optimizer passes together

If we need more drastic performance improvements I have wanted to make a LogicalPlanRewriter (like ExprRewriter) for a while now that would avoid so many copies -- it is a fair amount of work but all pretty mechanical.

Copy link
Contributor

@domodwyer domodwyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enjoyed reading this to learn how plan rewriting is implemented - it's very elegant, thanks @realno!

@matthewmturner
Copy link
Contributor

@alamb @realno do you think that this could be finished in time for inclusion in the 7.0 release? I was hoping it would so we could use in the python bindings for refreshing benchmarks.

@realno
Copy link
Contributor Author

realno commented Feb 7, 2022

Sounds good - will proceed with the optimizer route.

@alamb

In terms of the planning overhead, I agree it isn't ideal, though I think we can improve things over time by consolidating several of the optimizer passes together

There are a few things I noticed that are not ideal, I will update the PR later we can discuss in more details. I think it is a good idea to have a plan to improve over time - I will create the issues after discussion. Here are the things I noticed for now:

  1. For at least aggregate functions some traits/structs from use crate::physical_plan::aggregates are leaked into the logical planning phase, e.g. fun in this code block
 match expr {
        Expr::AggregateFunction {
            fun,
            args,
            distinct,
        } => {
            let mut new_args = args.clone();
            let mut new_func = fun.clone();
            if fun == &aggregates::AggregateFunction::ApproxMedian {
                new_args.push(Expr::Literal(ScalarValue::Float64(Some(0.5_f64))));
                new_func = aggregates::AggregateFunction::ApproxPercentileCont;
            }
  1. Changing functions also includes rewriting Projections and Aliases, this is very tedious especially some util functions are not public and it replicates part of the work for building the plan. Also it may potentially have conflicts with other optimize rules if not super careful. Ideally these kind of of rewrites could happen before building the logical plan, maybe we can introduce a pre-build phase? I am working the part handling Projection and Alias, it'll be more clear looking at the code - I am using string replacement.

If we need more drastic performance improvements I have wanted to make a LogicalPlanRewriter (like ExprRewriter) for a while now that would avoid so many copies -- it is a fair amount of work but all pretty mechanical.

I like this idea, it may also help with the issues I mentioned above.

@realno
Copy link
Contributor Author

realno commented Feb 7, 2022

@alamb @realno do you think that this could be finished in time for inclusion in the 7.0 release? I was hoping it would so we could use in the python bindings for refreshing benchmarks.

Functionality-wise I think I can make it happen. There is a part for dealing with Projection and Alias (will push the code soon) can be a bit controversial, we'll see if we are comfortable with merging as-is.

Or if we really want to make it happen, another option is to merge with the wrapper implementation first while working on the optimizer rule.

Let me push the code first then we can discuss again with @alamb .

@alamb
Copy link
Contributor

alamb commented Feb 7, 2022

🤔 I am not sure about how the DF 7.0.0 release is going to to go down (as in if we should wait for specific things or just cut the release "when we are ready") 🤔

@realno
Copy link
Contributor Author

realno commented Feb 8, 2022

The PR is ready for final review before merge.

FYI, I decided not to add the changes for handling Projection and Alias because the change became quite cumbersome to handle all the corner cases - it needs to traverse the plan twice in order to properly register expression names and schemas; it also does a lot of copies along the way. With the existing approach, the only caveat is ToApproxPerc must be applied last. Considering it is simple and relatively efficient I decided to go with this approach.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @realno

With the existing approach, the only caveat is ToApproxPerc must be applied last. Considering it is simple and relatively efficient I decided to go with this approach.

I think this makes sense . The test coverage will ensure this continues to work 👍

I wonder if datafusion/src/physical_plan/expressions/approx_median.rs is still needed. Otherwise I think this PR is ready to go.

Thanks again for sticking with it!

datafusion/src/optimizer/to_approx_perc.rs Outdated Show resolved Hide resolved
datafusion/src/physical_plan/aggregates.rs Show resolved Hide resolved
@@ -0,0 +1,75 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file needed anymore? I think it can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still needed as a wrapper for the expression.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sticking with this @realno

@alamb alamb merged commit 6e02d2d into apache:master Feb 9, 2022
@realno realno deleted the add-median-operator branch February 9, 2022 20:09
@alamb alamb added enhancement New feature or request and removed documentation Improvements or additions to documentation labels Feb 10, 2022
@alamb alamb changed the title Add approx-median operator Add approx_median() aggregate function Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add median, std, and corr functions
5 participants