-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add approx_median()
aggregate function
#1729
Conversation
@matthewmturner This should close #1486 |
@realno this is great, thank you. im excited to be able to refresh the db-benchmarks results with this and your other work. just a couple questions. I believe all the tests currently have the array data sorted already. Do you think we should have some where it is not sorted? i'm assuming that the Can you also provide more color on the intended handling of nulls? It would help me to understand the below test.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @realno
I wonder if we should call this function approx_median
rather than median
to make it clear it is not an exact calculation?
I left some suggestions about how we might be able to reuse some more of the ApproxQuantile
code to avoid repetition
Overall very nice work and tests. Thank you
cc @domodwyer (approx percentile is already being used 👍 )
4e02875
to
170b8e6
Compare
@alamb Am I missing something by any chance? |
@matthewmturner Great questions. To help clarify I added a new test case hopefully can make it more clear
It also demonstrates how |
I had a thought this morning: what if we didn't introduce an so a query that has select approx_median(x) from foo; could be rewritten to select approx_percentile(x, 5.0) as "approx_median(x)" from foo; Similar to how we transform this may be a silly idea, but I wanted to write it down |
@alamb I think this is a actually a good idea. Let me look at the code you shared. I was thinking about something like this when started looking at median (the exact version). If we have some official support for rewrite query and logical plan that'll help. I was thinking to implement a
can be rewrite to something like
Though rewriting query will introduce potential security risks I think rewriting logical plan is a better option. |
@realno thank you for the explanation and added test - it makes sense. |
@alamb I add another version using optimizer, I think it works too. It is a little cleaner, it does introduced a bit complexity but should run a little faster too. Please take another look and we can decide which version to merge. |
@realno Thanks for this PR! I agree that rewriting the query to use the percentile function is conceptually a bit easier. So a +1 on the optimizer rule approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the plan rewriting approach (though of course I am biased). Thank you @realno
In terms of the planning overhead, I agree it isn't ideal, though I think we can improve things over time by consolidating several of the optimizer passes together
If we need more drastic performance improvements I have wanted to make a LogicalPlanRewriter
(like ExprRewriter
) for a while now that would avoid so many copies -- it is a fair amount of work but all pretty mechanical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I enjoyed reading this to learn how plan rewriting is implemented - it's very elegant, thanks @realno!
Sounds good - will proceed with the optimizer route.
There are a few things I noticed that are not ideal, I will update the PR later we can discuss in more details. I think it is a good idea to have a plan to improve over time - I will create the issues after discussion. Here are the things I noticed for now:
I like this idea, it may also help with the issues I mentioned above. |
Functionality-wise I think I can make it happen. There is a part for dealing with Projection and Alias (will push the code soon) can be a bit controversial, we'll see if we are comfortable with merging as-is. Or if we really want to make it happen, another option is to merge with the wrapper implementation first while working on the optimizer rule. Let me push the code first then we can discuss again with @alamb . |
🤔 I am not sure about how the DF 7.0.0 release is going to to go down (as in if we should wait for specific things or just cut the release "when we are ready") 🤔 |
The PR is ready for final review before merge. FYI, I decided not to add the changes for handling Projection and Alias because the change became quite cumbersome to handle all the corner cases - it needs to traverse the plan twice in order to properly register expression names and schemas; it also does a lot of copies along the way. With the existing approach, the only caveat is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @realno
With the existing approach, the only caveat is ToApproxPerc must be applied last. Considering it is simple and relatively efficient I decided to go with this approach.
I think this makes sense . The test coverage will ensure this continues to work 👍
I wonder if datafusion/src/physical_plan/expressions/approx_median.rs
is still needed. Otherwise I think this PR is ready to go.
Thanks again for sticking with it!
@@ -0,0 +1,75 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file needed anymore? I think it can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still needed as a wrapper for the expression.
datafusion/src/physical_plan/expressions/approx_percentile_cont.rs
Outdated
Show resolved
Hide resolved
Co-authored-by: Andrew Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for sticking with this @realno
approx_median()
aggregate function
Which issue does this PR close?
Closes #1486 .
Rationale for this change
Add the median operator, this should close #1486 and unblock the perf test work.
The current implementation uses
approx_percentile_cont
under the hood. We can look into implement the "exact" version later. More discussions around this can be found in the issue.What changes are included in this PR?
Are there any user-facing changes?
No change to existing APIs. New
median
operator added.