Add median, std, and corr functions #1486
Comments
I am interested in helping with this. I am very curious about the results of db-benchmark as well. Does it make sense to create separate issues for the functions? I started looking into stddev, and maybe somebody else wants to work on the others. @matthewmturner @houqp
@realno happy to have the help :) I actually just updated my PR to get datafusion included in db-benchmark with the current features, hopefully getting close. Separate issues per function sounds good. We can just use this as a tracker issue - I'll add a task list to it.
PR for standard deviation is up for review: #1525. Please let me know who I should add as reviewer.
I have an update/question for this issue: There are some operations needed to calculate median: 1. sort 2. count 3. get nth value. I thought about a few options:
My preference is in the order of 3 > 2 > 1. I'd like to see more opinions before moving forward.
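The three operations listed above (sort, count, get nth value) can be sketched in plain Rust as follows. This is only an illustration on an in-memory slice, not DataFusion code and not one of the options under discussion; the function name `exact_median` and the choice to average the two middle values for even-length input are assumptions made for the example.

```rust
/// Exact median of a slice using the three steps named in the thread:
/// sort, count, then take the middle value(s).
/// Hypothetical helper for illustration; returns `None` for empty input.
fn exact_median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    // The whole input is copied and buffered before any result is known.
    let mut sorted = values.to_vec();
    // Step 1: sort (total_cmp gives a total order over f64, including NaN).
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Step 2: count.
    let n = sorted.len();
    // Step 3: nth value. For an even count, average the two middle values
    // (an assumption; other conventions exist).
    let mid = n / 2;
    if n % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

For example, `exact_median(&[3.0, 1.0, 2.0])` returns `Some(2.0)` and `exact_median(&[4.0, 1.0, 3.0, 2.0])` returns `Some(2.5)`.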
@realno you can take this with a grain of salt as I am new to this. My thinking is that I would prefer to see the exact median implementation before having an approximate one (i.e., the approximate would be an add-on feature). I could be wrong but I believe datafusion had Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:
I suppose this would be somewhere between your Option 1 and Option 2. I definitely defer to @alamb though.
Thanks for the comments @matthewmturner. I am also new and wouldn't call myself a database internals expert :) Yes, we have all the functionality ready; the complication is what's the best/most efficient way to implement this. I definitely want to hear more opinions on this. Do you think it is worth having an approximation to unblock the perf benchmark work?
(we should probably bring this discussion into a new ticket, FWIW)
TLDR is I think The key problem is that you basically need to have buffered the entire input stream before you know what the output is
Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable.
To implement an exact median, you could implement an If you wanted to try and reuse existing operators (e.g. The mapping of
Another challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together)
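The point about partial aggregation can be illustrated with a small sketch. It assumes the input is split into in-memory partitions; all function names are hypothetical and none of this reflects DataFusion's actual aggregate API. Summing partial sums gives the exact total, while taking the median of per-partition medians generally does not give the true median.

```rust
/// Exact median of a fully buffered slice; `None` when the slice is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// SUM: each partition is reduced to one number, then the partials are summed.
fn sum_via_partials(partitions: &[Vec<f64>]) -> f64 {
    let partials: Vec<f64> = partitions.iter().map(|p| p.iter().sum::<f64>()).collect();
    partials.iter().sum()
}

/// Naive attempt to do the same for MEDIAN (incorrect in general):
/// median of each partition, then the median of those medians.
fn median_via_partials(partitions: &[Vec<f64>]) -> Option<f64> {
    let partials: Vec<f64> = partitions.iter().filter_map(|p| median(p)).collect();
    median(&partials)
}

/// Exact MEDIAN: concatenate every partition, i.e. buffer the whole input.
fn median_of_all(partitions: &[Vec<f64>]) -> Option<f64> {
    let all: Vec<f64> = partitions.iter().flatten().copied().collect();
    median(&all)
}
```

With the partitions `[1, 2, 9]` and `[3, 4, 5, 6, 7]`, the partial sums 12 and 25 combine to the correct total of 37. The per-partition medians are 2 and 5, whose median is 3.5, whereas the median of all eight values is 4.5.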
In principle we could aggregate the partials via a But I agree that in the majority of cases we want an online estimate.
Let me move the conversation to a new issue. I agree with @jorgecarleitao that we could introduce an API in addition to the current Aggregator to provide the functionality - the current For the sake of this task, I suggest we close with the approximation version, what do you think? @matthewmturner are you comfortable using the approximate version for your benchmark work?
@realno practically, yes, definitely okay with the approximate version. Within the context of the performance benchmark my only concern is that we are able to produce the expected results. For example, I see clickhouse uses
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In the context of adding datafusion to db-benchmark (#147), some of the advanced group by queries being benchmarked require median, standard deviation, and correlation functions, which datafusion does not currently provide out of the box.
Describe the solution you'd like
Builtin support for median, standard deviation, and correlation functions.
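For reference, the two non-median functions can be sketched in plain Rust as below. This is a minimal illustration of the standard definitions, not the implementation in #1525 or any DataFusion API: `VarianceState`, `variance_state`, and `pearson_corr` are hypothetical names. The variance state uses Welford's update and the pairwise merge formula of Chan et al., so partial states from separate partitions can be combined; the correlation is the Pearson coefficient computed in two passes.

```rust
/// Mergeable running state for variance: count, mean, and the sum of
/// squared deviations from the mean (often called M2). Hypothetical type.
#[derive(Debug, Clone, Copy, Default)]
struct VarianceState {
    count: u64,
    mean: f64,
    m2: f64,
}

impl VarianceState {
    /// Welford's update for a single new value.
    fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Combine two partial states (pairwise formula of Chan et al.).
    fn merge(&self, other: &VarianceState) -> VarianceState {
        if self.count == 0 {
            return *other;
        }
        if other.count == 0 {
            return *self;
        }
        let count = self.count + other.count;
        let delta = other.mean - self.mean;
        let (na, nb, n) = (self.count as f64, other.count as f64, count as f64);
        VarianceState {
            count,
            mean: self.mean + delta * nb / n,
            m2: self.m2 + other.m2 + delta * delta * na * nb / n,
        }
    }

    /// Sample standard deviation (divides by n - 1); `None` below two values.
    fn stddev_samp(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some((self.m2 / (self.count - 1) as f64).sqrt())
        }
    }

    /// Population standard deviation (divides by n); `None` when empty.
    fn stddev_pop(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some((self.m2 / self.count as f64).sqrt())
        }
    }
}

/// Build a state from one partition of values.
fn variance_state(values: &[f64]) -> VarianceState {
    let mut state = VarianceState::default();
    for &x in values {
        state.update(x);
    }
    state
}

/// Pearson correlation of two equal-length slices, computed in two passes.
/// Returns `None` for mismatched lengths, fewer than two pairs, or zero variance.
fn pearson_corr(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (x, y) in xs.iter().zip(ys) {
        let (dx, dy) = (x - mean_x, y - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    let denom = (sxx * syy).sqrt();
    if denom == 0.0 {
        None
    } else {
        Some(sxy / denom)
    }
}
```

Unlike median, the variance state is small and mergeable, so it fits a partial-aggregation pattern: `variance_state(a).merge(&variance_state(b))` yields the same standard deviation as a single pass over the concatenated input, up to floating-point rounding.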