Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add median, std, and corr functions #1486

Closed
matthewmturner opened this issue Dec 26, 2021 · 10 comments · Fixed by #1729
Closed

Add median, std, and corr functions #1486

matthewmturner opened this issue Dec 26, 2021 · 10 comments · Fixed by #1729
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@matthewmturner
Copy link
Contributor

matthewmturner commented Dec 26, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In the context of adding datafusion to db-benchmark (#147) there are some advanced group by queries that are benchmarked which require median, standard deviation, and correlation functions which datafusion does not currently provide out of the box.

Describe the solution you'd like

Builtin support for median, standard deviation, and correlation functions.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@matthewmturner matthewmturner added the enhancement New feature or request label Dec 26, 2021
@houqp houqp added the help wanted Extra attention is needed label Dec 26, 2021
@realno
Copy link
Contributor

realno commented Jan 4, 2022

I am interested in helping with this - I am very curious about the result of the db-benchmark as well and want to help. Does it make sense to create separate issues for the functions? I started looking into stddev and maybe somebody else want to work on the others. @matthewmturner @houqp

@matthewmturner
Copy link
Contributor Author

@realno happy to have the help :) i actually just updated my PR to get datafusion included in db-benchmark with the current features. hopefully getting close.

separate issues per function sounds good. we can just use this as a tracker issue - ill add a task list to it.

@realno
Copy link
Contributor

realno commented Jan 6, 2022

PR for standard deviation is up for review: #1525

Please let me know who I should add as reviewer.

@realno
Copy link
Contributor

realno commented Jan 19, 2022

I have a update/question for this issue: stddev and corr have been merged few days ago. I took a look at median today and had an initial idea how to implement it. Now the problem is the implementation might be somewhat controversial.

There are some operations needed to calculate median: 1. sort 2. count 3. get nth value. I thought about few options:

  1. Implement a new function/operator - the function will look like sort, we may need to add a new type of expression for it. It feels to be a lot of overhead just for median.
  • Pros: No mess around plan builder and query parsing logic
  • Cons: Need to add a specific expression just for median; there may be some duplicate logic regarding sort and count
  1. Reuse some of the existing code through rewriting the logic plan. If we can have sort to keep the total number of record, we can add a new function to rewrite the logical plan to something like FIND(n) | SORT(col). But currently there is no such behavior in the code base.
  • Pros: Better code reuse; May potentially provide a way to handle compound (high-order) functions, e.g. Function1 -> FunctionA(FunctionB(col))
  • Cons: Need to modify logic for building plans;
  1. Use approximate algorithms like KLL or t-digest. This way it can fit in existing aggregator API, but the result will be an approximation. There is already a PR Add approx_percentile_cont() aggregation function #1539, we can help merge that then implement median as quantile(0.5).
  • Pros: Easy to implement; Using existing aggregator API, Much better performance
  • Cons: The result is an approximation

My preference is in the order of 3 > 2 >1. I'd like to see more opinions before moving forward.

@matthewmturner @alamb

@matthewmturner
Copy link
Contributor Author

@realno you can take this with a grain of salt as I am new to this.

My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature). I could be wrong but I believe datafusion had DISTINCT before approx_distinct.

Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:

I suppose this would be somewhere between your Option 1 and Option 2.

i definitely defer to @alamb though.

@realno
Copy link
Contributor

realno commented Jan 19, 2022

@realno you can take this with a grain of salt as I am new to this.

My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature). I could be wrong but I believe datafusion had DISTINCT before approx_distinct.

Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:

I suppose this would be somewhere between your Option 1 and Option 2.

i definitely defer to @alamb though.

Thanks for the comments @matthewmturner . I am also new and wouldn't call myself database internal expert :) Yes we have all the functionality ready, the complication is what's the best/most efficient way to implement this. I definitely want to hear more opinions on this.

Do you think it worth having a approximation to unblock the perf benchmark work?

@alamb
Copy link
Contributor

alamb commented Jan 19, 2022

(we should probably bring this discussion into a new ticket, FWIW)

TLDR is I think median is quite complicated to get both correct and high performance.

The key problem is that you basically need to have buffered the entire input stream before you know what the output is

Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable.

To implement an exact median, you could implement an Aggregator like we have for other aggregates that buffers up all the input, sorts it, and produces the median value. It would have terrible memory consumption but is probably reasonably straightforward to implement using existing compute kernels as @matthewmturner suggests

If you wanted to try and reuse existing operators (e.g. LogicalPlan and ExecutionPlan), I think that would be tough.

The mapping of SQL --> LogicalPlan is formulaic but non trivial. All aggregate functions are treated the same (create the same LogicalPlan nodes, with different expr lists, so if we special case median with a special operator we'll have to sort out how to handle queries like SELECT sum(foo), median(foo) from bar GROUP BY baz which I think will be complicated

Another challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together)

                        ┌──────────┐                     
                        │          │                     
                        │Aggregator│                     
                        │ (Final)  │                     
                        │          │                     
                        │          │                     
                        └──────────┘                     
                              ▲                          
                              │                          
      ┌────────────────┬──────┴────────────────────┐     
      │                │                           │     
      │                │                           │     
      │                │                           │     
┌──────────┐     ┌──────────┐                ┌──────────┐
│          │     │          │                │          │
│Aggregator│     │Aggregator│                │Aggregator│
│(Partial) │     │(Partial) │         ...    │(Partial) │
│          │     │          │                │          │
│          │     │          │                │          │
└──────────┘     └──────────┘                └──────────┘
      ▲                ▲                           ▲     
      │                │                           │     
      └────────────────┴──────┬────────────────────┘     
                              │                          
                              │                          
                        ┌───────────┐                    
                        │           │                    
                        │Repartition│                    
                        │           │                    
                        │           │                    
                        └───────────┘                    
                              ▲                          
                              │                          
                              │                          
                        ┌───────────┐                    
                        │Input      │                    
                        │           │                    
                        └───────────┘                    

@jorgecarleitao
Copy link
Member

In principle we could aggregate the partials via a ListArray<i32, PrimitiveArray<T>>, and merge them by concatenating the results on a single PrimitiveArray<T> per group (and then compute the aggregate that requires all results in the finish).

But I agree that the majority of cases we want an online estimate.

@realno
Copy link
Contributor

realno commented Jan 19, 2022

Let me move the conversation to a new issue. I agree with @jorgecarleitao that we could introduce an API in addition to current Aggregator to provide the functionality - the current sort actually does exactly that. And to @alamb 's point performance and memory consumption would be an issue, we may need to handle spilling to disk. As said, I'll create a new issue for this.

For the sake of this task, I suggest we close with the approximation version, what do you think? @matthewmturner are you comfortable using the approximate version for your benchmark work?

@matthewmturner
Copy link
Contributor Author

@realno practically, yes definitely okay with approximate version.

Within the context of performance benchmark my only concern is that we are able to produce the expected results. For example, i see clickhouse uses medianExact (https://h2oai.github.io/db-benchmark/). It would be nice if there was a way to test and see if this would give the expected result in the benchmarking context. But, at the end of the day approximate would still be an improvement so even if it didnt match that shouldnt be a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants