Add median, std, and corr functions #1486

matthewmturner · 2021-12-26T04:00:35Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In the context of adding datafusion to db-benchmark (#147) there are some advanced group by queries that are benchmarked which require median, standard deviation, and correlation functions which datafusion does not currently provide out of the box.

Describe the solution you'd like

Builtin support for median, standard deviation, and correlation functions.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

realno · 2022-01-04T19:55:39Z

I am interested in helping with this - I am very curious about the result of the db-benchmark as well and want to help. Does it make sense to create separate issues for the functions? I started looking into stddev and maybe somebody else want to work on the others. @matthewmturner @houqp

matthewmturner · 2022-01-04T20:00:03Z

@realno happy to have the help :) i actually just updated my PR to get datafusion included in db-benchmark with the current features. hopefully getting close.

separate issues per function sounds good. we can just use this as a tracker issue - ill add a task list to it.

realno · 2022-01-06T23:06:11Z

PR for standard deviation is up for review: #1525

Please let me know who I should add as reviewer.

realno · 2022-01-19T03:12:02Z

I have a update/question for this issue: stddev and corr have been merged few days ago. I took a look at median today and had an initial idea how to implement it. Now the problem is the implementation might be somewhat controversial.

There are some operations needed to calculate median: 1. sort 2. count 3. get nth value. I thought about few options:

Implement a new function/operator - the function will look like sort, we may need to add a new type of expression for it. It feels to be a lot of overhead just for median.

Pros: No mess around plan builder and query parsing logic
Cons: Need to add a specific expression just for median; there may be some duplicate logic regarding sort and count

Reuse some of the existing code through rewriting the logic plan. If we can have sort to keep the total number of record, we can add a new function to rewrite the logical plan to something like FIND(n) | SORT(col). But currently there is no such behavior in the code base.

Pros: Better code reuse; May potentially provide a way to handle compound (high-order) functions, e.g. Function1 -> FunctionA(FunctionB(col))
Cons: Need to modify logic for building plans;

Use approximate algorithms like KLL or t-digest. This way it can fit in existing aggregator API, but the result will be an approximation. There is already a PR Add approx_percentile_cont() aggregation function #1539, we can help merge that then implement median as quantile(0.5).

Pros: Easy to implement; Using existing aggregator API, Much better performance
Cons: The result is an approximation

My preference is in the order of 3 > 2 >1. I'd like to see more opinions before moving forward.

@matthewmturner @alamb

matthewmturner · 2022-01-19T17:13:24Z

@realno you can take this with a grain of salt as I am new to this.

My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature). I could be wrong but I believe datafusion had DISTINCT before approx_distinct.

Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:

I suppose this would be somewhere between your Option 1 and Option 2.

i definitely defer to @alamb though.

realno · 2022-01-19T19:32:10Z

@realno you can take this with a grain of salt as I am new to this.

My thinking is that I would prefer to see the exact median implementation before having an approximate (i.e the approximate would be an add-on feature). I could be wrong but I believe datafusion had DISTINCT before approx_distinct.

Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:

sort: https://docs.rs/arrow/latest/arrow/compute/kernels/sort/fn.sort.html

length: https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.len

value: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value

I suppose this would be somewhere between your Option 1 and Option 2.

i definitely defer to @alamb though.

Thanks for the comments @matthewmturner . I am also new and wouldn't call myself database internal expert :) Yes we have all the functionality ready, the complication is what's the best/most efficient way to implement this. I definitely want to hear more opinions on this.

Do you think it worth having a approximation to unblock the perf benchmark work?

alamb · 2022-01-19T21:18:55Z

(we should probably bring this discussion into a new ticket, FWIW)

TLDR is I think median is quite complicated to get both correct and high performance.

The key problem is that you basically need to have buffered the entire input stream before you know what the output is

Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable.

To implement an exact median, you could implement an Aggregator like we have for other aggregates that buffers up all the input, sorts it, and produces the median value. It would have terrible memory consumption but is probably reasonably straightforward to implement using existing compute kernels as @matthewmturner suggests

If you wanted to try and reuse existing operators (e.g. LogicalPlan and ExecutionPlan), I think that would be tough.

The mapping of SQL --> LogicalPlan is formulaic but non trivial. All aggregate functions are treated the same (create the same LogicalPlan nodes, with different expr lists, so if we special case median with a special operator we'll have to sort out how to handle queries like SELECT sum(foo), median(foo) from bar GROUP BY baz which I think will be complicated

Another challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together)

                        ┌──────────┐                     
                        │          │                     
                        │Aggregator│                     
                        │ (Final)  │                     
                        │          │                     
                        │          │                     
                        └──────────┘                     
                              ▲                          
                              │                          
      ┌────────────────┬──────┴────────────────────┐     
      │                │                           │     
      │                │                           │     
      │                │                           │     
┌──────────┐     ┌──────────┐                ┌──────────┐
│          │     │          │                │          │
│Aggregator│     │Aggregator│                │Aggregator│
│(Partial) │     │(Partial) │         ...    │(Partial) │
│          │     │          │                │          │
│          │     │          │                │          │
└──────────┘     └──────────┘                └──────────┘
      ▲                ▲                           ▲     
      │                │                           │     
      └────────────────┴──────┬────────────────────┘     
                              │                          
                              │                          
                        ┌───────────┐                    
                        │           │                    
                        │Repartition│                    
                        │           │                    
                        │           │                    
                        └───────────┘                    
                              ▲                          
                              │                          
                              │                          
                        ┌───────────┐                    
                        │Input      │                    
                        │           │                    
                        └───────────┘

jorgecarleitao · 2022-01-19T21:30:09Z

In principle we could aggregate the partials via a ListArray<i32, PrimitiveArray<T>>, and merge them by concatenating the results on a single PrimitiveArray<T> per group (and then compute the aggregate that requires all results in the finish).

But I agree that the majority of cases we want an online estimate.

realno · 2022-01-19T23:54:42Z

Let me move the conversation to a new issue. I agree with @jorgecarleitao that we could introduce an API in addition to current Aggregator to provide the functionality - the current sort actually does exactly that. And to @alamb 's point performance and memory consumption would be an issue, we may need to handle spilling to disk. As said, I'll create a new issue for this.

For the sake of this task, I suggest we close with the approximation version, what do you think? @matthewmturner are you comfortable using the approximate version for your benchmark work?

matthewmturner · 2022-01-20T00:11:26Z

@realno practically, yes definitely okay with approximate version.

Within the context of performance benchmark my only concern is that we are able to produce the expected results. For example, i see clickhouse uses medianExact (https://h2oai.github.io/db-benchmark/). It would be nice if there was a way to test and see if this would give the expected result in the benchmarking context. But, at the end of the day approximate would still be an improvement so even if it didnt match that shouldnt be a blocker.

matthewmturner added the enhancement New feature or request label Dec 26, 2021

houqp added the help wanted Extra attention is needed label Dec 26, 2021

realno mentioned this issue Jan 6, 2022

Add stddev and variance #1525

Merged

realno mentioned this issue Jan 14, 2022

Add corr aggregate function #1561

Merged

realno mentioned this issue Feb 2, 2022

Add approx_median() aggregate function #1729

Merged

alamb closed this as completed in #1729 Feb 9, 2022

andygrove mentioned this issue Jul 16, 2022

Add support for exact median aggregate function #2925

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add median, std, and corr functions #1486

Add median, std, and corr functions #1486

matthewmturner commented Dec 26, 2021 •

edited

Loading

realno commented Jan 4, 2022

matthewmturner commented Jan 4, 2022

realno commented Jan 6, 2022

realno commented Jan 19, 2022 •

edited

Loading

matthewmturner commented Jan 19, 2022

realno commented Jan 19, 2022

alamb commented Jan 19, 2022 •

edited

Loading

jorgecarleitao commented Jan 19, 2022

realno commented Jan 19, 2022

matthewmturner commented Jan 20, 2022

Add median, std, and corr functions #1486

Add median, std, and corr functions #1486

Comments

matthewmturner commented Dec 26, 2021 • edited Loading

realno commented Jan 4, 2022

matthewmturner commented Jan 4, 2022

realno commented Jan 6, 2022

realno commented Jan 19, 2022 • edited Loading

matthewmturner commented Jan 19, 2022

realno commented Jan 19, 2022

alamb commented Jan 19, 2022 • edited Loading

jorgecarleitao commented Jan 19, 2022

realno commented Jan 19, 2022

matthewmturner commented Jan 20, 2022

matthewmturner commented Dec 26, 2021 •

edited

Loading

realno commented Jan 19, 2022 •

edited

Loading

alamb commented Jan 19, 2022 •

edited

Loading