Implement `kurtosis_pop` UDAF #12273

goldmedal · 2024-08-31T13:10:38Z

Which issue does this PR close?

Closes #12251.

Rationale for this change

This function follows the algorithm of the DuckDB implementation. The behavior is the same, but the results differ slightly in the last digits of the double value.

I believe this is also part of #12250.

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

goldmedal · 2024-08-31T13:57:54Z

datafusion/sqllogictest/test_files/aggregate.slt

+query R
+SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
+----
+0.194323231917


I tried this function with the CLI

DataFusion CLI v41.0.0
> SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
+-----------------------+
| kurtosis_pop(tab.col) |
+-----------------------+
| 0.19432323191699075   |
+-----------------------+

I'm not sure, but I guess sqllogictest rounds the result.

Yes, sqllogictest rounds the result, according to https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

floating point values are rounded to the scale of "12",
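The rounding described above can be approximated in a few lines. This is only a sketch of the idea, assuming "scale 12" means twelve decimal places with trailing zeros dropped; `round_to_scale_12` is a hypothetical helper, not the code sqllogictest actually runs.

```rust
/// Round to 12 decimal places and drop trailing zeros.
/// Hypothetical helper for illustration; not taken from sqllogictest.
fn round_to_scale_12(value: f64) -> String {
    let fixed = format!("{value:.12}");
    fixed
        .trim_end_matches('0')
        .trim_end_matches('.')
        .to_string()
}

fn main() {
    // The CLI value from this thread.
    println!("{}", round_to_scale_12(0.19432323191699075)); // 0.194323231917
    // DataFusion and DuckDB differ only past the 12th decimal place,
    // so both values produce the same rounded text.
    println!("{}", round_to_scale_12(-1.153061224489787)); // -1.15306122449
    println!("{}", round_to_scale_12(-1.153061224489769)); // -1.15306122449
}
```

Under that assumption, last-digit differences between engines do not show up in the `.slt` expectations.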

goldmedal · 2024-08-31T14:17:40Z

datafusion/sqllogictest/test_files/aggregate.slt

+# The result is -1.153061224489787 actually
+query R
+SELECT kurtosis_pop(col) FROM VALUES (1), (2), (3), (2), (1) as tab(col);
+----
+-1.15306122449


This result differs from DuckDB's in the last digits (-1.153061224489787 here vs. -1.153061224489769 in DuckDB), but I'm not sure why.

D SELECT kurtosis_pop(col) FROM VALUES (1), (2), (3), (2), (1) as tab(col);
┌────────────────────┐
│ kurtosis_pop(col)  │
│       double       │
├────────────────────┤
│ -1.153061224489769 │
└────────────────────┘

goldmedal · 2024-08-31T15:01:35Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+        let count_64 = 1_f64 / self.count as f64;
+        let m4 = count_64
+            * (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64
+                + 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2)
+                - 3.0 * self.sum.powi(4) * count_64.powi(3));


I followed the DuckDB way to get the divisor here.

https://github.com/duckdb/duckdb/blob/a706958d15a6fc7fd47d65d22de7deac63613458/src/core_functions/aggregate/distributive/kurtosis.cpp#L69

The result is the same as DuckDB's, but it differs from ClickHouse's.

I ran some tests to compare the behavior of DuckDB and ClickHouse:

DuckDB

D SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
┌─────────────────────┐
│  kurtosis_pop(col)  │
│       double        │
├─────────────────────┤
│ 0.19432323191699075 │
└─────────────────────┘

ClickHouse

:) SELECT kurtPop(value) FROM (SELECT arrayJoin([1, 10, 100, 10, 1]) AS value);

SELECT kurtPop(value)
FROM
(
    SELECT arrayJoin([1, 10, 100, 10, 1]) AS value
)

Query id: abdea377-40b1-4437-a87a-4814f11cc866

   ┌─────kurtPop(value)─┐
1. │ 3.1943232319169903 │
   └────────────────────┘

1 row in set. Elapsed: 0.002 sec.

The difference arises because DuckDB's kurtosis_pop uses Fisher's definition, which yields the excess kurtosis (the value minus 3), whereas ClickHouse returns the population kurtosis directly, without subtracting 3.
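To make the two conventions concrete, here is a minimal sketch that computes both from the definitional central moments (a two-pass calculation, not the power-sum form used in this PR). The function name `kurtosis_both` is hypothetical.

```rust
/// Population kurtosis under both conventions, computed from the
/// definitional central moments m2 and m4.
/// Returns (pearson, fisher), or None for empty input or zero variance.
fn kurtosis_both(values: &[f64]) -> Option<(f64, f64)> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    // Second and fourth central moments with the population divisor n.
    let m2 = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let m4 = values.iter().map(|v| (v - mean).powi(4)).sum::<f64>() / n;
    if m2 == 0.0 {
        return None;
    }
    let pearson = m4 / (m2 * m2); // the convention ClickHouse's kurtPop reports
    let fisher = pearson - 3.0; // excess kurtosis, as DuckDB's kurtosis_pop reports
    Some((pearson, fisher))
}

fn main() {
    let (pearson, fisher) = kurtosis_both(&[1.0, 10.0, 100.0, 10.0, 1.0]).unwrap();
    println!("pearson = {pearson}, fisher = {fisher}");
}
```

For the five values used in the thread, the two numbers should differ by exactly the constant 3, up to floating-point rounding.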

However, if we change the code as follows:

Suggested change

-        let count_64 = 1_f64 / self.count as f64;
-        let m4 = count_64
-            * (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64
-                + 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2)
-                - 3.0 * self.sum.powi(4) * count_64.powi(3));
+        let count_64 = self.count as f64;
+        let m4 =
+            (self.sum_four - 4.0 * self.sum_cub * self.sum / count_64
+                + 6.0 * self.sum_sqr * self.sum.powi(2) / count_64.powi(2)
+                - 3.0 * self.sum.powi(4) / count_64.powi(3)) / count_64;

the result matches ClickHouse's once 3 is subtracted: 3.1943232319169903 - 3 = 0.1943232319169903
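The two forms are algebraically identical and differ only in floating-point rounding. The sketch below evaluates both on the same power sums. The field names mirror the diff, but the `PowerSums` type, the `m2` formula, and the helper names are assumptions made for illustration, and the guards for empty input or zero variance are omitted.

```rust
/// Running power sums, named after the fields in the diff above.
struct PowerSums {
    count: u64,
    sum: f64,
    sum_sqr: f64,
    sum_cub: f64,
    sum_four: f64,
}

impl PowerSums {
    fn from_values(values: &[f64]) -> Self {
        let mut s = PowerSums { count: 0, sum: 0.0, sum_sqr: 0.0, sum_cub: 0.0, sum_four: 0.0 };
        for &v in values {
            s.count += 1;
            s.sum += v;
            s.sum_sqr += v.powi(2);
            s.sum_cub += v.powi(3);
            s.sum_four += v.powi(4);
        }
        s
    }

    /// Second central moment (assumed form; not shown in the excerpt).
    fn m2(&self) -> f64 {
        let n = self.count as f64;
        (self.sum_sqr - self.sum.powi(2) / n) / n
    }

    /// Fourth central moment, multiplying by the reciprocal 1/n.
    fn m4_reciprocal(&self) -> f64 {
        let r = 1_f64 / self.count as f64;
        r * (self.sum_four - 4.0 * self.sum_cub * self.sum * r
            + 6.0 * self.sum_sqr * self.sum.powi(2) * r.powi(2)
            - 3.0 * self.sum.powi(4) * r.powi(3))
    }

    /// Fourth central moment, dividing by n directly.
    fn m4_division(&self) -> f64 {
        let n = self.count as f64;
        (self.sum_four - 4.0 * self.sum_cub * self.sum / n
            + 6.0 * self.sum_sqr * self.sum.powi(2) / n.powi(2)
            - 3.0 * self.sum.powi(4) / n.powi(3)) / n
    }
}

/// Excess (Fisher) kurtosis from the two central moments.
fn excess_kurtosis(m4: f64, m2: f64) -> f64 {
    m4 / (m2 * m2) - 3.0
}

fn main() {
    let s = PowerSums::from_values(&[1.0, 10.0, 100.0, 10.0, 1.0]);
    let a = excess_kurtosis(s.m4_reciprocal(), s.m2());
    let b = excess_kurtosis(s.m4_division(), s.m2());
    // Any gap between the two is pure rounding noise.
    println!("reciprocal: {a}\ndivision:   {b}\ndifference: {:e}", (a - b).abs());
}
```

Any gap between the two outputs sits far below the 12-decimal scale that the `.slt` tests compare at.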

We could follow DuckDB in this case

2010YOUY01

Thank you, the implementation looks good to me.
I think it's a good idea to follow DuckDB's behavior.

One remaining task is to update the function doc as well: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md

2010YOUY01 · 2024-09-01T03:24:53Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+    }
+}
+
+impl Accumulator for KurtosisPopAccumulator {


It would be great to add a link to the algorithm (something like Wikipedia or DuckDB's implementation).

Thanks for the reminder. I have added the doc for KurtosisPopAccumulator and updated the function doc.

jayzhan211 · 2024-09-01T08:08:48Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+impl KurtosisPopFunction {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::numeric(1, Volatility::Immutable),


I think user_defined with Float64 is more suitable here.

fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
    Ok(vec![DataType::Float64])
}

@goldmedal If the coercion is handled before the function is called, the arguments are coerced to f64, so we only need to deal with f64.

You can take #12275 as a reference: we can use the signature coercible(vec![Float64]) if this function expects any type that is coercible to f64.

It looks great! Thanks

jayzhan211 · 2024-09-01T07:41:54Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+    }
+
+    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
+        if !arg_types[0].is_null() && !arg_types[0].is_numeric() {


I guess we don't need this additional check.

Indeed, we have the mechanism to coerce the type implicitly. The case will work after this check is removed.

query R
SELECT kurtosis_pop(col) FROM VALUES ('1'), ('10'), ('100'), ('10'), ('1') as tab(col);
----
0.194323231917

jayzhan211 · 2024-09-01T07:44:10Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+
+impl Accumulator for KurtosisPopAccumulator {
+    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
+        let values = &cast(&values[0], &DataType::Float64)?;


Suggested change

-        let values = &cast(&values[0], &DataType::Float64)?;
+        let array = values[0].as_primitive::<Float64Type>();
+        for value in array.iter().flatten() {
+            self.count += 1;
+            self.sum += value;
+            self.sum_sqr += value.powi(2);
+            self.sum_cub += value.powi(3);
+            self.sum_four += value.powi(4);
+        }

You can also use as_float64_array or as_primitive_opt if you prefer returning a Result over panicking.

It looks good. I prefer as_float64_array. However, I don't think the &cast can be removed: we need to cast arrays of other types to a Float64 array first, and then downcast to Float64Array with as_float64_array.
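The accumulation loop in the suggestion can be modeled without Arrow. This is a minimal sketch in which a slice of `Option<f64>` stands in for a nullable Float64 array. `KurtosisState` and its `update_batch` are hypothetical stand-ins, not the `KurtosisPopAccumulator` from this PR.

```rust
/// Hypothetical stand-in for the accumulator state; Arrow types are
/// replaced by plain slices so the sketch has no dependencies.
#[derive(Debug, Default)]
struct KurtosisState {
    count: u64,
    sum: f64,
    sum_sqr: f64,
    sum_cub: f64,
    sum_four: f64,
}

impl KurtosisState {
    /// Fold one batch into the running power sums.
    /// `None` models a null entry; `flatten()` skips it.
    fn update_batch(&mut self, values: &[Option<f64>]) {
        for value in values.iter().copied().flatten() {
            self.count += 1;
            self.sum += value;
            self.sum_sqr += value.powi(2);
            self.sum_cub += value.powi(3);
            self.sum_four += value.powi(4);
        }
    }
}

fn main() {
    let mut state = KurtosisState::default();
    // Two batches with nulls mixed in; only non-null values are counted.
    state.update_batch(&[Some(1.0), None, Some(10.0), Some(100.0)]);
    state.update_batch(&[Some(10.0), None, Some(1.0)]);
    println!("{state:?}");
}
```

Because the state is just five running totals, feeding the data in several batches gives the same sums as feeding it all at once.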

jayzhan211 · 2024-09-01T07:49:22Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+        let count_64 = 1_f64 / self.count as f64;
+        let m4 = count_64
+            * (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64
+                + 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2)
+                - 3.0 * self.sum.powi(4) * count_64.powi(3));


We could follow DuckDB in this case

jayzhan211 · 2024-09-04T02:20:18Z

datafusion/functions-aggregate/src/kurtosis_pop.rs

+
+impl Accumulator for KurtosisPopAccumulator {
+    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
+        let values = &cast(&values[0], &DataType::Float64)?;


I think we don't need the cast here? 🤔
The coercion is handled in Signature::Coercible

Amazing! Thanks for the suggestion.

jayzhan211

👍

jayzhan211 · 2024-09-04T07:19:34Z

Thanks @goldmedal @2010YOUY01

goldmedal · 2024-09-04T07:20:59Z

Thanks @jayzhan211 @2010YOUY01

alamb · 2024-09-25T15:52:25Z

I filed #12625 to propose moving kurtosis_pop

goldmedal added 2 commits August 31, 2024 21:07

implement kurtosis_pop udaf

2a9fc03

add tests

13b00c5

github-actions bot added the sqllogictest, proto, and functions labels Aug 31, 2024

goldmedal commented Aug 31, 2024

goldmedal added 4 commits August 31, 2024 22:19

add empty end line

49d2dc6

fix MSRV check

901d38b

fix the null input and enhance tests

2a49331

refactor the aggregation

31e48c9

goldmedal commented Aug 31, 2024

goldmedal marked this pull request as ready for review August 31, 2024 15:10

2010YOUY01 approved these changes Sep 1, 2024

jayzhan211 reviewed Sep 1, 2024

goldmedal added 2 commits September 1, 2024 23:12

address the review comments

709cc0f

add the doc for kurtois_pop

dffe838

github-actions bot added the documentation label Sep 1, 2024

goldmedal added 3 commits September 2, 2024 00:06

fix the doc style

d51d16e

Merge branch 'main' into feature/12251-kurtosis_pop

ff3863d

use coercible signature

4f06d04

jayzhan211 reviewed Sep 4, 2024

remove unused cast

b56de3f

jayzhan211 approved these changes Sep 4, 2024

jayzhan211 merged commit 5ff5a6c into apache:main Sep 4, 2024
25 checks passed

goldmedal deleted the feature/12251-kurtosis_pop branch September 4, 2024 07:20

jatin510 mentioned this pull request Sep 25, 2024

implement kurtosis udaf #12613

Closed

alamb mentioned this pull request Sep 25, 2024

Move kurtosis_pop to datafusion-functions-extra and out of core #12625

Closed

jatin510 mentioned this pull request Sep 25, 2024

kurtosis udaf datafusion-contrib/datafusion-functions-extra#4

Merged
