
[SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero #29983

Closed
leanken-zz wants to merge 14 commits

Conversation

leanken-zz (Contributor)

@leanken-zz leanken-zz commented Oct 9, 2020

What changes were proposed in this pull request?

As SPARK-13860 states, TPC-DS query 39 returns wrong results in Spark SQL. The root cause is that when stddev_samp is applied to a single-element set, the TPC-DS answer set expects null, whereas Spark SQL returns Double.NaN, which produces the wrong result.

This PR adds a legacy config to fall back to the NaN behavior, and returns null by default to align with the TPC-DS standard.

Why are the changes needed?

SQL correctness issue.

Does this PR introduce any user-facing change?

Yes. See the sql-migration-guide entry:

In Spark 3.1, statistical aggregation functions, including std, stddev, stddev_samp, variance, var_samp, skewness, kurtosis, covar_samp, and corr, return NULL instead of Double.NaN when a divide-by-zero occurs during expression evaluation, for example when stddev_samp is applied to a single-element set. In Spark version 3.0 and earlier, they return Double.NaN in such cases. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.statisticalAggregate to true.
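To illustrate the change, a minimal sketch (assuming an active SparkSession named spark, with spark.implicits._ imported):

  import org.apache.spark.sql.functions.stddev_samp

  val df = Seq(1.0).toDF("a")  // a single-element set

  // Spark 3.1 default: the divide-by-zero inside the sample variance yields null.
  df.agg(stddev_samp($"a")).show()
  // +--------------+
  // |stddev_samp(a)|
  // +--------------+
  // |          null|
  // +--------------+

  // Restore the pre-3.1 behavior (Double.NaN):
  spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
  df.agg(stddev_samp($"a")).show()  // now prints NaN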

How was this patch tested?

Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and the legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R test cases to reflect the new default null-returning behavior.

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34184/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34184/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129579 has finished for PR 29983 at commit 1e7894d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you for your contribution, @leanken.
BTW, could you check the UT failure? It looks relevant.

org.apache.spark.sql.hive.execution.WindowQuerySuite.windowing.q -- 15. testExpressions

@dongjoon-hyun (Member)

cc @maropu

@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
    buildConf("spark.sql.legacy.centralMomentAgg.enabled")
Member

Could you update the migration guide, too?

Member

Looks like we don't need the .enabled suffix, to be consistent with the other legacy configs.

Member

Also, could you move this config close to the other legacy configs?

@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
Member

nit: LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR -> LEGACY_CENTRAL_MOMENT_AGG

@maropu (Member)

maropu commented Oct 9, 2020

Thanks for the cc, @dongjoon-hyun!

      .internal()
      .doc("When set to true, stddev_samp and var_samp will return Double.NaN, " +
        "if applied to a set with a single element. Otherwise, will return 0.0, " +
        "which is aligned with TPCDS standard.")
Member

I think we don't need to describe "which is aligned with TPCDS standard" here in the user-facing documentation.

@@ -456,25 +456,31 @@ class DataFrameAggregateSuite extends QueryTest
}

test("zero moments") {
Member

How about organizing the tests like this? (I think it'd be better not to change the existing tests any more than necessary):

  test("zero moments") {
    withSQLConf(SQLConf.LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR.key -> "true") {
      // Don't touch the existing tests
      val input = Seq((1, 2)).toDF("a", "b")
      checkAnswer(
        input.agg(stddev($"a"), stddev_samp($"a"), stddev_pop($"a"), variance($"a"),
          var_samp($"a"), var_pop($"a"), skewness($"a"), kurtosis($"a")),
        Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
          Double.NaN, Double.NaN))

      checkAnswer(
        input.agg(
          expr("stddev(a)"),
          expr("stddev_samp(a)"),
          expr("stddev_pop(a)"),
          expr("variance(a)"),
          expr("var_samp(a)"),
          expr("var_pop(a)"),
          expr("skewness(a)"),
          expr("kurtosis(a)")),
        Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
          Double.NaN, Double.NaN))
    }
  }

  test("SPARK-13860: xxxx") {
    // Writes tests for the new behaviour
  }

@HyukjinKwon HyukjinKwon changed the title [SPARK-13860][SQL] change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. [SPARK-13860][SQL] Change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. Oct 9, 2020
@maropu (Member)

maropu commented Oct 9, 2020

Do we really need to return 0.0 in this case? It looks like PostgreSQL/MySQL return null instead:

mysql> create table t (v float8);
mysql> insert into t values (1.0);
mysql> SELECT stddev_samp(v) FROM t;
+----------------+
| stddev_samp(v) |
+----------------+
|           NULL |
+----------------+
1 row in set (0.00 sec)


postgres=# create table t (v float8);
postgres=# insert into t values (1.0);
INSERT 0 1
postgres=# \pset null 'null'
Null display is "null".
postgres=# SELECT stddev_samp(v) FROM t;
 stddev_samp 
-------------
        null
(1 row)

@leanken-zz (Contributor, Author)

Thank you for your contribution, @leanken.
BTW, could you check the UT failure? It looks relevant.

org.apache.spark.sql.hive.execution.WindowQuerySuite.windowing.q -- 15. testExpressions

Sure.

@leanken-zz (Contributor, Author)

Do we really need to return 0.0 in this case? It looks like PostgreSQL/MySQL return null instead (examples above).

Let me check more docs to see whether returning null matches the TPC-DS answer for Q39. I'll reply later.

@leanken-zz leanken-zz changed the title [SPARK-13860][SQL] Change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. [SPARK-13860][SQL] Change CentralMomentAgg to return null instead of Double.NaN when divideByZero Oct 10, 2020
@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34213/

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34213/

@SparkQA

SparkQA commented Oct 10, 2020

Test build #129610 has finished for PR 29983 at commit a853132.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34226/

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34226/

@SparkQA

SparkQA commented Oct 10, 2020

Test build #129622 has finished for PR 29983 at commit 370cad1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34239/

@SparkQA

SparkQA commented Oct 11, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34239/

Comment on lines 2348 to 2350
.doc("When set to true, central moment aggregation will return Double.NaN " +
"if divide by zero occurred during calculation. " +
"Otherwise, it will return null")
Member

Can we describe the Spark version of this legacy behavior? E.g., in which versions it returns NaN.

Contributor (Author)

Sure. How about adding "before version 3.1.0, it returns NaN by default"?
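For reference, a hedged sketch of how the config definition could read after this feedback (the merged change ultimately used the key spark.sql.legacy.statisticalAggregate; the exact val name and doc wording here are illustrative):

  val LEGACY_STATISTICAL_AGGREGATE =
    buildConf("spark.sql.legacy.statisticalAggregate")
      .internal()
      .doc("When set to true, statistical aggregate functions return Double.NaN " +
        "if a divide-by-zero occurs during expression evaluation. Otherwise, they " +
        "return null. Before version 3.1.0, Spark returns NaN in such cases.")
      .version("3.1.0")
      .booleanConf
      .createWithDefault(false)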

@SparkQA

SparkQA commented Oct 11, 2020

Test build #129635 has finished for PR 29983 at commit a7a6eac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@leanken-zz (Contributor, Author)

@cloud-fan

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34255/

@HyukjinKwon (Member) left a comment


Seems plausible to me.

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34255/

Change-Id: Ia173d98cde3dee0e9f36dc1e1121879318981590
@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34302/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34302/

Change-Id: I463c1f9696eaf975f0333d6120f749263fbc1592
@@ -141,17 +141,17 @@ struct<var_samp(CAST(CAST(udf(ansi_cast(ansi_cast(b as decimal(38,0)) as string)
-- !query
SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
-- !query schema
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double), true) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
Contributor

The legacy config is internal, and all the functions in one query will be either all legacy or all non-legacy, so I think we don't need to display the legacy flag value. We can override stringArgs in these functions (the base classes) to exclude the legacy flag.

Contributor

This also avoids all the changes to the explain golden files.

Contributor (Author)

Done.
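For illustration, a self-contained sketch of the stringArgs trick suggested above (toy classes, not Spark's actual ones; Catalyst's TreeNode.stringArgs defaults to all constructor arguments via productIterator):

  trait TreeNodeLike extends Product {
    // Default: every constructor argument appears in the node's string form.
    protected def stringArgs: Iterator[Any] = productIterator
    def simpleString: String =
      s"${getClass.getSimpleName.toLowerCase}(${stringArgs.mkString(", ")})"
  }

  case class VarPop(child: String, nullOnDivideByZero: Boolean) extends TreeNodeLike {
    // Exclude the internal legacy flag so plan strings (and the explain
    // golden files) stay unchanged: "varpop(a)" instead of "varpop(a, true)".
    override protected def stringArgs: Iterator[Any] = Iterator(child)
  }

  // VarPop("a", nullOnDivideByZero = true).simpleString == "varpop(a)"

This keeps golden files like the schema diff shown above stable, since the boolean never reaches the plan string.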

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34303/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34303/

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129696 has finished for PR 29983 at commit dc8efb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129697 has finished for PR 29983 at commit 6eee3c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Ib36d81e7a89b2b6d7867b2448b9b2b599c17e5bb
@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34323/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34323/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129717 has finished for PR 29983 at commit 084c3fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@leanken-zz (Contributor, Author)

@cloud-fan if there are no further comments: the tests have passed and this is ready to merge.

Change-Id: Idc061ac89bb65f1c6a0f20517b2489aaa903a7eb
@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34340/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34340/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129734 has finished for PR 29983 at commit ddc522c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan cloud-fan closed this in dc697a8 Oct 13, 2020
aviatesk added a commit to aviatesk/deequ that referenced this pull request May 24, 2021
AFAIU the only required update is for <apache/spark#29983>.
In order to be consistent with the previous behavior and pass the
existing test suite, this PR is essentially equivalent to setting
`spark.sql.legacy.statisticalAggregate` to `true`.

Now the code is incompatible with spark-2.x and spark-3.0, so I'd
recommend supporting only Spark 3.1 and higher and Scala 2.12
from now on.
dariobig pushed a commit to fergonp/deequ that referenced this pull request Sep 2, 2022