
[SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero #29983

Closed
leanken-zz wants to merge 14 commits

Conversation

leanken-zz (Contributor)

@leanken-zz leanken-zz commented Oct 9, 2020

What changes were proposed in this pull request?

As SPARK-13860 states, TPC-DS query 39 returns wrong results in Spark SQL. The root cause is that when stddev_samp is applied to a single-element set, the TPC-DS answer set expects null, whereas Spark SQL returns Double.NaN, which produces the wrong result.

This PR adds a legacy config to fall back to the NaN behavior, and returns null by default to align with the TPC-DS standard.

Why are the changes needed?

SQL correctness issue.

Does this PR introduce any user-facing change?

Yes. See the sql-migration-guide entry:

In Spark 3.1, statistical aggregation functions, including std, stddev, stddev_samp, variance, var_samp, skewness, kurtosis, covar_samp, and corr, return NULL instead of Double.NaN when a divide-by-zero occurs during expression evaluation, for example when stddev_samp is applied to a single-element set. In Spark version 3.0 and earlier, they return Double.NaN in such cases. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.statisticalAggregate to true.
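To illustrate the change, a minimal sketch (assuming an active SparkSession named spark, with spark.implicits._ imported):

  import org.apache.spark.sql.functions.stddev_samp

  val df = Seq(1.0).toDF("a")  // a single-element set

  // Spark 3.1 default: the divide-by-zero inside the sample variance yields null.
  df.agg(stddev_samp($"a")).show()
  // +--------------+
  // |stddev_samp(a)|
  // +--------------+
  // |          null|
  // +--------------+

  // Restore the pre-3.1 behavior (Double.NaN):
  spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
  df.agg(stddev_samp($"a")).show()  // now prints NaN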

How was this patch tested?

Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and the legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R test cases to reflect the new default null-returning behavior.

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34184/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34184/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129579 has finished for PR 29983 at commit 1e7894d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you for your contribution, @leanken.
BTW, could you check the UT failure? It looks relevant.

org.apache.spark.sql.hive.execution.WindowQuerySuite.windowing.q -- 15. testExpressions

@dongjoon-hyun (Member)

cc @maropu

@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
    buildConf("spark.sql.legacy.centralMomentAgg.enabled")
Member

Could you update the migration guide, too?

Member

Looks like we don't need the .enabled suffix, to be consistent with the other legacy configs.

Member

Also, could you move this config close to the other legacy configs?

@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
Member

nit: LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR -> LEGACY_CENTRAL_MOMENT_AGG

@maropu (Member)

maropu commented Oct 9, 2020

Thanks for the cc, @dongjoon-hyun!

      .internal()
      .doc("When set to true, stddev_samp and var_samp will return Double.NaN, " +
        "if applied to a set with a single element. Otherwise, will return 0.0, " +
        "which is aligned with TPCDS standard.")
Member

I think we don't need to describe "which is aligned with TPCDS standard" here in the user-facing documentation.

@@ -456,25 +456,31 @@ class DataFrameAggregateSuite extends QueryTest
}

test("zero moments") {
Member

How about organizing the tests like this? (I think it'd be better not to change the existing tests any more than necessary):

  test("zero moments") {
    withSQLConf(SQLConf.LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR.key -> "true") {
      // Don't touch the existing tests
      val input = Seq((1, 2)).toDF("a", "b")
      checkAnswer(
        input.agg(stddev($"a"), stddev_samp($"a"), stddev_pop($"a"), variance($"a"),
          var_samp($"a"), var_pop($"a"), skewness($"a"), kurtosis($"a")),
        Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
          Double.NaN, Double.NaN))

      checkAnswer(
        input.agg(
          expr("stddev(a)"),
          expr("stddev_samp(a)"),
          expr("stddev_pop(a)"),
          expr("variance(a)"),
          expr("var_samp(a)"),
          expr("var_pop(a)"),
          expr("skewness(a)"),
          expr("kurtosis(a)")),
        Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
          Double.NaN, Double.NaN))
    }
  }

  test("SPARK-13860: xxxx") {
    // Writes tests for the new behaviour
  }

@HyukjinKwon HyukjinKwon changed the title [SPARK-13860][SQL] change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. [SPARK-13860][SQL] Change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. Oct 9, 2020
@maropu (Member)

maropu commented Oct 9, 2020

Do we really need to return 0.0 in this case? It looks like PostgreSQL/MySQL return null instead:

mysql> create table t (v float8);
mysql> insert into t values (1.0);
mysql> SELECT stddev_samp(v) FROM t;
+----------------+
| stddev_samp(v) |
+----------------+
|           NULL |
+----------------+
1 row in set (0.00 sec)


postgres=# create table t (v float8);
postgres=# insert into t values (1.0);
INSERT 0 1
postgres=# \pset null 'null'
Null display is "null".
postgres=# SELECT stddev_samp(v) FROM t;
 stddev_samp 
-------------
        null
(1 row)

@leanken-zz (Contributor, Author)

Thank you for your contribution, @leanken.
BTW, could you check the UT failure? It looks relevant.

org.apache.spark.sql.hive.execution.WindowQuerySuite.windowing.q -- 15. testExpressions

Sure.

@leanken-zz (Contributor, Author)

Do we really need to return 0.0 in this case? It looks like PostgreSQL/MySQL return null instead (examples above).

Let me check more docs to see whether returning null matches the TPC-DS answer for Q39. I'll reply later.

@leanken-zz leanken-zz changed the title [SPARK-13860][SQL] Change stddev_samp and var_samp to return 0.0 instead of Double.NaN to align with TPCDS standard. [SPARK-13860][SQL] Change CentralMomentAgg to return null instead of Double.NaN when divideByZero Oct 10, 2020
@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34213/

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34213/

@SparkQA

SparkQA commented Oct 10, 2020

Test build #129610 has finished for PR 29983 at commit a853132.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34226/

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34226/

@SparkQA

SparkQA commented Oct 10, 2020

Test build #129622 has finished for PR 29983 at commit 370cad1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34239/

@SparkQA

SparkQA commented Oct 11, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34239/

Comment on lines 2348 to 2350
.doc("When set to true, central moment aggregation will return Double.NaN " +
"if divide by zero occurred during calculation. " +
"Otherwise, it will return null")
Member

Can we describe the Spark version of this legacy behavior? E.g., in which versions it returns NaN.

Contributor (Author)

Sure. How about adding "before version 3.1.0, it returns NaN by default"?
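For reference, a hedged sketch of how the config definition could read after this feedback (the merged change ultimately used the key spark.sql.legacy.statisticalAggregate; the exact val name and doc wording here are illustrative):

  val LEGACY_STATISTICAL_AGGREGATE =
    buildConf("spark.sql.legacy.statisticalAggregate")
      .internal()
      .doc("When set to true, statistical aggregate functions return Double.NaN " +
        "if a divide-by-zero occurs during expression evaluation. Otherwise, they " +
        "return null. Before version 3.1.0, Spark returns NaN in such cases.")
      .version("3.1.0")
      .booleanConf
      .createWithDefault(false)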

@SparkQA

SparkQA commented Oct 11, 2020

Test build #129635 has finished for PR 29983 at commit a7a6eac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@leanken-zz (Contributor, Author)

@cloud-fan

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34255/

@HyukjinKwon (Member) left a comment


Seems plausible to me.

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34255/

Change-Id: Ia173d98cde3dee0e9f36dc1e1121879318981590
@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34302/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34302/

Change-Id: I463c1f9696eaf975f0333d6120f749263fbc1592
@@ -141,17 +141,17 @@ struct<var_samp(CAST(CAST(udf(ansi_cast(ansi_cast(b as decimal(38,0)) as string)
-- !query
SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
-- !query schema
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double), true) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
Contributor

The legacy config is internal, and all the functions in one query will be either all legacy or all non-legacy, so I think we don't need to display the legacy flag value. We can override stringArgs in these functions (the base classes) to exclude the legacy flag.

Contributor

This also avoids all the changes to the explain golden files.

Contributor (Author)

Done.
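For illustration, a self-contained sketch of the stringArgs trick suggested above (toy classes, not Spark's actual ones; Catalyst's TreeNode.stringArgs defaults to all constructor arguments via productIterator):

  trait TreeNodeLike extends Product {
    // Default: every constructor argument appears in the node's string form.
    protected def stringArgs: Iterator[Any] = productIterator
    def simpleString: String =
      s"${getClass.getSimpleName.toLowerCase}(${stringArgs.mkString(", ")})"
  }

  case class VarPop(child: String, nullOnDivideByZero: Boolean) extends TreeNodeLike {
    // Exclude the internal legacy flag so plan strings (and the explain
    // golden files) stay unchanged: "varpop(a)" instead of "varpop(a, true)".
    override protected def stringArgs: Iterator[Any] = Iterator(child)
  }

  // VarPop("a", nullOnDivideByZero = true).simpleString == "varpop(a)"

This keeps golden files like the schema diff shown above stable, since the boolean never reaches the plan string.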

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34303/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34303/

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129696 has finished for PR 29983 at commit dc8efb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2020

Test build #129697 has finished for PR 29983 at commit 6eee3c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Ib36d81e7a89b2b6d7867b2448b9b2b599c17e5bb
@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34323/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34323/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129717 has finished for PR 29983 at commit 084c3fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@leanken-zz (Contributor, Author)

@cloud-fan if there are no further comments: the tests have passed and this is ready to merge.

Change-Id: Idc061ac89bb65f1c6a0f20517b2489aaa903a7eb
@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34340/

@SparkQA

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34340/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129734 has finished for PR 29983 at commit ddc522c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan cloud-fan closed this in dc697a8 Oct 13, 2020
aviatesk added a commit to aviatesk/deequ that referenced this pull request May 24, 2021
AFAIU the only required update is for <apache/spark#29983>.
In order to be consistent with the previous behavior and pass the
existing test suite, this PR is essentially equivalent to setting
`spark.sql.legacy.statisticalAggregate` to `true`.

Now the code is incompatible with spark-2.x and spark-3.0, so I'd
recommend supporting only Spark 3.1 and higher and Scala 2.12
from now on.
dariobig pushed a commit to fergonp/deequ that referenced this pull request Sep 2, 2022