[SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero #29983
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129579 has finished for PR 29983 at commit
Thank you for your contribution, @leanken.
cc @maropu
@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
    buildConf("spark.sql.legacy.centralMomentAgg.enabled")
Could you update the migration guide, too?
Looks like we don't need the `.enabled` suffix, to follow the other legacy configs.
Also, could you move this config close to the other legacy configs?
@@ -2775,6 +2775,16 @@ object SQLConf {
      .booleanConf
      .createWithDefault(false)

  val LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR =
nit: LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR -> LEGACY_CENTRAL_MOMENT_AGG
Thanks for the cc, @dongjoon-hyun!
      .internal()
      .doc("When set to true, stddev_samp and var_samp will return Double.NaN, " +
        "if applied to a set with a single element. Otherwise, will return 0.0, " +
        "which is aligned with TPCDS standard.")
I think we don't need to describe "which is aligned with TPCDS standard" here in the user-facing documentation.
@@ -456,25 +456,31 @@ class DataFrameAggregateSuite extends QueryTest
  }

  test("zero moments") {
How about organizing the tests like this? (I think it'd be better not to change the existing tests as much as possible):
test("zero moments") {
withSQLConf(SQLConf.LEGACY_CENTRAL_MOMENT_AGG_BEHAVIOR.key -> "true") {
// Don't touch the existing tests
val input = Seq((1, 2)).toDF("a", "b")
checkAnswer(
input.agg(stddev($"a"), stddev_samp($"a"), stddev_pop($"a"), variance($"a"),
var_samp($"a"), var_pop($"a"), skewness($"a"), kurtosis($"a")),
Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
Double.NaN, Double.NaN))
checkAnswer(
input.agg(
expr("stddev(a)"),
expr("stddev_samp(a)"),
expr("stddev_pop(a)"),
expr("variance(a)"),
expr("var_samp(a)"),
expr("var_pop(a)"),
expr("skewness(a)"),
expr("kurtosis(a)")),
Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
Double.NaN, Double.NaN))
}
}
test("SPARK-13860: xxxx") {
// Writes tests for the new behaviour
}
Do we really need to return 0.0 in that case? It looks like PostgreSQL/MySQL return null instead;
Sure. Let me look at more docs and check whether returning null matches the TPCDS answer for Q39. I'll reply to you later.
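Following up on the placeholder test above, a possible shape for the new-behaviour test (an illustrative sketch only, assuming the new null default; this is not necessarily the test that was eventually merged):

```scala
test("SPARK-13860: stddev/variance/skewness/kurtosis return null on a single-element set") {
  val input = Seq((1, 2)).toDF("a", "b")
  // With the new default, sample statistics that hit a divide-by-zero yield null
  // instead of Double.NaN; the population variants (stddev_pop, var_pop) still return 0.0.
  checkAnswer(
    input.agg(stddev($"a"), stddev_samp($"a"), variance($"a"),
      var_samp($"a"), skewness($"a"), kurtosis($"a")),
    Row(null, null, null, null, null, null))
}
```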
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129610 has finished for PR 29983 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129622 has finished for PR 29983 at commit
Kubernetes integration test starting
Kubernetes integration test status success
.doc("When set to true, central moment aggregation will return Double.NaN " + | ||
"if divide by zero occurred during calculation. " + | ||
"Otherwise, it will return null") |
Can we describe the Spark version of this legacy behavior? E.g., in what versions it returns NaN.
Sure. How about adding "Before version 3.1.0, it returns NaN by default."?
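Putting the review feedback together (renamed config, no `.enabled` suffix, doc text without the TPCDS mention, plus a note about the pre-3.1 behavior), the SQLConf entry presumably ends up along these lines; this is a sketch based on this thread, not necessarily the merged wording:

```scala
// Inside object SQLConf; the exact doc string is an assumption.
val LEGACY_STATISTICAL_AGGREGATE = buildConf("spark.sql.legacy.statisticalAggregate")
  .internal()
  .doc("When set to true, statistical aggregate functions return Double.NaN " +
    "if a divide-by-zero occurs during expression evaluation; otherwise, they return null. " +
    "Before version 3.1.0, NaN was returned by default in this case.")
  .version("3.1.0")
  .booleanConf
  .createWithDefault(false)
```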
Test build #129635 has finished for PR 29983 at commit
Kubernetes integration test starting
Seems plausible to me.
Kubernetes integration test status success
Change-Id: Ia173d98cde3dee0e9f36dc1e1121879318981590
Kubernetes integration test starting
Kubernetes integration test status success
Change-Id: I463c1f9696eaf975f0333d6120f749263fbc1592
@@ -141,17 +141,17 @@ struct<var_samp(CAST(CAST(udf(ansi_cast(ansi_cast(b as decimal(38,0)) as string)
-- !query
SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
-- !query schema
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
struct<CAST(udf(ansi_cast(var_pop(ansi_cast(1.0 as double), true) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(ansi_cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
The legacy config is internal, and all the functions in one query should be either all legacy or all non-legacy, so I think we don't need to display the legacy flag value. We can override stringArgs in these functions (the base classes) to exclude the legacy flag.
This also avoids all the changes to the explain golden files.
done
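For illustration, a minimal, self-contained sketch of the `stringArgs` idea (toy stand-in classes, not Spark's actual expression hierarchy): the rendered plan string is built from `stringArgs`, so excluding the internal legacy flag keeps explain output and the golden files unchanged.

```scala
trait SimpleNode extends Product {
  // By default, every constructor field appears in the rendered string.
  def stringArgs: Iterator[Any] = productIterator
  def render: String = s"$productPrefix(${stringArgs.mkString(", ")})"
}

case class VarPop(child: String, nullOnDivideByZero: Boolean) extends SimpleNode {
  // Drop the internal legacy flag from the string form, mirroring the suggestion above.
  override def stringArgs: Iterator[Any] = Iterator(child)
}

object StringArgsDemo extends App {
  println(VarPop("a", nullOnDivideByZero = true).render) // prints: VarPop(a)
}
```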
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129696 has finished for PR 29983 at commit
Test build #129697 has finished for PR 29983 at commit
Change-Id: Ib36d81e7a89b2b6d7867b2448b9b2b599c17e5bb
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129717 has finished for PR 29983 at commit
@cloud-fan if there are no further comments, the tests have passed and this is ready to merge.
Change-Id: Idc061ac89bb65f1c6a0f20517b2489aaa903a7eb
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129734 has finished for PR 29983 at commit
thanks, merging to master!
AFAIU the only requirement is the update for <apache/spark#29983>. In order to be consistent with the previous behavior and pass the existing test suite, this PR is essentially equivalent to setting `spark.sql.legacy.statisticalAggregate` to `true`. Now the code is incompatible with Spark 2.x and Spark 3.0, so I'd recommend only supporting Spark 3.1 and higher and Scala 2.12 from now on.
What changes were proposed in this pull request?
As SPARK-13860 states, TPCDS Query 39 returns wrong results when run with Spark SQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas Spark SQL returns Double.NaN, which produces the wrong result.
This PR returns null by default to align with the TPCDS standard, and adds an extra legacy config to fall back to the NaN behavior.
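To make the divide-by-zero concrete, here is a plain-Scala illustration of where the NaN comes from (it mirrors the sample-variance formula, not Spark's CentralMomentAgg implementation):

```scala
// Sample variance divides the second central moment by (n - 1).
def varSamp(xs: Seq[Double]): Double = {
  val n = xs.length
  val mean = xs.sum / n
  val m2 = xs.map(x => (x - mean) * (x - mean)).sum
  m2 / (n - 1) // for n == 1 this is 0.0 / 0, i.e. Double.NaN in double arithmetic
}

// varSamp(Seq(42.0)).isNaN == true, whereas the TPCDS answer expects NULL here.
```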
Why are the changes needed?
SQL correctness issue.
Does this PR introduce any user-facing change?
Yes. See the SQL migration guide:

In Spark 3.1, the statistical aggregation functions `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr` will return `NULL` instead of `Double.NaN` when a divide-by-zero occurs during expression evaluation, for example, when `stddev_samp` is applied to a single-element set. In Spark version 3.0 and earlier, `Double.NaN` is returned in such cases. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
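A hedged sketch of the user-facing difference (it assumes a local Spark 3.1+ build and that the internal config can be toggled at runtime):

```scala
import org.apache.spark.sql.SparkSession

object StatisticalAggDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
  import spark.implicits._

  val single = Seq(1.0).toDF("a")

  // Spark 3.1 default: NULL, matching the TPCDS answer set.
  single.selectExpr("stddev_samp(a)", "var_samp(a)").show()

  // Restore the pre-3.1 behavior: Double.NaN.
  spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
  single.selectExpr("stddev_samp(a)", "var_samp(a)").show()

  spark.stop()
}
```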
How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and the legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R test cases to reflect the new default return-null behavior.