-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces devision by zero in a case of empty table with analyzed statistics #21052
Conversation
…ision by zero can occur. In order to fix this, check was added.
… IN conditions, if the source table is empty, division by zero can occur. In order to fix this, check was added.
…ich were not satisfied) is queried and CBO is turned on, wrong statistics is used, which leads to ClassCastException in FilterEstimation.evaluateInSet
Regarding the devision by zero in EstimationUtils.scala#L166, I was not able to reproduce it here. ( Line 166 in 5cfd5fa
I can add check there too, in order to be really sure, that this never happens. |
@wzhfy @gatorsmile could you trigger the tests? |
ok to test |
cc @wzhfy Please review this. |
Test build #89316 has finished for PR 21052 at commit
|
@mshtelma Usually we describe PR using two sections: |
retest this please |
test("evaluateInSet with all zeros") { | ||
validateEstimatedStats( | ||
Filter(InSet(attrString, Set(3, 4, 5)), | ||
StatsTestPlan(Seq(attrString), 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change rowCount from 10
to 0
? this is more reasonable for an empty table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this makes sense.
done
Filter(InSet(attrString, Set(3, 4, 5)), | ||
StatsTestPlan(Seq(attrString), 10, | ||
AttributeMap(Seq(attrString -> | ||
ColumnStat(distinctCount = Some(0), min = Some(0), max = Some(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min
and max
should be None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
import testImplicits._ | ||
|
||
test("Simple queries must be working, if CBO is turned on") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we move it to StatisticsCollectionSuite
?
And I think a simple EXPLAIN command on an empty table can just cover the case? We can check the plan's stats (e.g. rowCount == 0) after explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved the test to StatisticsCollectionSuite
Done
val validQuerySet = hSet.filter { v => | ||
v != null && statsInterval.contains(Literal(v, dataType)) | ||
} | ||
if (colStat.min.isDefined && colStat.max.isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check ndv == 0
at the beginning and return Some(0.0
? then we don't have to make all these changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have removes the bigger if, and implemented all three checks with one small if
2)Reduced number of changed lines in FilterEstomation.evaluateInSet
Test build #89339 has finished for PR 21052 at commit
|
Test build #89349 has finished for PR 21052 at commit
|
retest this please |
Test build #89521 has finished for PR 21052 at commit
|
@gatorsmile the failed tests are not connected to the changes introduced in this PR. Would it make sense to run the test again ? |
retest this please |
Test build #89596 has finished for PR 21052 at commit
|
@maropu Thank you! |
retest this please |
Test build #89656 has finished for PR 21052 at commit
|
@@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared | |||
} | |||
} | |||
} | |||
|
|||
test("Simple queries must be working, if CBO is turned on") { | |||
withSQLConf(("spark.sql.cbo.enabled", "true")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: withSQLConf(SQLConf.CBO_ENABLED.key -> "true")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain() | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: drop this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.bucketBy(10, "id", "FLD1", "FLD2") | ||
.sortBy("id", "FLD1", "FLD2") | ||
.saveAsTable("TBL") | ||
spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you don't need the spark.
prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
WHERE t1.fld3 IN (-123.23,321.23) | ||
""".stripMargin) | ||
df2.createTempView("TBL2") | ||
spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this explain()
called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wzhfy has suggested calling explain in order to trigger query optimization and calling FilterEstimation.evaluateInSet method.
I can call collect() instead.
I think explain() is sufficient for this test.
@maropu thank you for the suggestions! I have implemented them and pushed the changes. |
Test build #89675 has finished for PR 21052 at commit
|
WHERE t1.fld3 IN (-123.23,321.23) | ||
""".stripMargin) | ||
df2.createTempView("TBL2") | ||
sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use explain()
. It will output the strings to the console. You can just do this:
sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
FROM tbl t1 | ||
JOIN tbl t2 on t1.id=t2.id | ||
WHERE t1.fld3 IN (-123.23,321.23) | ||
""".stripMargin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
"""
|SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
|FROM tbl t1
|JOIN tbl t2 on t1.id=t2.id
|WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
LGTM except two minor comments. |
@gatorsmile I have removed explain() and changed formatting |
Test build #89684 has finished for PR 21052 at commit
|
Thanks! Merged to master/2.3 |
…y zero in a case of empty table with analyzed statistics >What changes were proposed in this pull request? During evaluation of IN conditions, if the source data frame, is represented by a plan, that uses hive table with columns, which were previously analysed, and the plan has conditions for these fields, that cannot be satisfied (which leads us to an empty data frame), FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException. In order to fix this bug, method FilterEstimation.evaluateInSet at first checks, if distinct count is not zero, and also checks if colStat.min and colStat.max are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned. >How was this patch tested? In order to test the PR two tests were implemented: one in FilterEstimationSuite, that tests the plan with the statistics that violates the conditions mentioned above, and another one in StatisticsCollectionSuite, that test the whole process of analysis/optimisation of the query, that leads to the problems, mentioned in the first section. Author: Mykhailo Shtelma <[email protected]> Author: smikesh <[email protected]> Closes #21052 from mshtelma/filter_estimation_evaluateInSet_Bugs. (cherry picked from commit c48085a) Signed-off-by: gatorsmile <[email protected]>
Let me revert it from Spark 2.3 |
@mshtelma Could you submit a backport PR to Spark 2.3? |
@@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging { | |||
val dataType = attr.dataType | |||
var newNdv = ndv | |||
|
|||
if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why colStat.min.isEmpty || colStat.max.isEmpty
means empty output? string type always has no max/min
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we need to correct it in the next PR
@gatorsmile should I create new PR with these changes for 2.3 branch ? I will do this. Do we need new jira for 2.3 ? or should I reference the existing one ? |
See my PR #21147. We need to fix the issue first. |
During evaluation of IN conditions, if the source data frame, is represented by a plan, that uses hive table with columns, which were previously analysed, and the plan has conditions for these fields, that cannot be satisfied (which leads us to an empty data frame), FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException.
In order to fix this bug, method FilterEstimation.evaluateInSet at first checks, if distinct count is not zero, and also checks if colStat.min and colStat.max are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned.
In order to test the PR two tests were implemented: one in FilterEstimationSuite, that tests the plan with the statistics that violates the conditions mentioned above, and another one in StatisticsCollectionSuite, that test the whole process of analysis/optimisation of the query, that leads to the problems, mentioned in the first section.