Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces devision by zero in a case of empty table with analyzed statistics #21052

Closed

Conversation

mshtelma
Copy link

@mshtelma mshtelma commented Apr 12, 2018

What changes were proposed in this pull request?

During evaluation of IN conditions, if the source data frame, is represented by a plan, that uses hive table with columns, which were previously analysed, and the plan has conditions for these fields, that cannot be satisfied (which leads us to an empty data frame), FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException.
In order to fix this bug, method FilterEstimation.evaluateInSet at first checks, if distinct count is not zero, and also checks if colStat.min and colStat.max are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned.

How was this patch tested?

In order to test the PR two tests were implemented: one in FilterEstimationSuite, that tests the plan with the statistics that violates the conditions mentioned above, and another one in StatisticsCollectionSuite, that test the whole process of analysis/optimisation of the query, that leads to the problems, mentioned in the first section.

Mykhailo Shtelma added 3 commits April 12, 2018 09:42
…ision by zero can occur. In order to fix this, check was added.
… IN conditions, if the source table is empty, division by zero can occur. In order to fix this, check was added.
…ich were not satisfied) is queried and CBO is turned on, wrong statistics is used, which leads to ClassCastException in FilterEstimation.evaluateInSet
@mshtelma
Copy link
Author

Regarding the devision by zero in EstimationUtils.scala#L166, I was not able to reproduce it here. (

)
I can add check there too, in order to be really sure, that this never happens.

@maropu
Copy link
Member

maropu commented Apr 12, 2018

@wzhfy @gatorsmile could you trigger the tests?

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

cc @wzhfy Please review this.

@SparkQA
Copy link

SparkQA commented Apr 13, 2018

Test build #89316 has finished for PR 21052 at commit 74b6ebd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor

wzhfy commented Apr 13, 2018

@mshtelma Usually we describe PR using two sections: What changes were proposed in this pull request? and How was this patch tested?. I think it should be in the template when we open a PR. Could you please update PR description based on the template?

@wzhfy
Copy link
Contributor

wzhfy commented Apr 13, 2018

retest this please

test("evaluateInSet with all zeros") {
validateEstimatedStats(
Filter(InSet(attrString, Set(3, 4, 5)),
StatsTestPlan(Seq(attrString), 10,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change rowCount from 10 to 0? this is more reasonable for an empty table.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this makes sense.
done

Filter(InSet(attrString, Set(3, 4, 5)),
StatsTestPlan(Seq(attrString), 10,
AttributeMap(Seq(attrString ->
ColumnStat(distinctCount = Some(0), min = Some(0), max = Some(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min and max should be None?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


import testImplicits._

test("Simple queries must be working, if CBO is turned on") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move it to StatisticsCollectionSuite?
And I think a simple EXPLAIN command on an empty table can just cover the case? We can check the plan's stats (e.g. rowCount == 0) after explain.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved the test to StatisticsCollectionSuite
Done

val validQuerySet = hSet.filter { v =>
v != null && statsInterval.contains(Literal(v, dataType))
}
if (colStat.min.isDefined && colStat.max.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check ndv == 0 at the beginning and return Some(0.0? then we don't have to make all these changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have removes the bigger if, and implemented all three checks with one small if

2)Reduced number of changed lines in FilterEstomation.evaluateInSet
@SparkQA
Copy link

SparkQA commented Apr 13, 2018

Test build #89339 has finished for PR 21052 at commit 74b6ebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 13, 2018

Test build #89349 has finished for PR 21052 at commit 0faa789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma mshtelma changed the title [SPARK-23799] FilterEstimation.evaluateInSet produces devision by zero in a case of empty table with analyzed statistics [SPARK-23799][SQL] FilterEstimation.evaluateInSet produces devision by zero in a case of empty table with analyzed statistics Apr 18, 2018
@mshtelma
Copy link
Author

@wzhfy @maropu Hi guys, is there anything else I should add/change to the PR ?

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89521 has finished for PR 21052 at commit 0faa789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma
Copy link
Author

@gatorsmile the failed tests are not connected to the changes introduced in this PR. Would it make sense to run the test again ?

@maropu
Copy link
Member

maropu commented Apr 19, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89596 has finished for PR 21052 at commit 0faa789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma
Copy link
Author

mshtelma commented Apr 20, 2018

@maropu Thank you!
this time, it were completely different tests (HiveClientSuites) that have failed.

@kiszk
Copy link
Member

kiszk commented Apr 20, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89656 has finished for PR 21052 at commit 0faa789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
}
}
}

test("Simple queries must be working, if CBO is turned on") {
withSQLConf(("spark.sql.cbo.enabled", "true")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: withSQLConf(SQLConf.CBO_ENABLED.key -> "true")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop this line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.bucketBy(10, "id", "FLD1", "FLD2")
.sortBy("id", "FLD1", "FLD2")
.saveAsTable("TBL")
spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need the spark. prefix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
df2.createTempView("TBL2")
spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this explain() called?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wzhfy has suggested calling explain in order to trigger query optimization and calling FilterEstimation.evaluateInSet method.
I can call collect() instead.
I think explain() is sufficient for this test.

@mshtelma
Copy link
Author

@maropu thank you for the suggestions! I have implemented them and pushed the changes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2018

Test build #89675 has finished for PR 21052 at commit 8d21488.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
df2.createTempView("TBL2")
sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use explain(). It will output the strings to the console. You can just do this:

sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

FROM tbl t1
JOIN tbl t2 on t1.id=t2.id
WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

          """
            |SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
            |FROM tbl t1
            |JOIN tbl t2 on t1.id=t2.id
            |WHERE  t1.fld3 IN (-123.23,321.23)
          """.stripMargin)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@gatorsmile
Copy link
Member

LGTM except two minor comments.

@mshtelma
Copy link
Author

@gatorsmile I have removed explain() and changed formatting

@SparkQA
Copy link

SparkQA commented Apr 22, 2018

Test build #89684 has finished for PR 21052 at commit 8369cbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Apr 22, 2018
…y zero in a case of empty table with analyzed statistics

>What changes were proposed in this pull request?

During evaluation of IN conditions, if the source data frame, is represented by a plan, that uses hive table with columns, which were previously analysed, and the plan has conditions for these fields, that cannot be satisfied (which leads us to an empty data frame), FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException.
In order to fix this bug, method FilterEstimation.evaluateInSet at first checks, if distinct count is not zero, and also checks if colStat.min and colStat.max  are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned.

>How was this patch tested?

In order to test the PR two tests were implemented: one in FilterEstimationSuite, that tests the plan with the statistics that violates the conditions mentioned above,  and another one in StatisticsCollectionSuite, that test the whole process of analysis/optimisation of the query, that leads to the problems, mentioned in the first section.

Author: Mykhailo Shtelma <[email protected]>
Author: smikesh <[email protected]>

Closes #21052 from mshtelma/filter_estimation_evaluateInSet_Bugs.

(cherry picked from commit c48085a)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in c48085a Apr 22, 2018
@tdas
Copy link
Contributor

tdas commented Apr 23, 2018

@gatorsmile
Copy link
Member

Let me revert it from Spark 2.3

@gatorsmile
Copy link
Member

@mshtelma Could you submit a backport PR to Spark 2.3?

@@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging {
val dataType = attr.dataType
var newNdv = ndv

if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why colStat.min.isEmpty || colStat.max.isEmpty means empty output? string type always has no max/min

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to correct it in the next PR

@mshtelma
Copy link
Author

@gatorsmile should I create new PR with these changes for 2.3 branch ? I will do this. Do we need new jira for 2.3 ? or should I reference the existing one ?

@gatorsmile
Copy link
Member

See my PR #21147. We need to fix the issue first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants