[SPARK-9192][SQL] add initialization phase for nondeterministic expression #7535

Closed
wants to merge 6 commits

Conversation

cloud-fan
Contributor

Currently, nondeterministic expressions are broken without an explicit initialization phase.

Take MonotonicallyIncreasingID as an example. This expression needs mutable state to remember how many times it has been evaluated, so we use @transient var count: Long there. Because the field is transient, count is reset to 0, and only to 0, whenever the expression is serialized and deserialized, since a deserialized transient variable gets its default value. There is no way to give count a different initial value until we add an explicit initialization phase.

Another use case is local execution for LocalRelation: there is no serialize/deserialize step there, so we cannot reset its mutable states.
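
To make the description concrete, here is a minimal, hypothetical sketch of the idea (Initializable, initInternal, and MonotonicallyIncreasingIDSketch are illustrative names, not the merged Spark API): the mutable counter is reset by an explicit initialize() call instead of relying on serialization to reset it.

```scala
// Hypothetical sketch, not Spark's actual classes.
trait Initializable {
  protected var initialized = false
  final def initialize(): Unit = { initInternal(); initialized = true }
  protected def initInternal(): Unit
}

class MonotonicallyIncreasingIDSketch extends Initializable with Serializable {
  // Transient so it is not shipped with the serialized expression; its value
  // is set by initialize(), not by deserialization.
  @transient private var count: Long = 0L

  override protected def initInternal(): Unit = { count = 0L }

  def eval(partitionId: Long): Long = {
    require(initialized, "expression must be initialized before evaluation")
    val id = (partitionId << 33) + count   // illustrative encoding only
    count += 1
    id
  }
}
```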

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37828 has finished for PR 7535 at commit 1ef6be0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37829 has finished for PR 7535 at commit db65fcd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Initializable
    • abstract class RDG(seed: Long) extends LeafExpression with Serializable with Initializable

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37841 has finished for PR 7535 at commit 845e01e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Initializable
    • abstract class RDG(seed: Long) extends LeafExpression with Serializable with Initializable

@rxin
Contributor

rxin commented Jul 20, 2015

I discussed this with @marmbrus. Here's the suggestion:

  1. Get rid of the Initializable trait and move initialize into Expression itself, marking it final. Its implementation recursively initializes the children and then calls an internal version for the expression to initialize itself, setting an initialized flag to true afterwards (sketched below).
  2. Add a check in the various implementations of nullSafeEval that fails if initialized is false; this helps us catch problems.
  3. In a few places we use the Cast expression to do proper casting. Consider moving that logic out of Cast into a cast util class and calling it directly instead, since it should be stateless. (Is it always stateless? Do the timestamp casts etc. rely on Calendar?)
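
A rough sketch of the shape suggested in points 1 and 2, assuming initialize lives on Expression itself (initInternal and requireInitialized are placeholder names, not the merged code):

```scala
// Sketch only: assumed names, simplified signatures.
abstract class Expression {
  def children: Seq[Expression]

  protected var initialized = false

  // Point 1: a final entry point that initializes the children first, then this
  // node itself, and flips the flag afterwards.
  final def initialize(): Unit = {
    children.foreach(_.initialize())
    initInternal()
    initialized = true
  }

  // Subclasses with mutable state override this hook.
  protected def initInternal(): Unit = {}

  // Point 2: evaluation paths can call this guard to fail fast when an
  // expression is evaluated before being initialized.
  protected def requireInitialized(): Unit =
    assert(initialized, s"$this was evaluated before initialize() was called")
}
```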

@cloud-fan changed the title from [SPARK-9192][SQL][WIP] add initialization phase for expression to [SPARK-9192][SQL] add initialization phase for nondeterministic expression on Jul 22, 2015
}

def checkAnalysis(plan: LogicalPlan): Unit = {
  // We transform up and order the rules so as to catch the first possible failure instead
  // of the result of cascading resolution failures.
  plan.foreachUp {

    case operator: LogicalPlan =>
Contributor Author

case operator: LogicalPlan catches all cases, so why not just use a normal function here instead of a pattern match?
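
Illustrating the point with a toy tree type (hypothetical, not Spark's TreeNode): when the single case matches every node anyway, a plain function expresses the same traversal without pattern matching.

```scala
// Toy stand-in for a plan tree; runnable as a REPL snippet.
case class Node(name: String, children: Seq[Node] = Nil) {
  def foreachUp(f: Node => Unit): Unit = {
    children.foreach(_.foreachUp(f))
    f(this)
  }
}

val plan = Node("Project", Seq(Node("Filter", Seq(Node("Relation")))))

// Pattern-match style: the lone case catches every node.
plan.foreachUp { case n: Node => println(n.name) }

// Equivalent plain function, as the comment suggests.
plan.foreachUp(n => println(n.name))
```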

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38087 has finished for PR 7535 at commit 4dbed66.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Hi @marmbrus @rxin, I have updated the description and code. Can you review it?

@SparkQA

SparkQA commented Jul 22, 2015

Test build #1158 has finished for PR 7535 at commit 4dbed66.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

-  case e: Generator => true
-}).nonEmpty
+  case e: Generator => e
+}).length > 1
Contributor Author

An unrelated but small fix: checking for multiple generators should use length > 1 instead of nonEmpty.
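
A tiny illustration of the difference (made-up values): a query with exactly one generator is legal, so nonEmpty is too strict a check.

```scala
// One generator collected from the plan: this should be allowed.
val generators = Seq("explode(a)")

generators.nonEmpty     // true  -> would wrongly flag the query
generators.length > 1   // false -> correctly lets it through
```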

Contributor

oops

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38091 has finished for PR 7535 at commit a055fe0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38176 has finished for PR 7535 at commit 60929bf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38194 has finished for PR 7535 at commit bad9a92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cc @rxin @marmbrus It's ready for review.

failAnalysis(
s"""nondeterministic expressions are only allowed in Project or Filter, found:
| ${o.expressions.map(_.prettyString).mkString(",")}
|in operator ${operator.simpleString}
Contributor

this is not aligned ...
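
For reference, an aligned version of the snippet above might look like the following (a sketch only; it assumes the surrounding CheckAnalysis context, including failAnalysis, o, and operator):

```scala
// Margin bars line up so stripMargin trims the leading whitespace consistently.
failAnalysis(
  s"""nondeterministic expressions are only allowed in Project or Filter, found:
     | ${o.expressions.map(_.prettyString).mkString(",")}
     |in operator ${operator.simpleString}
   """.stripMargin)
```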

@cloud-fan force-pushed the init branch 3 times, most recently from 62c28a7 to 6c6f332 on July 25, 2015 at 17:02
@SparkQA

SparkQA commented Jul 25, 2015

Test build #38438 has finished for PR 7535 at commit 82f8859.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38440 has finished for PR 7535 at commit 62c28a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38441 has finished for PR 7535 at commit 6c6f332.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// todo: It's hard to write a general rule to pull out nondeterministic expressions
// from a LogicalPlan; currently we only do it for a UnaryNode whose output
// schema is the same as its child's.
case p: UnaryNode if p.output == p.child.output && p.expressions.exists(!_.deterministic) =>
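
To illustrate what pulling the expression out of such a UnaryNode looks like (a hypothetical plan rendering, not actual optimizer output): the nondeterministic call is evaluated once in a Project beneath the operator and referenced by attribute above it, so the operator's output schema stays the same.

```
Before:  Sort [rand() ASC]
           +- Relation[a, b]

After:   Project [a, b]
           +- Sort [_nondeterministic ASC]
                +- Project [a, b, rand() AS _nondeterministic]
                     +- Relation[a, b]
```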

@rxin
Contributor

rxin commented Jul 25, 2015

I'm going to merge this one first. It would be great to get a holistic review of all the random-related optimizer/analyzer changes before the 1.5 release.

@asfgit closed this in 2c94d0f on Jul 25, 2015
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 10, 2017
…oin Conditions

## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters can be pushed down into the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, the Analyzer [blocks users from adding non-deterministic conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (for details, see PR apache#7535).

We should not push down non-deterministic conditions; otherwise, we would need to explicitly initialize the non-deterministic expressions. This PR simply blocks it.
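
A hedged sketch of the kind of analysis-time guard this change describes (the case placement, names, and error message are assumptions, not the exact patch): a Join whose condition contains a nondeterministic expression is rejected during checkAnalysis.

```scala
// Sketch only; assumes the surrounding CheckAnalysis pattern-match context.
case j: Join if j.condition.exists(cond => !cond.deterministic) =>
  failAnalysis(
    "nondeterministic expressions are not allowed in join conditions, found: " +
      j.condition.map(_.sql).mkString(", "))
```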

### How was this patch tested?
Added a test case

Author: Xiao Li <[email protected]>

Closes apache#17585 from gatorsmile/joinRandCondition.