[SPARK-9192][SQL] add initialization phase for nondeterministic expression #7535

Closed
wants to merge 6 commits

Conversation

cloud-fan
Contributor

Currently, nondeterministic expressions are broken without an explicit initialization phase.

Take MonotonicallyIncreasingID as an example. This expression needs mutable state to remember how many times it has been evaluated, so we use @transient var count: Long there. Because the field is transient, count is reset to 0, and only to 0, whenever the expression is serialized and deserialized, since a deserialized transient variable gets its default value. There is no way to give count a different initial value until we add an explicit initialization phase.

Another use case is local execution for LocalRelation: there is no serialize/deserialize step there, so we cannot reset its mutable states.
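
To make the description concrete, here is a minimal, hypothetical sketch of the idea (Initializable, initInternal, and MonotonicallyIncreasingIDSketch are illustrative names, not the merged Spark API): the mutable counter is reset by an explicit initialize() call instead of relying on serialization to reset it.

```scala
// Hypothetical sketch, not Spark's actual classes.
trait Initializable {
  protected var initialized = false
  final def initialize(): Unit = { initInternal(); initialized = true }
  protected def initInternal(): Unit
}

class MonotonicallyIncreasingIDSketch extends Initializable with Serializable {
  // Transient so it is not shipped with the serialized expression; its value
  // is set by initialize(), not by deserialization.
  @transient private var count: Long = 0L

  override protected def initInternal(): Unit = { count = 0L }

  def eval(partitionId: Long): Long = {
    require(initialized, "expression must be initialized before evaluation")
    val id = (partitionId << 33) + count   // illustrative encoding only
    count += 1
    id
  }
}
```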

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37828 has finished for PR 7535 at commit 1ef6be0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37829 has finished for PR 7535 at commit db65fcd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Initializable
    • abstract class RDG(seed: Long) extends LeafExpression with Serializable with Initializable

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37841 has finished for PR 7535 at commit 845e01e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Initializable
    • abstract class RDG(seed: Long) extends LeafExpression with Serializable with Initializable

@rxin
Contributor

rxin commented Jul 20, 2015

I discussed this with @marmbrus. Here's the suggestion:

  1. Get rid of the Initializable trait and move initialize into Expression itself, marking it final. Its implementation recursively initializes the children and then calls an internal version for the expression to initialize itself, setting an initialized flag to true afterwards (sketched below).
  2. Add a check in the various implementations of nullSafeEval that fails if initialized is false; this helps us catch problems.
  3. In a few places we use the Cast expression to do proper casting. Consider moving that logic out of Cast into a cast util class and calling it directly instead, since it should be stateless. (Is it always stateless? Do the timestamp casts etc. rely on Calendar?)
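
A rough sketch of the shape suggested in points 1 and 2, assuming initialize lives on Expression itself (initInternal and requireInitialized are placeholder names, not the merged code):

```scala
// Sketch only: assumed names, simplified signatures.
abstract class Expression {
  def children: Seq[Expression]

  protected var initialized = false

  // Point 1: a final entry point that initializes the children first, then this
  // node itself, and flips the flag afterwards.
  final def initialize(): Unit = {
    children.foreach(_.initialize())
    initInternal()
    initialized = true
  }

  // Subclasses with mutable state override this hook.
  protected def initInternal(): Unit = {}

  // Point 2: evaluation paths can call this guard to fail fast when an
  // expression is evaluated before being initialized.
  protected def requireInitialized(): Unit =
    assert(initialized, s"$this was evaluated before initialize() was called")
}
```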

@cloud-fan changed the title from [SPARK-9192][SQL][WIP] add initialization phase for expression to [SPARK-9192][SQL] add initialization phase for nondeterministic expression on Jul 22, 2015
}

def checkAnalysis(plan: LogicalPlan): Unit = {
  // We transform up and order the rules so as to catch the first possible failure instead
  // of the result of cascading resolution failures.
  plan.foreachUp {

    case operator: LogicalPlan =>
Contributor Author

case operator: LogicalPlan catches all cases, so why not just use a normal function here instead of a pattern match?
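
Illustrating the point with a toy tree type (hypothetical, not Spark's TreeNode): when the single case matches every node anyway, a plain function expresses the same traversal without pattern matching.

```scala
// Toy stand-in for a plan tree; runnable as a REPL snippet.
case class Node(name: String, children: Seq[Node] = Nil) {
  def foreachUp(f: Node => Unit): Unit = {
    children.foreach(_.foreachUp(f))
    f(this)
  }
}

val plan = Node("Project", Seq(Node("Filter", Seq(Node("Relation")))))

// Pattern-match style: the lone case catches every node.
plan.foreachUp { case n: Node => println(n.name) }

// Equivalent plain function, as the comment suggests.
plan.foreachUp(n => println(n.name))
```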

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38087 has finished for PR 7535 at commit 4dbed66.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Hi @marmbrus @rxin, I have updated the description and code. Can you review it?

@SparkQA

SparkQA commented Jul 22, 2015

Test build #1158 has finished for PR 7535 at commit 4dbed66.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

-  case e: Generator => true
-}).nonEmpty
+  case e: Generator => e
+}).length > 1
Contributor Author

An unrelated but small fix: checking for multiple generators should use length > 1 instead of nonEmpty.
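
A tiny illustration of the difference (made-up values): a query with exactly one generator is legal, so nonEmpty is too strict a check.

```scala
// One generator collected from the plan: this should be allowed.
val generators = Seq("explode(a)")

generators.nonEmpty     // true  -> would wrongly flag the query
generators.length > 1   // false -> correctly lets it through
```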

Contributor

oops

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38091 has finished for PR 7535 at commit a055fe0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38176 has finished for PR 7535 at commit 60929bf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38194 has finished for PR 7535 at commit bad9a92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cc @rxin @marmbrus It's ready for review.

failAnalysis(
s"""nondeterministic expressions are only allowed in Project or Filter, found:
| ${o.expressions.map(_.prettyString).mkString(",")}
|in operator ${operator.simpleString}
Contributor

this is not aligned ...
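
For reference, an aligned version of the snippet above might look like the following (a sketch only; it assumes the surrounding CheckAnalysis context, including failAnalysis, o, and operator):

```scala
// Margin bars line up so stripMargin trims the leading whitespace consistently.
failAnalysis(
  s"""nondeterministic expressions are only allowed in Project or Filter, found:
     | ${o.expressions.map(_.prettyString).mkString(",")}
     |in operator ${operator.simpleString}
   """.stripMargin)
```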

@cloud-fan force-pushed the init branch 3 times, most recently from 62c28a7 to 6c6f332 on July 25, 2015 at 17:02
@SparkQA

SparkQA commented Jul 25, 2015

Test build #38438 has finished for PR 7535 at commit 82f8859.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38440 has finished for PR 7535 at commit 62c28a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38441 has finished for PR 7535 at commit 6c6f332.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// todo: It's hard to write a general rule to pull out nondeterministic expressions
// from a LogicalPlan; currently we only do it for a UnaryNode whose output
// schema is the same as its child's.
case p: UnaryNode if p.output == p.child.output && p.expressions.exists(!_.deterministic) =>
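
To illustrate what pulling the expression out of such a UnaryNode looks like (a hypothetical plan rendering, not actual optimizer output): the nondeterministic call is evaluated once in a Project beneath the operator and referenced by attribute above it, so the operator's output schema stays the same.

```
Before:  Sort [rand() ASC]
           +- Relation[a, b]

After:   Project [a, b]
           +- Sort [_nondeterministic ASC]
                +- Project [a, b, rand() AS _nondeterministic]
                     +- Relation[a, b]
```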

@rxin
Contributor

rxin commented Jul 25, 2015

I'm going to merge this one first. It would be great to get a holistic review of all the random-related optimizer/analyzer changes before the 1.5 release.

@asfgit closed this in 2c94d0f on Jul 25, 2015
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 10, 2017
…oin Conditions

## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters can be pushed down into the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, the Analyzer [blocks users from adding non-deterministic conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (for details, see PR apache#7535).

We should not push down non-deterministic conditions; otherwise, we would need to explicitly initialize the non-deterministic expressions. This PR simply blocks it.
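
A hedged sketch of the kind of analysis-time guard this change describes (the case placement, names, and error message are assumptions, not the exact patch): a Join whose condition contains a nondeterministic expression is rejected during checkAnalysis.

```scala
// Sketch only; assumes the surrounding CheckAnalysis pattern-match context.
case j: Join if j.condition.exists(cond => !cond.deterministic) =>
  failAnalysis(
    "nondeterministic expressions are not allowed in join conditions, found: " +
      j.condition.map(_.sql).mkString(", "))
```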

### How was this patch tested?
Added a test case

Author: Xiao Li <[email protected]>

Closes apache#17585 from gatorsmile/joinRandCondition.