[SPARK-18589] [SQL] Fix Python UDF accessing attributes from both side of join #16581

davies · 2017-01-13T20:04:03Z

What changes were proposed in this pull request?

PythonUDF is unevaluable and therefore cannot be used inside a join condition. Currently, the optimizer pushes a PythonUDF that accesses attributes from both sides of a join into the join condition, and the query then fails to plan.

This PR fixes the issue by checking whether an expression is evaluable before pushing it into the Join.
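To illustrate the idea, here is a minimal, self-contained Python sketch of the check. It is a toy model, not Catalyst code: the `Expr` class, the `evaluable` flag, and `split_predicates` are hypothetical names invented for this example. It only shows the principle that a predicate containing an unevaluable node stays in a filter above the join instead of becoming part of the join condition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Expr:
    """Toy expression node (hypothetical, not a Spark class).

    `evaluable` is False for nodes that cannot be evaluated directly,
    such as a Python UDF in this model.
    """
    name: str
    children: List["Expr"] = field(default_factory=list)
    evaluable: bool = True


def can_evaluate_within_join(expr: Expr) -> bool:
    """True only if no node in the expression tree is unevaluable."""
    return expr.evaluable and all(
        can_evaluate_within_join(child) for child in expr.children
    )


def split_predicates(predicates: List[Expr]) -> Tuple[List[Expr], List[Expr]]:
    """Return (join_conditions, post_join_filters).

    Only predicates that pass the evaluability check are pushed into the join.
    """
    join_conditions = [p for p in predicates if can_evaluate_within_join(p)]
    post_join_filters = [p for p in predicates if not can_evaluate_within_join(p)]
    return join_conditions, post_join_filters


# a.id = b.id is a plain comparison; py_udf(a.x, b.y) reads both join sides.
equal_ids = Expr("=", [Expr("a.id"), Expr("b.id")])
udf_pred = Expr("py_udf", [Expr("a.x"), Expr("b.y")], evaluable=False)

join_conditions, post_join_filters = split_predicates([equal_ids, udf_pred])
print([p.name for p in join_conditions])    # ['=']
print([p.name for p in post_join_filters])  # ['py_udf']
```

In this model the equality predicate becomes the join condition, while the UDF predicate remains a filter evaluated after the join.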

How was this patch tested?

Add a regression test.

davies · 2017-01-13T20:04:27Z

cc @hvanhovell

SparkQA · 2017-01-13T20:09:44Z

Test build #71346 has finished for PR 16581 at commit 95d73fc.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-14T01:27:44Z

Test build #71349 has finished for PR 16581 at commit c45126b.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-01-14T07:43:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+      case e: SubqueryExpression =>
+        // non-correlated subquery will be replaced as literal
+        e.children.nonEmpty
+      case e: Unevaluable => true


We need more documentation here on why this should be considered evaluable as a join condition.

For example, just looking at this code, I have no idea why Unevaluable is evaluable.

hvanhovell · 2017-01-14

Unevaluable is not evaluable. This block looks for any sub-expression that cannot be evaluated in a join, and the result is then negated with `isEmpty`. I have to admit that we should document this.
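The find-then-negate pattern discussed in this thread can be sketched in Python as follows. This is an illustration under stated assumptions, not the Scala code from the PR: the `Node` class and its `kind` tags are hypothetical, and `find(...) is None` stands in for the `isEmpty` check on the result of the search.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Toy expression node (hypothetical).

    `kind` is one of the made-up tags "attribute", "subquery",
    "unevaluable", or "other".
    """
    kind: str
    children: List["Node"] = field(default_factory=list)

    def find(self, pred: Callable[["Node"], bool]) -> Optional["Node"]:
        """Return the first node (pre-order) matching `pred`, or None."""
        if pred(self):
            return self
        for child in self.children:
            hit = child.find(pred)
            if hit is not None:
                return hit
        return None


def blocks_join_evaluation(node: Node) -> bool:
    """True for a node that can NOT be evaluated inside a join condition."""
    if node.kind == "subquery":
        # A subquery without children is treated as non-correlated and is
        # assumed to be replaced by a literal later, so it does not block.
        return len(node.children) > 0
    return node.kind == "unevaluable"


def can_evaluate_within_join(expr: Node) -> bool:
    """Search for a blocking node, then negate: no hit means evaluable."""
    return expr.find(blocks_join_evaluation) is None


with_udf = Node("other", [Node("attribute"), Node("unevaluable", [Node("attribute")])])
with_plain_subquery = Node("other", [Node("attribute"), Node("subquery")])
correlated_subquery = Node("subquery", [Node("attribute")])

print(can_evaluate_within_join(with_udf))             # False
print(can_evaluate_within_join(with_plain_subquery))  # True
print(can_evaluate_within_join(correlated_subquery))  # False
```

Naming the inner predicate for what it detects (a blocking node) makes the negation explicit, which is the point the review comment asks to have documented.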

rxin · 2017-01-14T07:43:47Z

python/pyspark/sql/tests.py

@@ -342,6 +342,14 @@ def test_udf_in_filter_on_top_of_outer_join(self):
        df = df.withColumn('b', udf(lambda x: 'x')(df.a))
        self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

+    def test_udf_in_filter_on_top_of_join(self):


This should reference the JIRA number.

davies · 2017-01-17T17:51:23Z

python/pyspark/sql/tests.py

@@ -342,6 +342,15 @@ def test_udf_in_filter_on_top_of_outer_join(self):
        df = df.withColumn('b', udf(lambda x: 'x')(df.a))
        self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

+    def test_udf_in_filter_on_top_of_join(self):


SparkQA · 2017-01-17T22:03:46Z

Test build #71520 has finished for PR 16581 at commit e4db820.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-19T06:39:44Z

Test build #3541 has started for PR 16581 at commit e4db820.

SparkQA · 2017-01-19T20:06:06Z

Test build #3542 has finished for PR 16581 at commit d6bba37.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-19T20:09:17Z

Test build #71671 has finished for PR 16581 at commit d6bba37.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-20T02:59:35Z

Test build #71681 has finished for PR 16581 at commit f720c85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-01-21T00:11:00Z

LGTM - merging to master.

hvanhovell · 2017-01-21T00:12:19Z

I cannot backport it; could you open a PR for 2.1?

… of join

PythonUDF is unevaluable and therefore cannot be used inside a join condition. Currently, the optimizer pushes a PythonUDF that accesses attributes from both sides of a join into the join condition, and the query then fails to plan.

This PR fixes the issue by checking whether an expression is evaluable before pushing it into the Join.

Add a regression test.

Author: Davies Liu <[email protected]>

Closes #16581 from davies/pyudf_join.

davies · 2017-01-21T05:11:53Z

Cherry-picked into 2.1 branch.

… of join in join conditions

## What changes were proposed in this pull request?

Thanks to bahchis for reporting this. This is a follow-up to #16581: it fixes the case of a Python UDF accessing attributes from both sides of a join in the join condition.

## How was this patch tested?

Add regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes #22326 from xuanyuanking/SPARK-25314.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2a8cbfd)
Signed-off-by: Wenchen Fan <[email protected]>

Fix Python UDF accessing attributes from both side of join

95d73fc

fix style

c45126b

rxin reviewed Jan 14, 2017

address comments

e4db820

davies commented Jan 17, 2017

bug fix

d6bba37

rollback change, fix test

f720c85

asfgit closed this in 9b7a03f Jan 21, 2017

xuanyuanking mentioned this pull request Sep 4, 2018

[SPARK-25314][SQL] Fix Python UDF accessing attributes from both side of join in join conditions #22326

Closed
