Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18589] [SQL] Fix Python UDF accessing attributes from both side of join #16581

Closed
wants to merge 5 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Jan 13, 2017

What changes were proposed in this pull request?

PythonUDF is unevaluable, which can not be used inside a join condition, currently the optimizer will push a PythonUDF which accessing both side of join into the join condition, then the query will fail to plan.

This PR fix this issue by checking the expression is evaluable or not before pushing it into Join.

How was this patch tested?

Add a regression test.

@davies
Copy link
Contributor Author

davies commented Jan 13, 2017

cc @hvanhovell

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71346 has finished for PR 16581 at commit 95d73fc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 14, 2017

Test build #71349 has finished for PR 16581 at commit c45126b.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

case e: SubqueryExpression =>
// non-correlated subquery will be replaced as literal
e.children.nonEmpty
case e: Unevaluable => true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need more documentation here on why should be considered evaluable as a join condition.

for example, just looking at this code i have no idea why Uneavaluable is evaluable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unevaluable is not evaluable. This block tries to find a case that is not evaluable in a join, and then negates it by isEmpty. I have to admit that we should document this.

@@ -342,6 +342,14 @@ def test_udf_in_filter_on_top_of_outer_join(self):
df = df.withColumn('b', udf(lambda x: 'x')(df.a))
self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

def test_udf_in_filter_on_top_of_join(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should reference jira number

@@ -342,6 +342,15 @@ def test_udf_in_filter_on_top_of_outer_join(self):
df = df.withColumn('b', udf(lambda x: 'x')(df.a))
self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

def test_udf_in_filter_on_top_of_join(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SparkQA
Copy link

SparkQA commented Jan 17, 2017

Test build #71520 has finished for PR 16581 at commit e4db820.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #3541 has started for PR 16581 at commit e4db820.

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #3542 has finished for PR 16581 at commit d6bba37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #71671 has finished for PR 16581 at commit d6bba37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71681 has finished for PR 16581 at commit f720c85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

hvanhovell commented Jan 21, 2017

LGTM - merging to master.

@hvanhovell
Copy link
Contributor

I cannot backport it, could open a PR for 2.1.

@asfgit asfgit closed this in 9b7a03f Jan 21, 2017
asfgit pushed a commit that referenced this pull request Jan 21, 2017
… of join

PythonUDF is unevaluable, which can not be used inside a join condition, currently the optimizer will push a PythonUDF which accessing both side of join into the join condition, then the query will fail to plan.

This PR fix this issue by checking the expression is evaluable  or not before pushing it into Join.

Add a regression test.

Author: Davies Liu <[email protected]>

Closes #16581 from davies/pyudf_join.
@davies
Copy link
Contributor Author

davies commented Jan 21, 2017

Cherry-picked into 2.1 branch.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… of join

## What changes were proposed in this pull request?

PythonUDF is unevaluable, which can not be used inside a join condition, currently the optimizer will push a PythonUDF which accessing both side of join into the join condition, then the query will fail to plan.

This PR fix this issue by checking the expression is evaluable  or not before pushing it into Join.

## How was this patch tested?

Add a regression test.

Author: Davies Liu <[email protected]>

Closes apache#16581 from davies/pyudf_join.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
… of join

## What changes were proposed in this pull request?

PythonUDF is unevaluable, which can not be used inside a join condition, currently the optimizer will push a PythonUDF which accessing both side of join into the join condition, then the query will fail to plan.

This PR fix this issue by checking the expression is evaluable  or not before pushing it into Join.

## How was this patch tested?

Add a regression test.

Author: Davies Liu <[email protected]>

Closes apache#16581 from davies/pyudf_join.
asfgit pushed a commit that referenced this pull request Sep 27, 2018
… of join in join conditions

## What changes were proposed in this pull request?

Thanks for bahchis reporting this. It is more like a follow up work for #16581, this PR fix the scenario of Python UDF accessing attributes from both side of join in join condition.

## How was this patch tested?

Add  regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes #22326 from xuanyuanking/SPARK-25314.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2a8cbfd)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Sep 27, 2018
… of join in join conditions

## What changes were proposed in this pull request?

Thanks for bahchis reporting this. It is more like a follow up work for #16581, this PR fix the scenario of Python UDF accessing attributes from both side of join in join condition.

## How was this patch tested?

Add  regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes #22326 from xuanyuanking/SPARK-25314.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
daspalrahul pushed a commit to daspalrahul/spark that referenced this pull request Sep 29, 2018
… of join in join conditions

## What changes were proposed in this pull request?

Thanks for bahchis reporting this. It is more like a follow up work for apache#16581, this PR fix the scenario of Python UDF accessing attributes from both side of join in join condition.

## How was this patch tested?

Add  regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes apache#22326 from xuanyuanking/SPARK-25314.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants