[SPARK-18589] [SQL] Fix Python UDF accessing attributes from both side of join #16581
cc @hvanhovell
Test build #71346 has finished for PR 16581.
Test build #71349 has finished for PR 16581.
    case e: SubqueryExpression =>
      // non-correlated subquery will be replaced as literal
      e.children.nonEmpty
    case e: Unevaluable => true
We need more documentation here on why this should be considered evaluable as a join condition.
For example, just looking at this code, I have no idea why Unevaluable is evaluable.
Unevaluable is not evaluable. This block tries to find a case that is not evaluable in a join, and then negates it with isEmpty. I have to admit that we should document this.
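The "find, then negate with isEmpty" logic described in this reply can be sketched in plain Python. This is a toy model, not Spark's Scala code: the `Expr` class, its `unevaluable` and `is_subquery` flags, and the function names are all hypothetical stand-ins for Catalyst expressions and the pattern match quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Expr:
    """Hypothetical stand-in for a Catalyst expression tree node."""
    name: str
    children: List["Expr"] = field(default_factory=list)
    unevaluable: bool = False   # models `case e: Unevaluable => true`
    is_subquery: bool = False   # models `case e: SubqueryExpression`


def find(expr: Expr, pred: Callable[[Expr], bool]) -> Optional[Expr]:
    """Return the first node (pre-order) that satisfies pred, else None."""
    if pred(expr):
        return expr
    for child in expr.children:
        hit = find(child, pred)
        if hit is not None:
            return hit
    return None


def blocks_join_evaluation(e: Expr) -> bool:
    """True for a node that cannot be evaluated inside a join condition."""
    if e.is_subquery:
        # A non-correlated subquery (no children) is replaced by a literal,
        # so only a correlated one blocks evaluation.
        return len(e.children) > 0
    return e.unevaluable


def can_evaluate_within_join(expr: Expr) -> bool:
    """Negate the search: evaluable only if no blocking node is found."""
    return find(expr, blocks_join_evaluation) is None


# A Python UDF reading one column from each side of the join.
py_udf = Expr("pyudf", [Expr("a"), Expr("b")], unevaluable=True)
print(can_evaluate_within_join(Expr("=", [py_udf, Expr("true")])))
print(can_evaluate_within_join(Expr("=", [Expr("a"), Expr("b")])))
```

The subquery case is checked before the unevaluable case, mirroring the order of the quoted `case` clauses, so a non-correlated subquery is still treated as evaluable even if it is also marked unevaluable.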
@@ -342,6 +342,14 @@ def test_udf_in_filter_on_top_of_outer_join(self):
        df = df.withColumn('b', udf(lambda x: 'x')(df.a))
        self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

    def test_udf_in_filter_on_top_of_join(self):
This should reference the JIRA number.
@@ -342,6 +342,15 @@ def test_udf_in_filter_on_top_of_outer_join(self):
        df = df.withColumn('b', udf(lambda x: 'x')(df.a))
        self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])

    def test_udf_in_filter_on_top_of_join(self):
done
Test build #71520 has finished for PR 16581.
Test build #3541 has started for PR 16581.
Test build #3542 has finished for PR 16581.
Test build #71671 has finished for PR 16581.
Test build #71681 has finished for PR 16581.
LGTM - merging to master.
I cannot backport it; could you open a PR for 2.1?
… of join

PythonUDF is unevaluable and cannot be used inside a join condition. Currently, the optimizer pushes a PythonUDF that accesses both sides of a join into the join condition, and the query then fails to plan. This PR fixes the issue by checking whether the expression is evaluable before pushing it into the Join.

Add a regression test.

Author: Davies Liu <[email protected]>

Closes #16581 from davies/pyudf_join.
Cherry-picked into 2.1 branch.
… of join in join conditions

## What changes were proposed in this pull request?

Thanks to bahchis for reporting this. As a follow-up to #16581, this PR fixes the case of a Python UDF that accesses attributes from both sides of a join inside the join condition.

## How was this patch tested?

Add regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes #22326 from xuanyuanking/SPARK-25314.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2a8cbfd)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
PythonUDF is unevaluable and cannot be used inside a join condition. Currently, the optimizer pushes a PythonUDF that accesses both sides of a join into the join condition, and the query then fails to plan.
This PR fixes the issue by checking whether the expression is evaluable before pushing it into the Join.
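As a rough illustration of that check, the sketch below splits the conjuncts of a filter sitting on top of an inner join. It is a simplified model under stated assumptions, not the optimizer's actual rule: `Predicate`, its `evaluable_in_join` flag, and `split_filter_over_inner_join` are hypothetical names, and only the inner-join case is modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Predicate:
    """Hypothetical filter conjunct."""
    text: str
    refs: FrozenSet[str]            # attribute names the predicate reads
    evaluable_in_join: bool = True  # False models a PythonUDF


def split_filter_over_inner_join(
    conjuncts: List[Predicate],
    left_attrs: FrozenSet[str],
    right_attrs: FrozenSet[str],
) -> Tuple[List[Predicate], List[Predicate], List[Predicate], List[Predicate]]:
    """Decide where each conjunct of a filter above an inner join may go."""
    push_left, push_right, join_cond, keep_in_filter = [], [], [], []
    for p in conjuncts:
        if p.refs <= left_attrs:
            push_left.append(p)        # reads only the left side
        elif p.refs <= right_attrs:
            push_right.append(p)       # reads only the right side
        elif p.evaluable_in_join:
            join_cond.append(p)        # reads both sides, safe inside the join
        else:
            keep_in_filter.append(p)   # reads both sides, must stay above the join
    return push_left, push_right, join_cond, keep_in_filter


left, right = frozenset({"a"}), frozenset({"b"})
conjuncts = [
    Predicate("a > 0", frozenset({"a"})),
    Predicate("a = b", frozenset({"a", "b"})),
    Predicate("pyudf(a, b)", frozenset({"a", "b"}), evaluable_in_join=False),
]
for bucket in split_filter_over_inner_join(conjuncts, left, right):
    print([p.text for p in bucket])
```

In this model the Python UDF predicate is the only conjunct that stays in the filter above the join, which is the behavior the description asks for.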
How was this patch tested?
Add a regression test.