-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions #29101
Conversation
This is still in progress. I will add more test cases and update the PR description. |
Test build #125821 has finished for PR 29101 at commit
|
Just a question; if this proposal works well, we don't need the fix, #29075 ? |
okay, thanks for the check, @gengliangwang |
This one is ready for review. CC @cloud-fan @wangyum @AngersZhuuuu @maropu |
Test build #125914 has finished for PR 29101 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Outdated
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Show resolved
Hide resolved
...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConvertibleFilterSuite.scala
Outdated
Show resolved
Hide resolved
Test build #125952 has finished for PR 29101 at commit
|
Test build #125953 has finished for PR 29101 at commit
|
Test build #125946 has finished for PR 29101 at commit
|
Test build #125954 has finished for PR 29101 at commit
|
Test build #125987 has finished for PR 29101 at commit
|
Test build #125988 has finished for PR 29101 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Outdated
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Outdated
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Outdated
Show resolved
Hide resolved
...t/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala
Outdated
Show resolved
Hide resolved
Test build #125990 has finished for PR 29101 at commit
|
Test build #125993 has finished for PR 29101 at commit
|
retest this please. |
LGTM. It's a much simpler and robust solution! |
Test build #126020 has finished for PR 29101 at commit
|
retest this please. |
Test build #126039 has finished for PR 29101 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Outdated
Show resolved
Hide resolved
Test build #126069 has finished for PR 29101 at commit
|
retest this please. |
Test build #126083 has finished for PR 29101 at commit
|
retest this please. |
Test build #126125 has finished for PR 29101 at commit
|
This PR itself looks okay, so just to check; have you checked that this PR can get the same performance gain?
|
@maropu I checked the output of the optimized query plan of the 3 queries and they are equivalent. I think the performance result should be consistent. |
The github action checks are all passed. We don't need to wait for jenkins. @wangyum can you do the final sign-off? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I have verified TPC-DS q13, TPC-DS q85 and TPC-H q19.
thanks, merging to master! |
Converting all the predicates into CNF may result in a very long predicate and the codegen become unnecessarily large. We should follow the approach of apache/spark#29101, which extracts all convertiable predicates gracefully. Author: Gengliang Wang <[email protected]> GitOrigin-RevId: 9ecedbd83f85b8235262c3de0bd83b4540cbc560
What changes were proposed in this pull request?
In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning.
It's a good improvement, however, converting all the predicates in CNF can lead to a very long result, even with grouping functions over expressions. For example, for the following predicate
will be converted into a long query(130K characters) in Hive metastore, and there will be error:
Essentially, we just need to traverse predicate and extract the convertible sub-predicates like what we did in #24598. There is no need to maintain the CNF result set.
Why are the changes needed?
A better implementation for pushing down disjunctive and complex predicates. The pushed down predicates is always equal or shorter than the CNF result.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests