Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions #29101

Closed
wants to merge 7 commits into from

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jul 14, 2020

What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning.

It's a good improvement, however, converting all the predicates in CNF can lead to a very long result, even with grouping functions over expressions. For example, for the following predicate

(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')

will be converted into a long query(130K characters) in Hive metastore, and there will be error:

javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...

Essentially, we just need to traverse predicate and extract the convertible sub-predicates like what we did in #24598. There is no need to maintain the CNF result set.

Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed down predicates is always equal or shorter than the CNF result.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@gengliangwang
Copy link
Member Author

This is still in progress. I will add more test cases and update the PR description.

@SparkQA
Copy link

SparkQA commented Jul 14, 2020

Test build #125821 has finished for PR 29101 at commit 25eb140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 15, 2020

Just a question; if this proposal works well, we don't need the fix, #29075 ?

@gengliangwang
Copy link
Member Author

Just a question; if this proposal works well, we don't need the fix, #29075 ?

@maropu Yes, this is a better solution to me.

@maropu
Copy link
Member

maropu commented Jul 15, 2020

okay, thanks for the check, @gengliangwang

@gengliangwang gengliangwang changed the title [WIP][SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions Jul 15, 2020
@gengliangwang
Copy link
Member Author

This one is ready for review. CC @cloud-fan @wangyum @AngersZhuuuu @maropu

@SparkQA
Copy link

SparkQA commented Jul 15, 2020

Test build #125914 has finished for PR 29101 at commit 97d9cd9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125952 has finished for PR 29101 at commit a051700.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125953 has finished for PR 29101 at commit 16c78d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125946 has finished for PR 29101 at commit 005f6fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125954 has finished for PR 29101 at commit 19fa997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125987 has finished for PR 29101 at commit bfa2bc4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125988 has finished for PR 29101 at commit cb16660.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125990 has finished for PR 29101 at commit 543bd69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2020

Test build #125993 has finished for PR 29101 at commit 452f3c9.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

@cloud-fan
Copy link
Contributor

LGTM. It's a much simpler and robust solution!

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126020 has finished for PR 29101 at commit 452f3c9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member

wangyum commented Jul 17, 2020

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126039 has finished for PR 29101 at commit 452f3c9.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126069 has finished for PR 29101 at commit d37c21c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member

wangyum commented Jul 18, 2020

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 18, 2020

Test build #126083 has finished for PR 29101 at commit d37c21c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 19, 2020

Test build #126125 has finished for PR 29101 at commit d37c21c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 20, 2020

This PR itself looks okay, so just to check; have you checked that this PR can get the same performance gain?

SQL | Before this PR | After this PR
--- | --- | ---
TPCDS 5T Q13 | 84s | 21s
TPCDS 5T q85 | 66s | 34s
TPCH 1T q19 | 37s | 32s

#28733 (comment)

@gengliangwang
Copy link
Member Author

@maropu I checked the output of the optimized query plan of the 3 queries and they are equivalent. I think the performance result should be consistent.
after.txt
before.txt

@cloud-fan
Copy link
Contributor

The github action checks are all passed. We don't need to wait for jenkins. @wangyum can you do the final sign-off?

Copy link
Member

@wangyum wangyum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have verified TPC-DS q13, TPC-DS q85 and TPC-H q19.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d0c83f3 Jul 20, 2020
mengtong-db pushed a commit to delta-io/delta that referenced this pull request Apr 6, 2021
Converting all the predicates into CNF may result in a very long predicate and the codegen become unnecessarily large.
We should follow the approach of apache/spark#29101, which extracts all convertiable predicates gracefully.

Author: Gengliang Wang <[email protected]>

GitOrigin-RevId: 9ecedbd83f85b8235262c3de0bd83b4540cbc560
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants