
[SPARK-27699][SQL] Partially push down disjunctive predicates in Parquet/ORC #24598

Closed

Conversation

@gengliangwang (Member) commented May 14, 2019

What changes were proposed in this pull request?

Currently, in ParquetFilters and OrcFilters, if a child predicate of an Or operator can't be entirely pushed down, the whole predicate is thrown away.
In fact, the conjunctive predicates under Or operators can be partially pushed down.
For example, say a and b are convertible while c can't be pushed down; the predicate
a or (b and c)
can be converted to
(a or b) and (a or c)
and we can still push down (a or b).
A disjunctive predicate can't be pushed down only when one of its children is not even partially convertible.
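As an illustration, here is a minimal sketch of this rule over sources.Filter (not the actual ParquetFilters/OrcFilters code; `convertLeaf` is a hypothetical stand-in for the per-source conversion of a supported leaf filter):

```scala
import org.apache.spark.sql.sources._

// Minimal sketch of the partial pushdown rule described above.
// `convertLeaf` returns the pushable form of a supported leaf filter, or None.
def convert(
    filter: Filter,
    canPartialPushDown: Boolean,
    convertLeaf: Filter => Option[Filter]): Option[Filter] = filter match {
  case And(left, right) =>
    (convert(left, canPartialPushDown, convertLeaf),
      convert(right, canPartialPushDown, convertLeaf)) match {
      case (Some(l), Some(r)) => Some(And(l, r))
      // Dropping one side of an AND gives a weaker filter, which is only
      // acceptable when partial pushdown is allowed at this position.
      case (Some(l), None) if canPartialPushDown => Some(l)
      case (None, Some(r)) if canPartialPushDown => Some(r)
      case _ => None
    }
  case Or(left, right) =>
    // Each child of an OR may itself be partially converted:
    // a OR (b AND c) can be pushed as (a OR b), which the original implies.
    for {
      l <- convert(left, canPartialPushDown, convertLeaf)
      r <- convert(right, canPartialPushDown, convertLeaf)
    } yield Or(l, r)
  case Not(child) =>
    // Under NOT, a weaker child would make the overall predicate stronger,
    // so partial conversion must be disabled for the whole subtree.
    convert(child, canPartialPushDown = false, convertLeaf).map(c => Not(c))
  case leaf =>
    convertLeaf(leaf)
}
```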

This PR also improves the filter pushdown logic in DataSourceV2Strategy. With partial filter pushdown in the Or operator, the result of pushedFilters() might not exist in the mapping translatedFilterToExpr. To fix this, the PR changes translatedFilterToExpr to map only leaf sources.Filter nodes to catalyst filter expressions, and later rebuilds the whole expression from that mapping.
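For the DataSourceV2Strategy part, here is a hypothetical sketch of the rebuild step (the name rebuildExpression is illustrative, not the actual method in this PR): because the map holds only leaf filters, a partially pushed And/Or returned by pushedFilters() can still be mapped back to a catalyst expression by recombining its leaves.

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.sources

// Rebuild a catalyst Expression from a (possibly partially pushed) source
// filter, using a map that records only leaf filters.
def rebuildExpression(
    filter: sources.Filter,
    leafToExpr: mutable.HashMap[sources.Filter, expressions.Expression])
  : Option[expressions.Expression] = filter match {
  case sources.And(left, right) =>
    for {
      l <- rebuildExpression(left, leafToExpr)
      r <- rebuildExpression(right, leafToExpr)
    } yield expressions.And(l, r)
  case sources.Or(left, right) =>
    for {
      l <- rebuildExpression(left, leafToExpr)
      r <- rebuildExpression(right, leafToExpr)
    } yield expressions.Or(l, r)
  case sources.Not(child) =>
    rebuildExpression(child, leafToExpr).map(e => expressions.Not(e))
  case leaf =>
    leafToExpr.get(leaf) // recorded when the leaf was translated
}
```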

How was this patch tested?

Unit test

@SparkQA commented May 14, 2019

Test build #105380 has finished for PR 24598 at commit 490e00c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

@cloud-fan (Contributor)

looks reasonable to me.

@dongjoon-hyun (Member)

Thank you for pinging me, @gengliangwang. I'll take a look today~

@@ -529,9 +529,9 @@ private[parquet] class ParquetFilters(
case sources.Or(lhs, rhs) =>
for {
lhsFilter <-
createFilterHelper(nameToParquetField, lhs, canPartialPushDownConjuncts = false)
Member

Can we add some comments like we did in AND?

_ <- createBuilder(dataTypeMap, left, newBuilder, canPartialPushDownConjuncts = false)
_ <- createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts = false)
_ <- createBuilder(dataTypeMap, left, newBuilder, canPartialPushDownConjuncts = true)
_ <- createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts = true)
Member

Hi, @gengliangwang.
Is there a reason not to update the other OrcFilters? We should keep both consistent in order to support Hadoop 3 in Spark 3.0.0.

  • sql/core/v2.3.4/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala

Member

Also, cc @wangyum since he is actively working on Hadoop 3.2 support.

@@ -398,6 +387,55 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
}
}

test("SPARK-27699 Converting disjunctions into ORC SearchArguments") {
Member

ditto for this test case.

@SparkQA commented May 15, 2019

Test build #105397 has finished for PR 24598 at commit 38519d3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

retest this please.

// (a1 AND a2) OR (b1 AND b2),
// a1 and b1 are convertible, while a2 and b2 are not.
// The predicate can be converted to
// (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
Member

The rule is correct, but it looks a little dangerous to set true blindly without considering the given parameter canPartialPushDownConjuncts. Could you add a test case for a complex example like OR under NOT in a deep predicate tree?

Also, it would be great to add a higher-level test case in SQLQuerySuite.scala to show the benefit of this additional predicate pushdown a1 OR b1. Could you add that, too?

@gengliangwang (Member Author) commented May 15, 2019

> The rule is correct, but it looks a little dangerous to set true blindly without considering the given parameter canPartialPushDownConjuncts. Could you add a test case for a complex example like OR under NOT in a deep predicate tree?

@dongjoon-hyun nice catch. However, here the Not predicate won't have an Or or And child, because of the BooleanSimplification optimizer rule.
I have updated the code, but a new test case seems unnecessary: Not(a Or b) is rewritten as Not(a) And Not(b).
Also, I previously created a PR for pushing down the Not operator (as extra insurance), but it made things too complex: #22687
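For illustration, a quick way to observe this behaviour (assuming an active SparkSession named spark; the predicate is arbitrary):

```scala
// Illustrative only: the optimized logical plan should show the NOT pushed
// to the leaves (De Morgan), so Not never has an Or/And child by the time
// data source filters are generated.
val q = spark.range(10).where("NOT (id > 7 OR id < 2)")
q.explain(true)
```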

> Also, it would be great to add a higher-level test case in SQLQuerySuite.scala to show the benefit of this additional predicate pushdown a1 OR b1. Could you add that, too?

How can we verify that the predicate is pushed down? By matching the OrcScan and checking its pushedFilters? Currently only ORC V2 can be checked this way.

Contributor

The conversion from (a1 AND a2) OR (b1 AND b2) to (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2) always works, whether or not it is a top-level boolean expression.
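For reference, the equivalence is easy to check exhaustively; this snippet is only an illustration, not test code from the PR:

```scala
// Verify (a1 AND a2) OR (b1 AND b2) == (a1 OR b1) AND (a1 OR b2) AND
// (a2 OR b1) AND (a2 OR b2) over all 16 truth assignments.
val equivalent = (for {
  a1 <- Seq(true, false); a2 <- Seq(true, false)
  b1 <- Seq(true, false); b2 <- Seq(true, false)
} yield {
  val original  = (a1 && a2) || (b1 && b2)
  val rewritten = (a1 || b1) && (a1 || b2) && (a2 || b1) && (a2 || b2)
  original == rewritten
}).forall(identity)
assert(equivalent)
```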

Member

@gengliangwang. You can use explain with Console.withOut. What we need is to check PushedFilters: in the plan.

> How can we verify that the predicate is pushed down? By matching the OrcScan and checking its pushedFilters? Currently only ORC V2 can be checked this way.

@cloud-fan and @gengliangwang. Yes, of course, I already agreed that the rule is correct. I want a test case in complete end-to-end query form that shows the new benefit clearly. Could you please add a real use case you have met?
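A sketch of such a check (assuming a DataFrame df over the test data; the filter expression is an arbitrary example): explain prints with println, so Console.withOut can capture the output and the test can assert on the PushedFilters: entry.

```scala
import java.io.ByteArrayOutputStream

// Capture the explain output and look for the pushed filters.
val output = new ByteArrayOutputStream()
Console.withOut(output) {
  df.filter("id < 2 or (id > 5 and s = '7')").explain(true)
}
val planText = output.toString("UTF-8")
assert(planText.contains("PushedFilters:"))
```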

Contributor

+1 for an end-to-end test

@viirya (Member) left a comment

Could we apply this to org.apache.spark.sql.hive.orc.OrcFilters.createBuilder, too?

@SparkQA commented May 15, 2019

Test build #105406 has finished for PR 24598 at commit 38519d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 15, 2019

Test build #105409 has finished for PR 24598 at commit 3b88687.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang force-pushed the pushdownDisjunctivePredicates branch from c65e026 to caeb64d on May 15, 2019 12:23
@SparkQA commented May 15, 2019

Test build #105415 has finished for PR 24598 at commit c65e026.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented May 15, 2019

Test build #105417 has finished for PR 24598 at commit caeb64d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Some(sources.IsNull(a.name))
case expressions.IsNotNull(a: Attribute) =>
Some(sources.IsNotNull(a.name))
translateFilterWithMapping(predicate, None)
Member

Could you add more explanation of why this PR needs translateFilterWithMapping and translateLeafNodeFilter? Is this refactoring inevitable?

Member Author

This is inevitable. See my comment below or the last part of the PR description.

withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "") {
withTempPath { dir =>
spark.range(10).map(i => (i, i.toString)).toDF("id", "s").write.orc(dir.getCanonicalPath)
val df = spark.read.orc(dir.getCanonicalPath)
@dongjoon-hyun (Member) commented May 16, 2019

It would be better to have a Parquet test because Parquet is the default format in Apache Spark. But yes, I get it: you want to test it in a DSv2 way.

Member Author

We can have a follow-up after Parquet V2 is merged.

if (translated.isDefined) {
translatedFilterToExpr(translated.get) = filterExpr
} else {
val translated =
Member Author

With partial filter pushdown in the Or operator, the result of pushedFilters() might not exist in the mapping translatedFilterToExpr. To fix this, the PR changes translatedFilterToExpr to map only leaf sources.Filter nodes to catalyst expressions, and later rebuilds the whole expression from that mapping.

@dongjoon-hyun (Member) commented May 16, 2019

Yep. Actually, when I tested your PR before, I also noticed that the new Or didn't work in the end-to-end case. That's why I asked. I'll test more.

Thank you for making this PR work, @gengliangwang!

Member

BTW, it would be great to add a real comment in the code. It's non-trivial to figure out.

Member Author

Thanks for the testing. Appreciate it!

@SparkQA commented May 16, 2019

Test build #105458 has finished for PR 24598 at commit 5f412bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 16, 2019

Test build #105463 has finished for PR 24598 at commit b8cb843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// A map from translated data source filters to original catalyst filter expressions.
// A map from translated data source leaf node filters to original catalyst filter
// expressions. For a `And`/`Or` predicate, it is possible that the predicate is partially
// pushed down. This map can used to construct a catalyst filter expression from the input
Member

can used -> can be used?

@SparkQA commented May 17, 2019

Test build #105476 has finished for PR 24598 at commit 4d84060.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 17, 2019

Test build #105477 has finished for PR 24598 at commit 90b0b69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} yield filter
import org.apache.spark.sql.sources._

def convertibleFiltersHelper(
Contributor

It's weird to call a method "helper"; what does it do?

Member Author

This is a helper method for converting Filter to Expression recursively. It could also be named _convertibleFilters or convertibleFilters0...
Here it follows the naming of createFilterHelper in ParquetFilters.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in e39e97b May 17, 2019
cloud-fan pushed a commit that referenced this pull request Jul 20, 2020
… Join/Partitions

### What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through joins and for partition pruning.

It's a good improvement; however, converting all the predicates to CNF can lead to a very long result, even with grouping functions over expressions. For example, the following predicate
```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```
is converted into a very long query (130K characters) against the Hive metastore, and it fails with an error:
```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```
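As a rough illustration of the blow-up (not code from this PR): naively distributing an OR of n conjunctive clauses, each with k conjuncts, produces k^n conjuncts before any grouping or deduplication.

```scala
val n = 20 // disjuncts in the example predicate above
val k = 2  // conjuncts per disjunct (p0 = ... AND p1 = ...)
println(s"naive CNF size: ${math.pow(k, n).toLong} conjuncts") // 1048576
```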

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, as we did in #24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicate is always equal to or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>