[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions

### What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through joins and partition pruning. It is a good improvement; however, converting all the predicates to CNF can lead to a very long result, even when grouping functions are applied over the expressions. For example, the following predicate

```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```

is converted into a very long query (130K characters) against the Hive metastore, which then fails with the error:

```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we only need to traverse the predicate and extract the convertible sub-predicates, as was done in #24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

This is a better implementation for pushing down disjunctive and complex predicates. The pushed-down predicate is never longer than the CNF result.
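
The traversal described above can be illustrated with a small standalone sketch. It is not the Spark code: `Expr`, `Leaf`, `And`, `Or`, `extractWithin` and `cnfClauseCount` are hypothetical stand-ins for Catalyst's `Expression` classes and `extractPredicatesWithinOutputSet`, and column sets are modelled as plain `Set[String]` rather than `AttributeSet`. The clause count is for a naive CNF expansion without the grouping that the removed code applied, so it only hints at the growth and does not reproduce the 130K-character figure.

```scala
// Simplified model of the extraction logic; these are NOT Spark classes.
object PushdownSketch {
  sealed trait Expr { def references: Set[String] }

  // A leaf predicate that reads exactly one column, e.g. Leaf("a1").
  final case class Leaf(column: String) extends Expr {
    def references: Set[String] = Set(column)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }
  final case class Or(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Keep the part of `condition` that only reads columns in `outputSet`.
  def extractWithin(condition: Expr, outputSet: Set[String]): Option[Expr] =
    condition match {
      // And: either side may be dropped, because the result only has to be
      // implied by the original condition.
      case And(l, r) =>
        (extractWithin(l, outputSet), extractWithin(r, outputSet)) match {
          case (Some(a), Some(b)) => Some(And(a, b))
          case (a, b) => a.orElse(b)
        }
      // Or: both sides must yield something, otherwise nothing can be kept.
      case Or(l, r) =>
        for {
          a <- extractWithin(l, outputSet)
          b <- extractWithin(r, outputSet)
        } yield Or(a, b)
      case leaf =>
        if (leaf.references.subsetOf(outputSet)) Some(leaf) else None
    }

  // Number of clauses produced by a naive CNF expansion (no grouping, no dedup).
  def cnfClauseCount(e: Expr): BigInt = e match {
    case And(l, r) => cnfClauseCount(l) + cnfClauseCount(r)
    case Or(l, r) => cnfClauseCount(l) * cnfClauseCount(r)
    case _ => BigInt(1)
  }
}
```

For `(a1 AND a2) OR (b1 AND b2)` with the column set `{a1, b1}`, `extractWithin` returns `Some(Or(Leaf("a1"), Leaf("b1")))`, while `cnfClauseCount` reports 4 clauses for the same condition. In this model, a disjunction of 20 two-term conjunctions over `p0` and `p1` is returned unchanged when both columns are in the set, whereas the naive CNF count is 2^20.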

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
apache · Jul 20, 2020 · d0c83f3
1 parent c2afe1c
commit d0c83f3
Showing 10 changed files with 208 additions and 324 deletions.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
@@ -202,125 +202,50 @@ trait PredicateHelper extends Logging {
   }
 
   /**
-   * Convert an expression into conjunctive normal form.
-   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
-   * CNF can explode exponentially in the size of the input expression when converting [[Or]]
-   * clauses. Use a configuration [[SQLConf.MAX_CNF_NODE_COUNT]] to prevent such cases.
-   *
-   * @param condition to be converted into CNF.
-   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
-   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.
+   * Returns a filter whose references are a subset of `outputSet` and that keeps as many
+   * constraints from `condition` as possible. This is used for predicate pushdown.
+   * When there is no such filter, `None` is returned.
    */
-  protected def conjunctiveNormalForm(
+  protected def extractPredicatesWithinOutputSet(
       condition: Expression,
-      groupExpsFunc: Seq[Expression] => Seq[Expression]): Seq[Expression] = {
-    val postOrderNodes = postOrderTraversal(condition)
-    val resultStack = new mutable.Stack[Seq[Expression]]
-    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
-    // Bottom up approach to get CNF of sub-expressions
-    while (postOrderNodes.nonEmpty) {
-      val cnf = postOrderNodes.pop() match {
-        case _: And =>
-          val right = resultStack.pop()
-          val left = resultStack.pop()
-          left ++ right
-        case _: Or =>
-          // For each side, there is no need to expand predicates of the same references.
-          // So here we can aggregate predicates of the same qualifier as one single predicate,
-          // for reducing the size of pushed down predicates and corresponding codegen.
-          val right = groupExpsFunc(resultStack.pop())
-          val left = groupExpsFunc(resultStack.pop())
-          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
-          if (left.size * right.size > maxCnfNodeCount) {
-            logInfo(s"As the result size exceeds the threshold $maxCnfNodeCount. " +
-              "The CNF conversion is skipped and returning Seq.empty now. To avoid this, you can " +
-              s"raise the limit ${SQLConf.MAX_CNF_NODE_COUNT.key}.")
-            return Seq.empty
-          } else {
-            for { x <- left; y <- right } yield Or(x, y)
-          }
-        case other => other :: Nil
+      outputSet: AttributeSet): Option[Expression] = condition match {
+    case And(left, right) =>
+      val leftResultOptional = extractPredicatesWithinOutputSet(left, outputSet)
+      val rightResultOptional = extractPredicatesWithinOutputSet(right, outputSet)
+      (leftResultOptional, rightResultOptional) match {
+        case (Some(leftResult), Some(rightResult)) => Some(And(leftResult, rightResult))
+        case (Some(leftResult), None) => Some(leftResult)
+        case (None, Some(rightResult)) => Some(rightResult)
+        case _ => None
       }
-      resultStack.push(cnf)
-    }
-    if (resultStack.length != 1) {
-      logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
-        "be something wrong with CNF conversion.")
-      return Seq.empty
-    }
-    resultStack.top
-  }
-
-  /**
-   * Convert an expression to conjunctive normal form when pushing predicates through Join,
-   * when expand predicates, we can group by the qualifier avoiding generate unnecessary
-   * expression to control the length of final result since there are multiple tables.
-   *
-   * @param condition condition need to be converted
-   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
-   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.
-   */
-  def CNFWithGroupExpressionsByQualifier(condition: Expression): Seq[Expression] = {
-    conjunctiveNormalForm(condition, (expressions: Seq[Expression]) =>
-        expressions.groupBy(_.references.map(_.qualifier)).map(_._2.reduceLeft(And)).toSeq)
-  }
-
-  /**
-   * Convert an expression to conjunctive normal form for predicate pushdown and partition pruning.
-   * When expanding predicates, this method groups expressions by their references for reducing
-   * the size of pushed down predicates and corresponding codegen. In partition pruning strategies,
-   * we split filters by [[splitConjunctivePredicates]] and partition filters by judging if it's
-   * references is subset of partCols, if we combine expressions group by reference when expand
-   * predicate of [[Or]], it won't impact final predicate pruning result since
-   * [[splitConjunctivePredicates]] won't split [[Or]] expression.
-   *
-   * @param condition condition need to be converted
-   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
-   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.
-   */
-  def CNFWithGroupExpressionsByReference(condition: Expression): Seq[Expression] = {
-    conjunctiveNormalForm(condition, (expressions: Seq[Expression]) =>
-        expressions.groupBy(e => AttributeSet(e.references)).map(_._2.reduceLeft(And)).toSeq)
-  }
 
-  /**
-   * Iterative post order traversal over a binary tree built by And/Or clauses with two stacks.
-   * For example, a condition `(a And b) Or c`, the postorder traversal is
-   * (`a`,`b`, `And`, `c`, `Or`).
-   * Following is the complete algorithm. After step 2, we get the postorder traversal in
-   * the second stack.
-   * 1. Push root to first stack.
-   * 2. Loop while first stack is not empty
-   *    2.1 Pop a node from first stack and push it to second stack
-   *    2.2 Push the children of the popped node to first stack
-   *
-   * @param condition to be traversed as binary tree
-   * @return sub-expressions in post order traversal as a stack.
-   *         The first element of result stack is the leftmost node.
-   */
-  private def postOrderTraversal(condition: Expression): mutable.Stack[Expression] = {
-    val stack = new mutable.Stack[Expression]
-    val result = new mutable.Stack[Expression]
-    stack.push(condition)
-    while (stack.nonEmpty) {
-      val node = stack.pop()
-      node match {
-        case Not(a And b) => stack.push(Or(Not(a), Not(b)))
-        case Not(a Or b) => stack.push(And(Not(a), Not(b)))
-        case Not(Not(a)) => stack.push(a)
-        case a And b =>
-          result.push(node)
-          stack.push(a)
-          stack.push(b)
-        case a Or b =>
-          result.push(node)
-          stack.push(a)
-          stack.push(b)
-        case _ =>
-          result.push(node)
+    // The Or predicate is convertible when both of its children can be pushed down.
+    // That is to say, if both children can be at least partially pushed down, the Or
+    // predicate can be partially pushed down as well.
+    //
+    // Here is an example used to explain the reason.
+    // Let's say we have
+    // condition: (a1 AND a2) OR (b1 AND b2),
+    // outputSet: AttributeSet(a1, b1)
+    // a1 and b1 are convertible, while a2 and b2 are not.
+    // The predicate can be converted as
+    // (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
+    // As per the logic in the And predicate, we can push down (a1 OR b1).
+    case Or(left, right) =>
+      for {
+        lhs <- extractPredicatesWithinOutputSet(left, outputSet)
+        rhs <- extractPredicatesWithinOutputSet(right, outputSet)
+      } yield Or(lhs, rhs)
+
+    // Here we assume all the `Not` operators are already below all the `And` and `Or` operators
+    // after the optimization rule `BooleanSimplification`, so that we don't need to handle the
+    // `Not` operators here.
+    case other =>
+      if (other.references.subsetOf(outputSet)) {
+        Some(other)
+      } else {
+        None
       }
-    }
-    result
   }
 }
 

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -51,8 +51,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
   override protected val excludedOnceBatches: Set[String] =
     Set(
       "PartitionPruning",
-      "Extract Python UDFs",
-      "Push CNF predicate through join")
+      "Extract Python UDFs")
 
   protected def fixedPoint =
     FixedPoint(
@@ -123,8 +122,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
         rulesWithoutInferFiltersFromConstraints: _*) ::
       // Set strategy to Once to avoid pushing filter every time because we do not change the
       // join condition.
-      Batch("Push CNF predicate through join", Once,
-        PushCNFPredicateThroughJoin) :: Nil
+      Batch("Push extra predicate through join", fixedPoint,
+        PushExtraPredicateThroughJoin,
+        PushDownPredicates) :: Nil
     }
 
     val batches = (Batch("Eliminate Distinct", Once, EliminateDistinct) ::

diff --git a/...timizer/PushCNFPredicateThroughJoin.scala → ...mizer/PushExtraPredicateThroughJoin.scala b/...timizer/PushCNFPredicateThroughJoin.scala → ...mizer/PushExtraPredicateThroughJoin.scala
@@ -17,18 +17,20 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.apache.spark.sql.catalyst.expressions.{And, PredicateHelper}
+import org.apache.spark.sql.catalyst.expressions.{And, Expression, PredicateHelper}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
 
 /**
- * Try converting join condition to conjunctive normal form expression so that more predicates may
- * be able to be pushed down.
+ * Try pushing down disjunctive join condition into left and right child.
  * To avoid expanding the join condition, the join condition will be kept in the original form even
  * when predicate pushdown happens.
  */
-object PushCNFPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
+object PushExtraPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
+
+  private val processedJoinConditionTag = TreeNodeTag[Expression]("processedJoinCondition")
 
   private def canPushThrough(joinType: JoinType): Boolean = joinType match {
     case _: InnerLike | LeftSemi | RightOuter | LeftOuter | LeftAnti | ExistenceJoin(_) => true
@@ -38,22 +40,28 @@ object PushCNFPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelpe
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case j @ Join(left, right, joinType, Some(joinCondition), hint)
         if canPushThrough(joinType) =>
-      val predicates = CNFWithGroupExpressionsByQualifier(joinCondition)
-      if (predicates.isEmpty) {
+      val alreadyProcessed = j.getTagValue(processedJoinConditionTag).exists { condition =>
+        condition.semanticEquals(joinCondition)
+      }
+
+      lazy val filtersOfBothSide = splitConjunctivePredicates(joinCondition).filter { f =>
+        f.deterministic && f.references.nonEmpty &&
+          !f.references.subsetOf(left.outputSet) && !f.references.subsetOf(right.outputSet)
+      }
+      lazy val leftExtraCondition =
+        filtersOfBothSide.flatMap(extractPredicatesWithinOutputSet(_, left.outputSet))
+      lazy val rightExtraCondition =
+        filtersOfBothSide.flatMap(extractPredicatesWithinOutputSet(_, right.outputSet))
+
+      if (alreadyProcessed || (leftExtraCondition.isEmpty && rightExtraCondition.isEmpty)) {
         j
       } else {
-        val pushDownCandidates = predicates.filter(_.deterministic)
-        lazy val leftFilterConditions =
-          pushDownCandidates.filter(_.references.subsetOf(left.outputSet))
-        lazy val rightFilterConditions =
-          pushDownCandidates.filter(_.references.subsetOf(right.outputSet))
-
         lazy val newLeft =
-          leftFilterConditions.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
+          leftExtraCondition.reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
         lazy val newRight =
-          rightFilterConditions.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
+          rightExtraCondition.reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
 
-        joinType match {
+        val newJoin = joinType match {
           case _: InnerLike | LeftSemi =>
             Join(newLeft, newRight, joinType, Some(joinCondition), hint)
           case RightOuter =>
@@ -63,6 +71,8 @@ object PushCNFPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelpe
           case other =>
             throw new IllegalStateException(s"Unexpected join type: $other")
         }
-      }
+        newJoin.setTagValue(processedJoinConditionTag, joinCondition)
+        newJoin
+    }
   }
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -545,19 +545,6 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
-  val MAX_CNF_NODE_COUNT =
-    buildConf("spark.sql.optimizer.maxCNFNodeCount")
-      .internal()
-      .doc("Specifies the maximum allowable number of conjuncts in the result of CNF " +
-        "conversion. If the conversion exceeds the threshold, an empty sequence is returned. " +
-        "For example, CNF conversion of (a && b) || (c && d) generates " +
-        "four conjuncts (a || c) && (a || d) && (b || c) && (b || d).")
-      .version("3.1.0")
-      .intConf
-      .checkValue(_ >= 0,
-        "The depth of the maximum rewriting conjunction normal form must be positive.")
-      .createWithDefault(128)
-
   val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
     .internal()
     .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
@@ -2954,8 +2941,6 @@ class SQLConf extends Serializable with Logging {
 
   def constraintPropagationEnabled: Boolean = getConf(CONSTRAINT_PROPAGATION_ENABLED)
 
-  def maxCnfNodeCount: Int = getConf(MAX_CNF_NODE_COUNT)
-
   def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS)
 
   def fileCompressionFactor: Double = getConf(FILE_COMPRESSION_FACTOR)