[SPARK-20413] Add new query hint NO_COLLAPSE. #17708

ptkool · 2017-04-20T15:31:26Z

What changes were proposed in this pull request?

This PR proposes adding a new query hint called NO_COLLAPSE that can be used to prevent adjacent projections from being collapsed.
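
For readers unfamiliar with projection collapse, here is a minimal sketch of the idea in plain Python. It is a toy model and not Spark's Catalyst code: the tuple-based expression encoding and the names `substitute`, `collapse`, and `count_op` are all assumptions made for illustration. It shows that merging two adjacent projections inlines the inner alias everywhere it is referenced, so an expression that was written once can end up appearing several times.

```python
# Toy model only: expressions are nested tuples, not Spark Catalyst trees.
# ("col", name) reads a column; ("lit", v) is a literal;
# ("mul", a, b) and ("add", a, b) are arithmetic.

def substitute(expr, aliases):
    """Replace column references with the expressions they alias."""
    kind = expr[0]
    if kind == "col":
        return aliases.get(expr[1], expr)
    if kind == "lit":
        return expr
    return (kind,) + tuple(substitute(arg, aliases) for arg in expr[1:])


def collapse(outer, inner):
    """Merge two adjacent projections (dicts of output name -> expression)."""
    return {name: substitute(expr, inner) for name, expr in outer.items()}


def count_op(expr, op):
    """Count how many times operator `op` occurs in an expression tree."""
    if expr[0] in ("col", "lit"):
        return 0
    return (expr[0] == op) + sum(count_op(arg, op) for arg in expr[1:])


# Inner projection: QTY * 10 AS C1. Outer projection: C1 + 1, C1 + 2.
inner = {"C1": ("mul", ("col", "QTY"), ("lit", 10))}
outer = {
    "A": ("add", ("col", "C1"), ("lit", 1)),
    "B": ("add", ("col", "C1"), ("lit", 2)),
}

merged = collapse(outer, inner)
# The multiplication was written once but now occurs once per outer column.
multiplications = sum(count_op(e, "mul") for e in merged.values())
```

In this toy model a hint like NO_COLLAPSE would simply mean "do not call `collapse` on this pair of projections", which keeps the inner expression in one place.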

How was this patch tested?

Tested using unit tests, integration tests, and manual tests.

hvanhovell · 2017-04-20T16:53:28Z

ok to test

SparkQA · 2017-04-20T16:59:23Z

Test build #75995 has finished for PR 17708 at commit 3f1e6a1.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class NoCollapseHint(child: LogicalPlan) extends UnaryNode

hvanhovell

@ptkool thanks for submitting the PR. I am not sure this is the best way to avoid projection collapse. The problem is that this approach will also inhibit other optimizations from taking place.

hvanhovell · 2017-04-20T16:54:29Z

python/pyspark/sql/functions.py

@@ -466,6 +466,14 @@ def nanvl(col1, col2):
    return Column(sc._jvm.functions.nanvl(_to_java_column(col1), _to_java_column(col2)))


+@since(2.2)
+def no_collapse(df):
+    """Marks a DataFrame as small enough for use in broadcast joins."""


Doc is incorrect.

hvanhovell · 2017-04-20T16:59:15Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

-   * @group normal_funcs
-   * @since 1.5.0
-   */
+    * Marks a DataFrame as small enough for use in broadcast joins.


Please undo this change.

hvanhovell · 2017-04-20T17:00:15Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

  def broadcast[T](df: Dataset[T]): Dataset[T] = {
    Dataset[T](df.sparkSession, BroadcastHint(df.logicalPlan))(df.exprEnc)
  }

  /**
+    * Marks a DataFrame as small enough for use in broadcast joins.


Nit: the alignment is off by a space; it should be:

/**
 * Text...
 */

hvanhovell · 2017-04-20T17:00:48Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -387,6 +387,13 @@ case class BroadcastHint(child: LogicalPlan) extends UnaryNode {
 }

 /**
+ * A hint for the optimizer that we should not merge two projections.
+ */
+case class NoCollapseHint(child: LogicalPlan) extends UnaryNode {


Can you explain why we want this in the LogicalPlan level and not on the expression level?

The problem with this approach is that most other optimizations won't work with this, for example predicate push down.

I originally thought about putting it at the expression level, but ultimately decided it made more sense at the LogicalPlan node level, since the purpose was in fact to disrupt the optimizer. In some respects, it's meant to have the same effect as df.cache(), but without the caching. There may, in fact, be situations where predicate pushdown is not desired because the resulting condition would become complex and expensive to evaluate.

In Spark SQL, I think it also makes more sense to specify the hint at the derived table level, as opposed to a single expression. For instance,

SELECT SNO, PNO, C1 + 1, C1 + 2
FROM ( SELECT /*+ NO_COLLAPSE */ SNO, PNO, QTY * 10 AS C1 FROM T ) T

This is similar to the NO_MERGE query hint in Oracle, which prevents the query from being flattened.
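
The trade-off discussed in this thread (a plan-level barrier also stops unrelated rules such as predicate pushdown) can be sketched with a toy rewrite rule. This is an illustration under stated assumptions and not Spark's optimizer: the tuple-based plan encoding, the `"no_collapse"` node tag, and the function `push_filter` are hypothetical.

```python
# Toy plan nodes (hypothetical encoding, not Catalyst):
#   ("scan", table)
#   ("project", columns, child)
#   ("filter", predicate, child)
#   ("no_collapse", child)   # stands in for the proposed hint node

def push_filter(plan):
    """Push a filter below a projection.

    Assumes the predicate only references columns that the projection
    passes through unchanged, so the swap is always valid in this model.
    """
    if plan[0] == "filter":
        _, predicate, child = plan
        if child[0] == "project":
            _, columns, grandchild = child
            pushed = push_filter(("filter", predicate, grandchild))
            return ("project", columns, pushed)
        # Any other child, including the "no_collapse" barrier, is opaque
        # to this rule, so the filter stays where it is.
    return plan


projection = ("project", ["SNO", "PNO"], ("scan", "T"))

plain = push_filter(("filter", "SNO = 1", projection))
hinted = push_filter(("filter", "SNO = 1", ("no_collapse", projection)))
```

Here `plain` ends up with the filter directly above the scan, while `hinted` is returned unchanged because the rule has no case for the barrier node. An expression-level marker would not sit between the filter and the projection, which is the reviewer's point.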

hvanhovell · 2017-04-20T17:18:29Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+
+    comparePlans(
+      parsePlan("SELECT a FROM (SELECT /*+ NO_COLLAPSE */ * FROM t) t1"),
+      SubqueryAlias("t1", Hint("NO_COLLAPSE", Seq.empty, table("t").select(star())))


What are you testing here that is not covered by the other cases?

Actually, nothing. I will remove it.

SparkQA · 2017-04-20T18:19:07Z

Test build #76001 has finished for PR 17708 at commit 975cca5.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-20T22:38:23Z

Test build #76005 has finished for PR 17708 at commit 3986247.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-23T04:01:56Z

Based on the JIRA description, it sounds like we should not simply merge two Projects to avoid calling the same UDF multiple times, instead of adding a new logical plan node.
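
One way to read this suggestion is as a guard inside the collapse rule itself: merge two projections only when doing so would not duplicate a UDF call. The sketch below is a toy model of that guard, not a Spark patch; the tuple encoding, the `("udf", arg)` node, and the helper names are assumptions made for illustration.

```python
# Toy model: ("col", name), ("lit", value), ("add", a, b), and ("udf", arg),
# where ("udf", arg) stands for one opaque, expensive function call.

def references(expr, name):
    """Count references to column `name` inside an expression tree."""
    if expr[0] == "col":
        return int(expr[1] == name)
    if expr[0] == "lit":
        return 0
    return sum(references(arg, name) for arg in expr[1:])


def has_udf(expr):
    """True if the expression tree contains a UDF call anywhere."""
    if expr[0] in ("col", "lit"):
        return False
    return expr[0] == "udf" or any(has_udf(arg) for arg in expr[1:])


def substitute(expr, aliases):
    """Replace column references with the expressions they alias."""
    if expr[0] == "col":
        return aliases.get(expr[1], expr)
    if expr[0] == "lit":
        return expr
    return (expr[0],) + tuple(substitute(arg, aliases) for arg in expr[1:])


def collapse_if_cheap(outer, inner):
    """Merge projections unless that would duplicate a UDF call.

    Returns the merged projection, or None to keep both projections.
    """
    for name, definition in inner.items():
        uses = sum(references(e, name) for e in outer.values())
        if has_udf(definition) and uses > 1:
            return None
    return {n: substitute(e, inner) for n, e in outer.items()}


inner = {"C1": ("udf", ("col", "QTY"))}
twice = {"A": ("add", ("col", "C1"), ("lit", 1)),
         "B": ("add", ("col", "C1"), ("lit", 2))}
once = {"A": ("add", ("col", "C1"), ("lit", 1))}

kept_separate = collapse_if_cheap(twice, inner)   # None: C1 is used twice
merged = collapse_if_cheap(once, inner)           # safe to inline
```

Because the guard lives in a single rule, other rewrites are unaffected, which is the contrast with adding a new logical plan node.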

viirya · 2017-04-24T02:57:55Z

I have the same question Reynold asked on the mailing list. Doesn't common subexpression elimination already address this issue?
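
For context, common subexpression elimination means evaluating a repeated subexpression once per row and reusing the result. The sketch below illustrates the idea with a per-row cache in plain Python. It is not how Spark implements the feature; `slow_udf`, `evaluate`, and the tuple encoding are hypothetical names chosen for this example.

```python
# Toy model: ("col", name), ("lit", value), ("add", a, b), ("udf", arg).
calls = {"udf": 0}


def slow_udf(x):
    """Stand-in for an expensive user-defined function; counts its calls."""
    calls["udf"] += 1
    return x * 10


def evaluate(expr, row, cache):
    """Evaluate an expression tree, reusing results of identical subtrees."""
    if expr in cache:
        return cache[expr]
    kind = expr[0]
    if kind == "col":
        value = row[expr[1]]
    elif kind == "lit":
        value = expr[1]
    elif kind == "udf":
        value = slow_udf(evaluate(expr[1], row, cache))
    elif kind == "add":
        value = evaluate(expr[1], row, cache) + evaluate(expr[2], row, cache)
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    cache[expr] = value
    return value


# A collapsed projection in which the same UDF call appears in two columns.
shared = ("udf", ("col", "QTY"))
projection = [("add", shared, ("lit", 1)), ("add", shared, ("lit", 2))]

row = {"QTY": 3}
cache = {}  # one cache per row, so results never leak between rows
results = [evaluate(expr, row, cache) for expr in projection]
```

With the cache in place, `slow_udf` runs once for the row even though the collapsed projection mentions it twice, which is why the question matters for whether a new hint is needed.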

gatorsmile · 2017-06-14T21:48:15Z

Any update? Maybe we can close this PR for now?

ptkool · 2017-06-26T19:38:15Z

@gatorsmile I will run a few more tests to determine if subexpression elimination solves this issue.

gatorsmile · 2017-06-27T06:45:08Z

We are closing inactive PRs. After you run more tests, please reopen this one if you still hit the issue. Thanks!

Add new query hint NO_COLLAPSE.

3f1e6a1

ptkool changed the title ~~Add new query hint NO_COLLAPSE.~~ [SPARK-20413] Add new query hint NO_COLLAPSE. Apr 20, 2017

hvanhovell requested changes Apr 20, 2017

Resolve scalastyle errors.

3986247

ptkool force-pushed the no_collapse_query_hint branch from 1231585 to 3986247 on April 20, 2017 20:19

HyukjinKwon mentioned this pull request Jun 25, 2017

[INFRA] Close stale PRs #18417

Closed

asfgit closed this in b32bd00 Jun 27, 2017
