[SPARK-12720] [SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets #11283
Conversation
Test build #51597 has finished for PR 11283 at commit
```scala
    plan: Aggregate,
    expand: Expand,
    project: Project): String = {
  require(plan.groupingExpressions.length > 1)
```
nit: I think `assert` is better here; if it breaks, it means something has gone wrong in our system.
Will do. Thanks!
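The distinction behind this nit can be sketched in plain Scala. This is a hypothetical minimal example, not the actual `SQLBuilder` code; `groupingSetToSQL` and its string output are stand-ins for illustration only:

```scala
// `require` validates arguments supplied by an external caller and throws
// IllegalArgumentException; `assert` checks an internal invariant and throws
// AssertionError, signalling a bug in our own system -- which is the reviewer's
// point: this method is only reached via our own pattern match.
def groupingSetToSQL(groupingExprs: Seq[String]): String = {
  assert(groupingExprs.length > 1)
  groupingExprs.mkString("GROUPING SETS ((", ", ", "))")
}
```

Note that `assert` calls can be elided at compile time with `-Xdisable-assertions`, which is acceptable precisely because a failure here would indicate an internal bug rather than bad user input.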
LGTM except some minor comments, thanks for working on it! cc @liancheng
@cloud-fan Really thank you for your time and your detailed reviews!!! : )
```scala
@@ -107,6 +107,11 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi
    case p: Project =>
      projectToSQL(p, isDistinct = false)

    case a @ Aggregate(_, _, e @ Expand(_, _, p: Project))
        if sameOutput(e.output,
          p.child.output ++ a.groupingExpressions.map(_.asInstanceOf[Attribute])) =>
```
Sorry, missed this: we should check `isInstanceOf` before calling `asInstanceOf` directly. We can put all of it in one method and use it as the if condition.
Thank you for pointing it out! I will be more careful next time.
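The pattern the reviewer suggests, bundling the `isInstanceOf` test and the cast into one method usable as a guard condition, might look like the following sketch. The `Expression`/`Attribute` types here are simplified stand-ins, not the real Catalyst classes:

```scala
// Simplified stand-ins for Catalyst's expression hierarchy.
trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression

// One helper that performs the type test and the cast together, returning
// None when any expression is not an Attribute, so callers never need an
// unchecked asInstanceOf.
def groupingAttributes(groupingExprs: Seq[Expression]): Option[Seq[Attribute]] =
  if (groupingExprs.forall(_.isInstanceOf[Attribute]))
    Some(groupingExprs.map(_.asInstanceOf[Attribute]))
  else
    None
```

A caller can then match on `Some(attrs)` instead of casting inside a pattern guard.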
Test build #52433 has finished for PR 11283 at commit
Test build #52440 has finished for PR 11283 at commit
```scala
// a map from group by attributes to the original group by expressions.
val groupByAttrMap = AttributeMap(groupByAttributes.zip(groupByExprs))

val groupingSet = expand.projections.map { project =>
```
Nit: Would be nice to add a type annotation to `groupingSet` since it's a relatively complex nested data structure and can be hard to grasp.
Will do.
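The value of that annotation can be shown with a toy model of `Expand`: each projection corresponds to one grouping set, where the attributes null-ed out in that projection are the ones excluded from the set. This is a sketch with hypothetical stand-in types; the element type in the actual PR may differ:

```scala
// Simplified stand-in for a Catalyst attribute.
case class Attribute(name: String)

// Each inner Seq models one Expand projection; None marks an attribute
// that the projection replaces with null (i.e. it is not grouped on).
val projections: Seq[Seq[Option[Attribute]]] = Seq(
  Seq(Some(Attribute("key")), Some(Attribute("value"))), // set (key, value)
  Seq(Some(Attribute("key")), None),                     // set (key)
  Seq(None, None)                                        // set ()
)

// The explicit annotation documents the nested shape at the definition site,
// instead of forcing the reader to infer it from the map body.
val groupingSet: Seq[Seq[Attribute]] = projections.map(_.flatten)
```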
LGTM except a few comments. Thanks for doing this! Would you please run
```scala
  Grouping(groupingCol.get)
} else {
  throw new UnsupportedOperationException(s"unsupported operator $a")
}
```
The following version might be clearer:

```scala
val groupingCol = groupByExprs.applyOrElse(
  idx, (_: Int) => throw new UnsupportedOperationException(s"unsupported operator $a"))
Grouping(groupingCol)
```

And I don't quite get the meaning of the exception error message...
Here, if the value is out of bounds, I thought we should not continue the conversion. After rethinking this, users might call grouping_id() inside such a function. Maybe we should not throw any exception. How about changing it to

```scala
groupByExprs.lift(idx).map(Grouping).getOrElse(a)
```
Makes sense.
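The agreed-upon fix relies on `Seq.lift`, which turns positional access into an `Option` instead of throwing on an out-of-range index. A minimal sketch of that behavior, with placeholder expressions and a string standing in for the original plan node `a`:

```scala
// Hypothetical stand-in for the Catalyst Grouping expression.
case class Grouping(expr: String)

val groupByExprs = Seq("key % 5", "value")
val fallback = "a" // stands in for the original expression `a` in the PR

// lift(idx) yields Some(elem) for a valid index and None otherwise, so an
// out-of-range grouping_id argument falls back gracefully instead of
// aborting the whole SQL conversion with an exception.
def resolve(idx: Int): Any =
  groupByExprs.lift(idx).map(Grouping).getOrElse(fallback)
```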
```
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala
```
@liancheng I ran the suite on my local laptop. All the related tests work. However, I hit multiple regression failures that were introduced by another PR: #11466. I will submit a separate PR to fix the issues. Thank you for your reviews!
Test build #52508 has finished for PR 11283 at commit
Thanks for running the tests. Could you please name these regressions? I couldn't find them. The Jenkins PR builder currently looks good.
This LGTM now. Merging to master. Thanks!
A final thought about this PR: for SQL generation for grouping sets, we have to depend on a set of implicit assumptions tied to implementation details of specific analysis rules, which makes the implementation fragile. I think maybe we can add an auxiliary logical plan operator. I haven't thought thoroughly about this, but with the help of these annotations, I'd expect it to be easier to recognize patterns corresponding to plans / expressions like grouping sets, grouping, and multi-distinct aggregation.
Yeah, definitely, it helps a lot. Otherwise, `toSQL` needs to identify the pattern and convert it back using a few assumptions. The patterns and assumptions we made depend on the implementation of our analyzer rules. Before we finalize the design, I will first stop working on the SQL generation of
[SPARK-12720] [SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets

#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets. For example, a query using rollup:

```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```

Original logical plan:

```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46], [(count(1),mode=Complete,isDistinct=false) AS cnt#43L, (key#17L % cast(5 as bigint))#47L AS _c1#45L, grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0), List(key#17L, value#18, null, 1)], [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
   +- Project [key#17L, value#18, (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
      +- Subquery t1
         +- Relation[key#17L,value#18] ParquetRelation
```

Converted SQL:

```SQL
SELECT count( 1) AS `cnt`, (`t1`.`key` % CAST(5 AS BIGINT)), grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#11283 from gatorsmile/groupingSetsToSQL.
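The rollup-to-grouping-sets equivalence shown in the converted SQL above can be sketched independently of Catalyst: `WITH ROLLUP` over n expressions is equivalent to the n+1 grouping sets formed by successive prefixes, down to the empty set. The helper names below are hypothetical, for illustration only:

```scala
// ROLLUP (e1, ..., en) == GROUPING SETS ((e1, ..., en), ..., (e1), ())
def rollupToGroupingSets(exprs: Seq[String]): Seq[Seq[String]] =
  (exprs.length to 0 by -1).map(i => exprs.take(i))

// Render the sets in the textual form the PR produces.
def toSQL(sets: Seq[Seq[String]]): String =
  sets.map(_.mkString("(", ", ", ")")).mkString("GROUPING SETS (", ", ", ")")
```

For the single rollup expression in the example, this yields the two sets `((expr), ())`, matching the `GROUPING SETS (((...)), ())` clause in the converted SQL.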