[SPARK-12720] [SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets #11283
Conversation
Test build #51597 has finished for PR 11283 at commit
```scala
    plan: Aggregate,
    expand: Expand,
    project: Project): String = {
  require(plan.groupingExpressions.length > 1)
```
nit: I think `assert` is better here; if it breaks, it means something has gone wrong in our system.
Will do. Thanks!
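The distinction behind this nit can be sketched in plain Scala. This is a hypothetical minimal example, not the actual `SQLBuilder` code; `groupingSetToSQL` and its string output are stand-ins for illustration only:

```scala
// `require` validates arguments supplied by an external caller and throws
// IllegalArgumentException; `assert` checks an internal invariant and throws
// AssertionError, signalling a bug in our own system -- which is the reviewer's
// point: this method is only reached via our own pattern match.
def groupingSetToSQL(groupingExprs: Seq[String]): String = {
  assert(groupingExprs.length > 1)
  groupingExprs.mkString("GROUPING SETS ((", ", ", "))")
}
```

Note that `assert` calls can be elided at compile time with `-Xdisable-assertions`, which is acceptable precisely because a failure here would indicate an internal bug rather than bad user input.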
LGTM except some minor comments, thanks for working on it! cc @liancheng
@cloud-fan Really thank you for your time and your detailed reviews!!! : )
```scala
@@ -107,6 +107,11 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi
    case p: Project =>
      projectToSQL(p, isDistinct = false)

    case a @ Aggregate(_, _, e @ Expand(_, _, p: Project))
        if sameOutput(e.output,
          p.child.output ++ a.groupingExpressions.map(_.asInstanceOf[Attribute])) =>
```
Sorry, missed this: we should check `isInstanceOf` before calling `asInstanceOf` directly. We can put all of it in one method and use it as the if condition.
Thank you for pointing it out! I will be more careful next time.
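The pattern the reviewer suggests, bundling the `isInstanceOf` test and the cast into one method usable as a guard condition, might look like the following sketch. The `Expression`/`Attribute` types here are simplified stand-ins, not the real Catalyst classes:

```scala
// Simplified stand-ins for Catalyst's expression hierarchy.
trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression

// One helper that performs the type test and the cast together, returning
// None when any expression is not an Attribute, so callers never need an
// unchecked asInstanceOf.
def groupingAttributes(groupingExprs: Seq[Expression]): Option[Seq[Attribute]] =
  if (groupingExprs.forall(_.isInstanceOf[Attribute]))
    Some(groupingExprs.map(_.asInstanceOf[Attribute]))
  else
    None
```

A caller can then match on `Some(attrs)` instead of casting inside a pattern guard.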
Test build #52433 has finished for PR 11283 at commit
Test build #52440 has finished for PR 11283 at commit
```scala
// a map from group by attributes to the original group by expressions.
val groupByAttrMap = AttributeMap(groupByAttributes.zip(groupByExprs))

val groupingSet = expand.projections.map { project =>
```
Nit: Would be nice to add a type annotation to `groupingSet` since it's a relatively complex nested data structure and can be hard to grasp.
Will do.
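The value of that annotation can be shown with a toy model of `Expand`: each projection corresponds to one grouping set, where the attributes null-ed out in that projection are the ones excluded from the set. This is a sketch with hypothetical stand-in types; the element type in the actual PR may differ:

```scala
// Simplified stand-in for a Catalyst attribute.
case class Attribute(name: String)

// Each inner Seq models one Expand projection; None marks an attribute
// that the projection replaces with null (i.e. it is not grouped on).
val projections: Seq[Seq[Option[Attribute]]] = Seq(
  Seq(Some(Attribute("key")), Some(Attribute("value"))), // set (key, value)
  Seq(Some(Attribute("key")), None),                     // set (key)
  Seq(None, None)                                        // set ()
)

// The explicit annotation documents the nested shape at the definition site,
// instead of forcing the reader to infer it from the map body.
val groupingSet: Seq[Seq[Attribute]] = projections.map(_.flatten)
```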
LGTM except a few comments. Thanks for doing this! Would you please run
```scala
  Grouping(groupingCol.get)
} else {
  throw new UnsupportedOperationException(s"unsupported operator $a")
}
```
The following version might be clearer:

```scala
val groupingCol = groupByExprs.applyOrElse(
  idx, (_: Int) => throw new UnsupportedOperationException(s"unsupported operator $a"))
Grouping(groupingCol)
```

And I don't quite get the meaning of the exception error message...
Here, if the value is out of bounds, I thought we should not continue the conversion. After rethinking this, users might call grouping_id() inside such a function. Maybe we should not throw any exception. How about changing it to

```scala
groupByExprs.lift(idx).map(Grouping).getOrElse(a)
```
Makes sense.
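The agreed-upon fix relies on `Seq.lift`, which turns positional access into an `Option` instead of throwing on an out-of-range index. A minimal sketch of that behavior, with placeholder expressions and a string standing in for the original plan node `a`:

```scala
// Hypothetical stand-in for the Catalyst Grouping expression.
case class Grouping(expr: String)

val groupByExprs = Seq("key % 5", "value")
val fallback = "a" // stands in for the original expression `a` in the PR

// lift(idx) yields Some(elem) for a valid index and None otherwise, so an
// out-of-range grouping_id argument falls back gracefully instead of
// aborting the whole SQL conversion with an exception.
def resolve(idx: Int): Any =
  groupByExprs.lift(idx).map(Grouping).getOrElse(fallback)
```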
```
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala
```
@liancheng I ran the suite on my local laptop. All the related tests work. However, I hit multiple regression failures that were introduced by another PR: #11466. I will submit a separate PR to fix the issues. Thank you for your reviews!
Test build #52508 has finished for PR 11283 at commit
Thanks for running the tests. Could you please name these regressions? I couldn't find them. The Jenkins PR builder currently looks good.
This LGTM now. Merging to master. Thanks!
A final thought about this PR: for SQL generation for grouping sets, we have to depend on a set of implicit assumptions tied to implementation details of specific analysis rules, which makes the implementation fragile. I think maybe we can add an auxiliary logical plan operator. I haven't thought thoroughly about this, but with the help of these annotations, I'd expect it to be easier to recognize patterns corresponding to plans / expressions like grouping sets, grouping, and multi-distinct aggregation.
Yeah, definitely, it helps a lot. Otherwise, `toSQL` needs to identify the pattern and convert it back using a few assumptions. The patterns and assumptions we made depend on the implementation of our analyzer rules. Before we finalize the design, I will first stop working on the SQL generation of
[SPARK-12720] [SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets

#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets. For example, a query using rollup:

```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```

Original logical plan:

```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46], [(count(1),mode=Complete,isDistinct=false) AS cnt#43L, (key#17L % cast(5 as bigint))#47L AS _c1#45L, grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0), List(key#17L, value#18, null, 1)], [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
   +- Project [key#17L, value#18, (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
      +- Subquery t1
         +- Relation[key#17L,value#18] ParquetRelation
```

Converted SQL:

```SQL
SELECT count( 1) AS `cnt`, (`t1`.`key` % CAST(5 AS BIGINT)), grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#11283 from gatorsmile/groupingSetsToSQL.
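The rollup-to-grouping-sets equivalence shown in the converted SQL above can be sketched independently of Catalyst: `WITH ROLLUP` over n expressions is equivalent to the n+1 grouping sets formed by successive prefixes, down to the empty set. The helper names below are hypothetical, for illustration only:

```scala
// ROLLUP (e1, ..., en) == GROUPING SETS ((e1, ..., en), ..., (e1), ())
def rollupToGroupingSets(exprs: Seq[String]): Seq[Seq[String]] =
  (exprs.length to 0 by -1).map(i => exprs.take(i))

// Render the sets in the textual form the PR produces.
def toSQL(sets: Seq[Seq[String]]): String =
  sets.map(_.mkString("(", ", ", ")")).mkString("GROUPING SETS (", ", ", ")")
```

For the single rollup expression in the example, this yields the two sets `((expr), ())`, matching the `GROUPING SETS (((...)), ())` clause in the converted SQL.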