Simplify aggregation push-down #153

liancheng · 2017-12-22T05:53:07Z

This PR tries to simplify/refactor the aggregation push-down code path in TiStrategy to improve readability and maintainability by:

1. Using a more idiomatic Scala coding style

   In some cases, this may introduce extra expression traversals. It shouldn't be an issue since the number of expressions is quite limited, but feel free to disagree.

2. Unifying Average push-down

   Instead of special-casing Average in groupAggregateProjection, we may want to convert Averages into Counts and Sums earlier, in the form of Catalyst expressions, so that we can remove the clumsy code paths involving avgPushdownRewriteMap, avgFinalRewriteMap, and building partial/final aggregate expressions.
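To make the Average rewrite concrete, here is a minimal sketch of the idea using a toy expression tree. The types below (Expr, Col, Sum, Count, Average, Divide) and the helpers rewriteAvg and pushable are hypothetical stand-ins written for illustration; they are not the Catalyst classes or the code in this PR.

```scala
object AvgRewriteSketch {
  // Hypothetical stand-ins for Catalyst expressions; not the real Spark classes.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Sum(child: Expr) extends Expr
  final case class Count(child: Expr) extends Expr
  final case class Average(child: Expr) extends Expr
  final case class Divide(left: Expr, right: Expr) extends Expr

  // Rewrite every Average(x) into Divide(Sum(x), Count(x)), recursing into children.
  def rewriteAvg(e: Expr): Expr = e match {
    case Average(c)   => Divide(Sum(rewriteAvg(c)), Count(rewriteAvg(c)))
    case Sum(c)       => Sum(rewriteAvg(c))
    case Count(c)     => Count(rewriteAvg(c))
    case Divide(l, r) => Divide(rewriteAvg(l), rewriteAvg(r))
    case c: Col       => c
  }

  // Collect the aggregates left after the rewrite; only these would be pushed down.
  def pushable(e: Expr): Seq[Expr] = e match {
    case s: Sum       => Seq(s)
    case c: Count     => Seq(c)
    case Divide(l, r) => (pushable(l) ++ pushable(r)).distinct
    case Average(_)   => throw new IllegalStateException("All AVGs should have been rewritten.")
    case _: Col       => Seq.empty
  }
}
```

Once the rewrite runs first, later stages only ever see Sum and Count, so they need no Average-specific branches beyond a sanity check.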

Opening this PR early for discussion and to make sure it makes Travis happy. If people would like to split this work into a few separate PRs for easier code review and future Git blaming, I'd be happy to do that.

liancheng · 2017-12-22T05:54:07Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

-        originalAggExpr.isDistinct,
-        originalAggExpr.resultId
-      )
-


This local function is replaced by inlined .copy() calls.
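As a rough illustration of why .copy() can stand in for a hand-written constructor helper, consider the sketch below. AggExpr is a hypothetical, simplified case class, and the assumption that only the mode field changes is mine; the removed local function may have changed other fields.

```scala
object CopySketch {
  // Hypothetical, simplified stand-in for Catalyst's AggregateExpression.
  final case class AggExpr(function: String, mode: String, isDistinct: Boolean, resultId: Long)

  // Only the field that changes is named; isDistinct and resultId carry over unchanged.
  def toFinal(original: AggExpr): AggExpr = original.copy(mode = "Final")
}
```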

liancheng · 2017-12-22T05:57:33Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

+    val toAlias: AggregateExpression => Alias = {
+      lazy val deterministicAggAliases = aggregateExpressions.collect {
+        case e if e.deterministic => e -> Alias(e.canonicalized, e.toString())()
+      }.toMap


Hmm, no need to be lazy, updating...

liancheng · 2017-12-22T05:58:52Z

src/main/scala/org/apache/spark/sql/TiContext.scala

@@ -35,7 +34,6 @@ class TiContext(val session: SparkSession) extends Serializable with Logging {

  TiUtils.sessionInitialize(session, tiSession)

-


Accidental whitespace changes introduced by scalafmt.

liancheng · 2017-12-22T06:29:59Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

+      request.addOrderByItem(
+        TiByItem.create(
+          BasicExpression.convertToTiExpr(order.child).get,
+          order.direction.sql.equalsIgnoreCase("DESC")


Ensured below that sortOrder will never be null.

liancheng · 2017-12-22T09:06:12Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

+        (aggregateExpressions ++ extraAggregateExpressions).distinct,
+        rewrittenResultExpressions,
+        child
+      )


Made TiAggregation always return results w/o any occurrences of Average. I think this is reasonable since the purpose of TiAggregation is to extract aggregate functions that can be pushed down to TiKV, while Average is not directly pushable.

Extracting the Average rewriting code path into a separate method also LGTM.

Returning the original aggregateExpressions, which may still contain an Average expression, and passing it directly to groupAggregateProjection (line 388) will cause an IllegalStateException.

I suggest removing the Average expression from the final aggregateExpressions.

My bad, nice catch!

liancheng · 2017-12-22T09:13:49Z

@Novemser This is reviewable now.

liancheng · 2017-12-22T09:30:08Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

-    // If sortOrder is not null, limit must be greater than 0
-    if (limit < 0 || (sortOrder == null && limit == 0)) {
+    // If sortOrder is empty, limit must be greater than 0
+    if (limit < 0 || (sortOrder.isEmpty && limit == 0)) {


Did I miss something here? I don't think SortOrder can be null in Spark.

Sorry, it's my fault. Originally, takeOrderedAndProject was called in collectLimit with an explicit null sortOrder. I removed that call in collectLimit but forgot to remove these checks here. Thanks for your dedicated review!

Novemser · 2017-12-25T06:40:33Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

+        case e: Sum     => aggExpr.copy(aggregateFunction = e.copy(child = partialResultRef))
+        case e: First   => aggExpr.copy(aggregateFunction = e.copy(child = partialResultRef))
+        case _: Count   => aggExpr.copy(aggregateFunction = Sum(partialResultRef))
+        case _: Average => throw new IllegalStateException("All AVGs should have been rewritten.")


It seems the logic that rewrites Average into Sum/Count doesn't work as expected.

Queries like:

select avg(any_col) from any_table

will trigger an IllegalStateException.
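For readers following the final-stage match in the diff above, the sketch below shows with plain numbers why a pushed-down Count is merged with a Sum and why Average has to be derived from the two. The Regions type and function names are hypothetical; each inner sequence models the rows seen by one storage region, which is an assumption made only for this example.

```scala
object PartialFinalSketch {
  // Hypothetical model: each inner Seq holds the values seen by one storage region.
  type Regions = Seq[Seq[Long]]

  // SUM: partial sums are summed again in the final stage.
  def finalSum(regions: Regions): Long = regions.map(_.sum).sum

  // COUNT: partial counts are summed, not counted.
  def finalCount(regions: Regions): Long = regions.map(_.size.toLong).sum

  // AVG has no mergeable partial result of its own, so it is derived from the two above.
  // Empty input is not handled in this sketch.
  def finalAvg(regions: Regions): Double = finalSum(regions).toDouble / finalCount(regions)
}
```

With two regions holding (1, 2, 3) and (10), counting the partial results would give 2, while summing the partial counts gives the correct 4.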

I think it's fixed now (but I cannot test it...).

Sorry for the inconvenience, I shall do tests for you.

On the other hand, I'm happy to code w/o writing any tests, LOL

Novemser

Need to check the Average rewrite logic according to the comment above.

liancheng · 2017-12-26T00:22:02Z

@Novemser Updated according to the comments. Thanks!

liancheng · 2017-12-26T00:28:30Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

-      }
+    def aliasPushedPartialResult(e: AggregateExpression): Alias = {
+      deterministicAggAliases.getOrElse(e, Alias(e, e.toString())())
+    }


@Novemser @ilovesoup The only non-deterministic aggregate function handled by TiStrategy is First, which doesn't seem to need special-casing here. Any counterexamples?

Seems true, currently I cannot find any counter example.

Do we want to remove this logic? Was trying to add comments to explain why determinism is a concern here and then realized that it's not...

There seems to be an issue with deterministicAggAliases. Originally, aliasMap was designed to eliminate duplicated aggregations, e.g.
SQL:

select count(col+1),count(1+col) from any_table

count(col + 1) and count(1 + col) should be reduced to a single canonicalized expression such as count(col + 1), and only that canonicalized expression should be pushed down to TiKV. The current implementation doesn't behave as the original design did (this can be verified by executing the SQL above).
A similar issue was discussed in #45.
@liancheng We may need some further discussion on this topic. : )

Ah, my bad, it should be:

deterministicAggAliases.getOrElse(e.canonicalized, Alias(e, e.toString())())

Very pleased to receive your response so fast. BTW, I think the determinism logic could be simplified as you suggested : )

I can open a separate PR for this change. This one is already too big to review and track.
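The canonicalized-key lookup discussed in this thread can be sketched roughly as follows. The mini-AST and the canonicalize, aliases, and aliasFor helpers are hypothetical; real Catalyst canonicalization does considerably more than reorder the operands of an addition.

```scala
object CanonicalAliasSketch {
  // Hypothetical mini-AST; not the Catalyst expression classes.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Count(child: Expr) extends Expr

  // Order the operands of the commutative Add so that col + 1 and 1 + col coincide.
  def canonicalize(e: Expr): Expr = e match {
    case Add(l, r) =>
      val ordered = Seq(canonicalize(l), canonicalize(r)).sortBy(_.toString)
      Add(ordered(0), ordered(1))
    case Count(c) => Count(canonicalize(c))
    case leaf     => leaf
  }

  // One alias per canonical form; the map is keyed by the canonicalized expression.
  def aliases(aggs: Seq[Expr]): Map[Expr, String] =
    aggs.map(canonicalize).distinct.zipWithIndex.map { case (e, i) => e -> s"agg_$i" }.toMap

  // The lookup must canonicalize too, otherwise count(1 + col) misses the entry.
  def aliasFor(e: Expr, table: Map[Expr, String]): Option[String] =
    table.get(canonicalize(e))
}
```

Looking up with the raw expression instead of its canonical form is the kind of mistake corrected above: the map has an entry, but the non-canonical key never hits it.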

Novemser · 2017-12-27T12:05:13Z

Test report:
TPCH
Tests run: 21 Tests succeeded: 21 Tests failed: 0 Tests skipped: 0

DAG
Tests run: 6841 Tests succeeded: 3766 Tests failed: 8 Tests skipped: 3067

Result is as expected

Novemser · 2017-12-27T12:09:48Z

This PR LGTM.

@zhexuany @ilovesoup PTAL

liancheng · 2017-12-27T18:37:15Z

src/main/scala/org/apache/spark/sql/TiStrategy.scala

+    projects
+      .map { _.toAttribute.name }
+      .map { TiColumnRef.create }
+      .foreach { dagReq.addRequiredColumn }


I see, I missed the addRequiredColumn call while cleaning up the original code path. Thanks for fixing it!

Not your problem. This code snippet was introduced into the master branch just a few hours ago via PR #143, and I merged that change into this branch.

liancheng · 2017-12-27T19:06:22Z

@Novemser I made a minor refactoring in my last commit, which is arguably better (in the sense that it eliminates mutable collections), depending on the coding style PingCAP prefers.

In Spark (or more specifically, Catalyst), we tend to avoid mutable state (var or mutable collections) whenever possible in non-critical code paths to minimize side effects. Part of the reason why I neglected the addRequiredColumn() call was that TiDAGRequest is mutable and it's hard to track which side effects have happened.

If TiDAGRequest were immutable and always returned a new instance when more query components are added, you would have to bind each newly created instance to a new variable (and hence a new name: dagReqWithRequiredColumns, dagReqWithFilters, etc.), and then it would be pretty hard to neglect part of the logic while refactoring.

Again, this is quite subjective and feel free to disagree and revert :)
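A minimal sketch of what such an immutable request could look like is shown below. DagRequest and its withRequiredColumns and withFilters methods are hypothetical and are not part of the actual TiDAGRequest API; columns and filters are modelled as plain strings purely for illustration.

```scala
object ImmutableRequestSketch {
  // Hypothetical immutable counterpart of a DAG request; not the real TiDAGRequest API.
  final case class DagRequest(
    requiredColumns: Vector[String] = Vector.empty,
    filters: Vector[String] = Vector.empty
  ) {
    // Each method returns a new instance and leaves the receiver untouched.
    def withRequiredColumns(cols: Seq[String]): DagRequest =
      copy(requiredColumns = requiredColumns ++ cols)
    def withFilters(fs: Seq[String]): DagRequest =
      copy(filters = filters ++ fs)
  }

  // Every step gets its own name, so a skipped step shows up as an unused value.
  def build(cols: Seq[String], fs: Seq[String]): DagRequest = {
    val dagReq = DagRequest()
    val dagReqWithRequiredColumns = dagReq.withRequiredColumns(cols)
    val dagReqWithFilters = dagReqWithRequiredColumns.withFilters(fs)
    dagReqWithFilters
  }
}
```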

Novemser · 2017-12-28T03:38:40Z

@liancheng I totally agree with you on avoiding mutable state here, and I really appreciate your elegant solution using immutable state. As for TiDAGRequest, I'm personally in favor of your idea; feel free to open another PR for that refactoring if you like : )

Thanks again for your dedicated work!

Novemser · 2017-12-28T03:40:02Z

I reran the tests; the results are as follows:

Test report:
TPCH
Tests run: 21 Tests succeeded: 21 Tests failed: 0 Tests skipped: 0

DAG
Tests run: 6841 Tests succeeded: 3766 Tests failed: 8 Tests skipped: 3067

All results are as expected

Novemser · 2017-12-28T03:43:03Z

BTW, it's interesting, and I'm glad to serve as a manual-only judge here, LOL

This PR still needs further review.

liancheng · 2018-01-02T02:42:09Z

Please let me know if there's anything that I'm expected to do to get this moving forward. One of the things I can think of is to split this single refactoring PR into smaller ones for easier tracking and reducing potential conflicts with other important outstanding PRs.

zhexuany · 2018-01-02T04:26:57Z

core/src/main/scala/org/apache/spark/sql/TiStrategy.scala

    filters: Seq[Expression],
    source: TiDBRelation,
    dagRequest: TiDAGRequest = new TiDAGRequest(pushDownType(), timeZoneOffset())
  ): TiDAGRequest = {
-    val tiFilters: Seq[TiExpr] = filters.collect { case BasicExpression(expr) => expr }
+    val tiFilters = filters.collect { case BasicExpression(expr) => expr }.asJava


Please add the type information back.

Addressed. Thanks!
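For context, an explicit conversion with the type annotation restored might look like the sketch below. TiExprStub is a hypothetical placeholder for the real TiExpr type. The sketch assumes the Scala 2.11/2.12 era of this PR, where scala.collection.JavaConverters is the standard import; on Scala 2.13 and later it still compiles but is deprecated in favour of scala.jdk.CollectionConverters.

```scala
object ConvertersSketch {
  import java.util.{List => JList}
  import scala.collection.JavaConverters._

  // Hypothetical stand-in for the filter expression type.
  final case class TiExprStub(sql: String)

  def toJavaFilters(filters: Seq[TiExprStub]): JList[TiExprStub] = {
    // The annotation documents that this value is a Java list, not a Scala Seq.
    val tiFilters: JList[TiExprStub] = filters.asJava
    tiFilters
  }
}
```

Unlike the implicit JavaConversions, the .asJava call marks the exact point where the collection crosses the Scala/Java boundary.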

zhexuany · 2018-01-02T04:30:43Z

core/src/main/scala/org/apache/spark/sql/TiStrategy.scala

-              order.direction.sql.equalsIgnoreCase("DESC")
-            )
+  private def addSortOrder(request: TiDAGRequest, sortOrder: Seq[SortOrder]): Unit =
+    sortOrder.foreach { order: SortOrder =>


Why did we delete the null check here? If sortOrder is null, foreach would throw an NPE.

I just saw that you changed null to Nil when calling this function.

Yea, please see this comment thread for more details.

Yes. Changing it to Nil means passing an empty list, so the null check is unnecessary.
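The point about Nil can be checked with a small sketch. Order and the buffer-based request are hypothetical simplifications; the real method works with Spark's SortOrder and TiDAGRequest.

```scala
object SortOrderSketch {
  import scala.collection.mutable

  // Hypothetical stand-in for Spark's SortOrder.
  final case class Order(column: String, descending: Boolean)

  // foreach over an empty Seq simply does nothing, so no null or emptiness guard is needed.
  def addSortOrder(request: mutable.Buffer[String], sortOrder: Seq[Order]): Unit =
    sortOrder.foreach { order =>
      val direction = if (order.descending) "DESC" else "ASC"
      request += (order.column + " " + direction)
    }
}
```

Callers that have no ordering pass Nil, and the request is left unchanged.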

liancheng · 2018-01-02T18:31:37Z

@Novemser @zhexuany Thanks for the detailed review!

Novemser · 2018-01-03T02:23:14Z

@liancheng We appreciate your excellent work! : )

Novemser added the work in progress label Dec 22, 2017

liancheng force-pushed the simplify branch from d6be34b to 73ab35b Compare December 22, 2017 08:56

liancheng changed the title ~~WIP: Simplify aggregation push-down~~ Simplify aggregation push-down Dec 22, 2017

liancheng force-pushed the simplify branch from 07e46b9 to c2ac222 Compare December 22, 2017 09:26

Novemser reviewed Dec 25, 2017

Novemser suggested changes Dec 25, 2017

Novemser added reviewable and removed work in progress labels Dec 27, 2017

liancheng force-pushed the simplify branch from 555566d to e3552a3 Compare December 27, 2017 05:47

liancheng added 12 commits December 26, 2017 21:48

Replace the newAggregate local method with .copy()

cffc6b0

Simplify aggregation push-down

cebd5a4

Remove unnecessary lazy val

465e56c

Styling changes and narrow down method scopes

16c8a8c

Use JavaConverters instead of JavaConversions

15ebea9

Renaming

f502758

Rewrite AVG functions inside TiAggregation

5d211cc

Remove the old AVG rewriting code path

dacc691

Check SortOrder for emptyness instead of null

47d7da3

Remove unsed imports

2a02aa7

Aggregate expression list extracted by TiAggregation should not conta…

d8b34b6

…in s

Canonicalize pushed aggregate expressions

1fd4aaa

Novemser added reviewable and removed work in progress labels Dec 27, 2017

Novemser previously approved these changes Dec 27, 2017

Beautify foreach loop

dcbf6b7

Resolve conflict

ef94384

Minor refactoring

ab9247b

Eliminate JavaConversions usage

b3662a2

Use JavaConverters instead for easier tracking Scala/Java collections conversion.

Merge with master

5d4ea07

zhexuany added LGT1 and removed LGT2 labels Dec 28, 2017

Resolve conflict

b5bae27

zhexuany reviewed Jan 2, 2018

Address PR comments

737fd44

zhexuany approved these changes Jan 2, 2018

zhexuany merged commit 93c3c27 into pingcap:master Jan 2, 2018

Novemser added the LGT2 label Jan 3, 2018

wfxxh pushed a commit to wanfangdata/tispark that referenced this pull request Jun 30, 2023

Simplify aggregation push-down logic (pingcap#153)

8d75ea6
