[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation #11583
Conversation
…ons and at best doesn't do much data reduction.
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Test build #52681 has finished for PR 11583 at commit
Test build #52682 has finished for PR 11583 at commit
@aray Thank you for working on it. The results look very cool! I may not have time to review this PR this week. I will try to find time next week to take a look.
@yhuai do you have time this week to look at this patch?
Sorry. I will review this one this week.
override lazy val inputAggBufferAttributes: Seq[AttributeReference] =
  aggBufferAttributes.map(_.newInstance())

override lazy val inputTypes: Seq[AbstractDataType] = children.map(_.dataType)
How about we use inputTypes to ask the analyzer to do type casting. So, if there is a value column that has an invalid data type, the analyzer will complain.
I'm not sure what you mean by this, but no casting is needed.
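For context on the mechanism the reviewer is pointing at: a Catalyst expression that declares inputTypes (via the ExpectsInputTypes trait) gets its children checked by the analyzer, which reports a type error on mismatch; mixing in ImplicitCastInputTypes would additionally insert casts. A minimal sketch against the Catalyst internals of that era; DoubleIt is a made-up expression, not part of this PR:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpectsInputTypes, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType}

// Hypothetical expression: declaring inputTypes lets the analyzer reject
// (or, with ImplicitCastInputTypes, coerce) a child of the wrong type,
// instead of the expression failing at execution time.
case class DoubleIt(child: Expression) extends UnaryExpression
    with ExpectsInputTypes with CodegenFallback {
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[Double] * 2
}
```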
* Remove threshold of 10 for pivot values in Analyzer
* Change updateFunction into a partial function so support can be checked without try/catch (sketched below)
* Scaladoc for PivotFirst
* Move children, inputTypes, nullable, and dataType to beginning
* Added comments
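The partial-function change in the second bullet refers to a pattern like the following. This is a sketch only: Array[Any] stands in for the real aggregation buffer, and the type cases are illustrative rather than the PR's exact list.

```scala
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, LongType}

// Sketch: one update routine per supported value data type. Because this is
// a PartialFunction, callers can probe support with isDefinedAt rather than
// invoking it and catching a MatchError in a try/catch.
val updateFunction: PartialFunction[DataType, (Array[Any], Int, Any) => Unit] = {
  case DoubleType  => (buf, i, v) => buf(i) = v.asInstanceOf[Double]
  case IntegerType => (buf, i, v) => buf(i) = v.asInstanceOf[Int]
  case LongType    => (buf, i, v) => buf(i) = v.asInstanceOf[Long]
}

def supportsDataType(dt: DataType): Boolean = updateFunction.isDefinedAt(dt)
```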
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Test build #56095 has finished for PR 11583 at commit
Test build #56103 has finished for PR 11583 at commit
@yhuai I've addressed all your comments, ready for you to take another look. Sorry for the delay.
@yhuai can we get this merged for 2.0?
test this please
override lazy val aggBufferAttributes: Seq[AttributeReference] =
  pivotIndex.toList.sortBy(_._2).map(kv => AttributeReference(kv._1.toString, valueDataType)())
How about we avoid using lazy val for aggBufferAttributes, aggBufferSchema, and inputAggBufferAttributes?
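The follow-up PR (#12861, below) made exactly this change. Since these sequences are fully determined by the constructor arguments, plain vals avoid the per-access initialization check and synchronization that lazy vals carry. Roughly:

```scala
// Before (this PR): initialized on first access, with lazy-val locking.
override lazy val aggBufferAttributes: Seq[AttributeReference] =
  pivotIndex.toList.sortBy(_._2).map(kv => AttributeReference(kv._1.toString, valueDataType)())

// After (follow-up #12861): computed eagerly at construction time.
override val aggBufferAttributes: Seq[AttributeReference] =
  pivotIndex.toList.sortBy(_._2).map(kv => AttributeReference(kv._1.toString, valueDataType)())
```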
@aray This PR looks good. I will merge this after it passes tests. Can you send out a follow-up PR to address my comments?
Sure, will do tonight.
Test build #57537 has finished for PR 11583 at commit
Merging to master and 2.0 branch.
…es with two phase aggregation

Author: Andrew Ray <[email protected]>
Closes #11583 from aray/fast-pivot.
(cherry picked from commit 9927441)
Signed-off-by: Yin Huai <[email protected]>
…stinct values with two phase aggregation

What changes were proposed in this pull request?
This is a follow-up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage.

How was this patch tested?
Existing unit tests and additional unit tests.

Author: Andrew Ray <[email protected]>
Closes #12861 from aray/fast-pivot-follow-up.
(cherry picked from commit d8f528c)
Signed-off-by: Yin Huai <[email protected]>
What changes were proposed in this pull request?
The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+), this can get extremely slow, since every input value is evaluated by every aggregate even though it affects the value of only one of them.
I'm proposing an alternate strategy for when there are 10+ (a somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first, we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second, we group by the grouping columns and use the new (non-public) PivotFirst aggregate, which rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally, we do a project to extract the array entries into the appropriate output columns.
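To make the strategy concrete, here is a hand-written approximation of the rewrite in DataFrame terms. The sales/year/course/earnings names are illustrative, and first(when(...)) stands in for the internal PivotFirst aggregate; it is cheap in the second phase because phase 1 has already reduced the data:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical schema: sales(year, course, earnings); we pivot on "course".
def twoPhasePivot(df: DataFrame, courses: Seq[String]): DataFrame = {
  // Phase 1: group by the grouping column(s) PLUS the pivot column, so each
  // input row is evaluated by exactly one aggregate.
  val phase1 = df.groupBy(col("year"), col("course"))
    .agg(sum(col("earnings")).as("earnings"))

  // Phase 2: group by the grouping column(s) alone and route each course's
  // pre-aggregated value into its own output column (the real implementation
  // fills an array indexed by pivot value via PivotFirst, then projects).
  val pivotCols = courses.map { c =>
    first(when(col("course") === c, col("earnings")), ignoreNulls = true).as(c)
  }
  phase1.groupBy(col("year")).agg(pivotCols.head, pivotCols.tail: _*)
}
```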
How was this patch tested?
Additional unit tests in DataFramePivotSuite and manual larger-scale testing.
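For flavor, a query of roughly this shape exercises the pivot path; the data and column names are illustrative, patterned after typical pivot examples rather than copied from the suite:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.sum

// Illustrative data: earnings per course per year.
val df = Seq(
  (2012, "dotNET", 10000), (2012, "Java", 20000),
  (2013, "dotNET", 48000), (2013, "Java", 30000)
).toDF("year", "course", "earnings")

// One output column per pivot value; with many distinct pivot values this
// is the case the two-phase strategy speeds up.
df.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .agg(sum($"earnings"))
  .show()
// +----+------+-----+
// |year|dotNET| Java|
// +----+------+-----+
// |2012| 10000|20000|
// |2013| 48000|30000|
// +----+------+-----+
```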