[SPARK-25476][SPARK-25510][TEST] Refactor AggregateBenchmark and add a new trait to better support Dataset and DataFrame API #22484
Conversation
================================================================================================
stat functions
================================================================================================
@davies Do you know how to generate this benchmark:
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
Lines 70 to 78 in 3c3eebc
Using ImperativeAggregate (as implemented in Spark 1.6):
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
stddev:                          Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
stddev w/o codegen                    2019.04            10.39           1.00 X
stddev w codegen                      2097.29            10.00           0.96 X
kurtosis w/o codegen                  2108.99             9.94           0.96 X
kurtosis w codegen                    2090.69            10.03           0.97 X
Test build #96338 has finished for PR 22484 at commit
# Conflicts:
# sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
override def benchmark(): Unit = {
  runBenchmark("aggregate without grouping") {
    val N = 500L << 22
    runBenchmark("agg w/o group", N) {
The runBenchmark here is different from the one on line 48, but they have the same name. We should use a different name.
Yes. Do you have a suggested name?
Well, I don't have a good name in mind. How about making the runBenchmark method of RunBenchmarkWithCodegen override the one in BenchmarkBase?
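For context, a minimal sketch of the clash being discussed (simplified signatures, not the actual Spark code): BenchmarkBase exposes a one-argument runBenchmark that only groups output under a header, while the new trait adds a two-argument runBenchmark that toggles whole-stage codegen, so the two read very differently despite sharing a name.

```scala
// Illustrative sketch of the naming clash; simplified from the actual Spark traits.
trait BenchmarkBase {
  // Groups the output of several benchmarks under one header.
  def runBenchmark(benchmarkName: String)(func: => Any): Unit = {
    println(s"================ $benchmarkName ================")
    func
  }
}

trait RunBenchmarkWithCodegen extends BenchmarkBase {
  // Same name, different intent: runs `f` once with codegen off and once with it on.
  def runBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = {
    // ... add a "codegen = F" case and a "codegen = T" case, then run them ...
  }
}
```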
Test build #96473 has finished for PR 22484 at commit
 * Common base trait for micro benchmarks that are supposed to run standalone (i.e. not together
 * with other test suites).
 */
trait RunBenchmarkWithCodegen extends BenchmarkBase {
How about RunBenchmarkWithCodegen -> SqlBaseBenchmark?
  }

  /** Runs function `f` with whole stage codegen on and off. */
  def runBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = {
How about runBenchmark -> runBenchmarkWithCodegen?
Test build #96500 has finished for PR 22484 at commit
retest this please
Test build #96502 has finished for PR 22484 at commit
LGTM, cc @dongjoon-hyun for sign-off
@wangyum. Could you make the title and description up-to-date for this PR content? Also, please update the JIRA title and description, too.
val N = 5 << 20
benchmark.addCase("codegen = T hugeMethodLimit = 1500") { iter =>
  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
  spark.conf.set(SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key, "1500")
Although this is not a problem of this refactoring, this test suite seems to be unhealthy because the configuration from the previous benchmark is propagated to the next benchmark.
Can we fix this test suite to use withSQLConf?
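A rough sketch of the suggested shape, reusing the names from the hunk above (benchmark, f, SQLConf) and assuming a withSQLConf helper that restores the previous values when the block finishes:

```scala
// Sketch: scope the configuration to this case so it cannot leak into the next benchmark.
benchmark.addCase("codegen = T hugeMethodLimit = 1500") { _ =>
  withSQLConf(
    SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
    SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key -> "1500") {
    f()
  }
}
```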
Thank you for updating, @wangyum.
spark.conf.set("spark.sql.codegen.wholeStage", "false") | ||
f() | ||
benchmark.addCase(s"codegen = F", numIters = 2) { _ => | ||
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> false.toString) { |
"false"
instead of false.toString
?
When we use Seq(true, false).foreach { value => ... }, we usually do s"$value". But for this, I think "false" is the simplest and the best.
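For reference, a small sketch of the pattern being referred to (an illustrative fragment; benchmark, withSQLConf, and f are assumed from the surrounding code):

```scala
// When looping over both settings, interpolating the loop variable is the natural choice.
Seq(true, false).foreach { value =>
  benchmark.addCase(s"codegen = $value") { _ =>
    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> s"$value") {
      f()
    }
  }
}
```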
benchmark.addCase(s"codegen = T hashmap = F", numIters = 3) { _ =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> true.toString,
    SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> false.toString,
    "spark.sql.codegen.aggregate.map.vectorized.enable" -> false.toString) {
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> "false",
  "spark.sql.codegen.aggregate.map.vectorized.enable" -> "false") {
@wangyum. This one is also about indentation. Please note that withSQLConf( sits on its own line above the first configuration.
Do you mean change
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> "false",
  "spark.sql.codegen.aggregate.map.vectorized.enable" -> "false") {
  f()
}
to
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> "false",
  "spark.sql.codegen.aggregate.map.vectorized.enable" -> "false") {
  f()
}
?
Yes!
Fixed
benchmark.addCase(s"codegen = T hashmap = T", numIters = 5) { _ =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> true.toString,
    SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> true.toString,
    "spark.sql.codegen.aggregate.map.vectorized.enable" -> true.toString) {
ditto.
spark.conf.set("spark.sql.codegen.wholeStage", value = false) | ||
f() | ||
benchmark.addCase(s"codegen = F", numIters = 2) { _ => | ||
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> false.toString) { |
"false"
instead of false.toString
.
benchmark.addCase(s"codegen = T hashmap = F", numIters = 3) { _ =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> true.toString,
    SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> false.toString,
    "spark.sql.codegen.aggregate.map.vectorized.enable" -> false.toString) {
ditto.
 * Sets all SQL configurations specified in `pairs`, calls `f`, and then restores all SQL
 * configurations.
 */
protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
Shall we avoid duplicating the existing withSQLConf logic? Let me try to fix it.
@wangyum. Thank you for waiting. Since SPARK-25534 is merged, could you use SQLHelper.withSQLConf instead?
Yes, I will do it now.
@dongjoon-hyun I finished it.
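For readers following the thread, a rough approximation of what a withSQLConf-style helper does (this is a simplified sketch, not the actual SQLHelper implementation, and it assumes a spark: SparkSession in scope): it remembers the current values of the given keys, applies the new ones, runs the body, and then restores or unsets the originals.

```scala
// Simplified approximation of a withSQLConf-style helper; not the real SQLHelper code.
protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
  val conf = spark.conf
  // Remember the previous value (if any) of every key we are about to change.
  val previous: Seq[(String, Option[String])] =
    pairs.map { case (key, _) => key -> conf.getOption(key) }
  pairs.foreach { case (key, value) => conf.set(key, value) }
  try f finally {
    // Restore old values, or unset keys that had no previous value.
    previous.foreach {
      case (key, Some(old)) => conf.set(key, old)
      case (key, None) => conf.unset(key)
    }
  }
}
```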
I made an official PR for
Test build #96616 has finished for PR 22484 at commit
retest this please
Test build #96624 has finished for PR 22484 at commit
@wangyum Could you review and merge wangyum#12, too?
def f(): Unit = sparkSession.range(N).selectExpr("id", "cast(id & 1023 as string) as k")
  .groupBy("k").count().collect()
benchmark.addCase(s"codegen = F", numIters = 2) { _ =>
  spark.conf.set("spark.sql.codegen.wholeStage", "false")
Shall we remove this redundant line 148?
sparkSession.conf.set("spark.sql.codegen.aggregate.map.vectorized.enable", "false")
f()
}
benchmark.addCase(s"codegen = F", numIters = 2) { _ =>
s"codegen = F"
-> "codegen = F"
?
Thanks @dongjoon-hyun. I plan to add EmptyInterpolatedStringChecker to scalastyle-config.xml to avoid this issue: SPARK-25553
sparkSession.conf.set("spark.sql.codegen.aggregate.map.vectorized.enable", "true")
f()
}
benchmark.addCase(s"codegen = T hashmap = F", numIters = 3) { _ =>
s"codegen = T hashmap = F"
-> "codegen = T hashmap = F"
Could you fix all instances like this?
Test build #96667 has finished for PR 22484 at commit
Test build #96666 has finished for PR 22484 at commit
retest this please
Test build #96686 has finished for PR 22484 at commit
@dongjoon-hyun Other refactorings are waiting for this commit.
/**
 * Common base trait to run benchmark with the Dataset and DataFrame API.
 */
trait SqlBasedBenchmark extends BenchmarkBase with SQLHelper {
@wangyum and @gengliangwang. What is the future plan for the usage of both SqlBasedBenchmark and BenchmarkWithCodegen? I'm wondering what the criteria are for choosing each trait.
I think we can remove BenchmarkWithCodegen after all the refactoring is finished.
So, if @gengliangwang agrees with that, SqlBasedBenchmark is another refactoring (renaming and improvement) like [SPARK-25499][TEST] Refactor BenchmarkBase and Benchmark. Could you do that in a separate PR in advance?
Actually, I don't think the name SqlBasedBenchmark is appropriate. From the name we can't tell it is about benchmarking with/without whole-stage codegen. I will try to come up with a better name, or we can discuss it in this thread.
@dongjoon-hyun, in #22522 I feel that it would be better to have an example refactoring, so we can see how the new trait is used. We can move back to #22522. I am OK either way.
How about CodegenBenchmarkBase? This is the best I can think of. @wangyum @dongjoon-hyun @cloud-fan
Maybe we can add more common functions in the future, e.g. runBenchmarkWithCodegen, runBenchmarkWithParquetPushDown, runBenchmarkWithOrcPushDown...
Then each function could be in a different trait... I don't think that runBenchmarkWithCodegen has much in common with runBenchmarkWithParquetPushDown.
Thank you, @gengliangwang and @wangyum . Let me think about this again.
For the naming, let's keep the current one for now.
 */
trait SqlBasedBenchmark extends BenchmarkBase with SQLHelper {

  val spark: SparkSession = getSparkSession
val spark -> protected val spark
  }

  /** Runs function `f` with whole stage codegen on and off. */
  def runBenchmarkWithCodegen(name: String, cardinality: Long)(f: => Unit): Unit = {
This should be final def runBenchmarkWithCodegen instead of def runBenchmarkWithCodegen.
      .getOrCreate()
  }

  /** Runs function `f` with whole stage codegen on and off. */
Can we use codegenBenchmark instead? runBenchmarkWithCodegen looks like an extension of runBenchmark. It's more like bitEncodingBenchmark or sortBenchmark.
Test build #96811 has finished for PR 22484 at commit
+1, LGTM.
Merged to master.
Thank you, @wangyum, @cloud-fan and @gengliangwang!
## What changes were proposed in this pull request?
Remove `BenchmarkWithCodegen` as we don't use it anymore. More details: #22484 (comment)
## How was this patch tested?
N/A
Closes #22985 from wangyum/SPARK-25510.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…a new trait to better support Dataset and DataFrame API
## What changes were proposed in this pull request?
This PR does 2 things:
1. Add a new trait (`SqlBasedBenchmark`) to better support Dataset and DataFrame API.
2. Refactor `AggregateBenchmark` to use main method.
Generate benchmark result:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark"
```
## How was this patch tested?
manual tests
Closes apache#22484 from wangyum/SPARK-25476.
Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR does 2 things:
1. Add a new trait (SqlBasedBenchmark) to better support Dataset and DataFrame API.
2. Refactor AggregateBenchmark to use main method.
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark"
How was this patch tested?
manual tests
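As an illustration of the refactored pattern, a main-method benchmark built on the new trait might look roughly like this (a sketch only: the object name and query are made up, the entry-point hook is named as in the hunk quoted earlier, and the real AggregateBenchmark covers many more cases):

```scala
// Hypothetical, trimmed-down benchmark using the trait from this PR.
object AggregateBenchmarkSketch extends SqlBasedBenchmark {

  // Entry point invoked by the benchmark runner (name follows the snippet quoted above).
  override def benchmark(): Unit = {
    runBenchmark("aggregate without grouping") {
      val N = 500L << 22
      codegenBenchmark("agg w/o group", N) {
        spark.range(N).selectExpr("sum(id)").collect()
      }
    }
  }
}
```

Its result file would then be produced with the SPARK_GENERATE_BENCHMARK_FILES command shown in the description above.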