[SPARK-24206][SQL] Improve DataSource read benchmark code #21266

maropu · 2018-05-08T05:38:10Z

What changes were proposed in this pull request?

This PR adds the benchmark DataSourceReadBenchmark for ORC, Parquet, CSV, and JSON, based on the existing ParquetReadBenchmark and OrcReadBenchmark.
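
As a rough illustration of the kind of measurement described above, the following sketch writes the same data in each of the four formats and times a simple aggregation over each. It is not the code in this PR: the object name `ReadBenchmarkSketch`, the `timeMs` helper, the row count, and the single warm-up run are assumptions made for illustration, and it uses plain wall-clock timing rather than Spark's internal benchmark utility.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; not the DataSourceReadBenchmark added by this PR.
object ReadBenchmarkSketch {

  // Hypothetical helper: wall-clock time of a single run, in milliseconds.
  def timeMs(body: => Unit): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ReadBenchmarkSketch")
      // Pin the native ORC reader so the sketch does not depend on Hive classes.
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()
    val baseDir = Files.createTempDirectory("read-bench").toString
    val numRows = 1000000L // assumed size, small enough for a quick local run

    try {
      for (format <- Seq("parquet", "orc", "csv", "json")) {
        val path = s"$baseDir/$format"
        spark.range(numRows).write.format(format).save(path)

        // Text formats get an explicit schema so `id` is read back as a number
        // without a schema-inference pass.
        val reader = format match {
          case "csv" | "json" => spark.read.schema("id BIGINT")
          case _ => spark.read
        }
        reader.format(format).load(path).createOrReplaceTempView(s"${format}Table")

        val query = s"SELECT sum(id) FROM ${format}Table"
        spark.sql(query).collect() // warm-up run, not timed
        val elapsed = timeMs { spark.sql(query).collect() }
        println(f"$format%-8s $elapsed%10.1f ms")
      }
    } finally {
      spark.stop()
    }
  }
}
```

A single timed run like this is noisy; it is only meant to show the shape of a per-format comparison, not to produce numbers worth reporting.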

How was this patch tested?

N/A

maropu · 2018-05-08T05:38:48Z

I'll add benchmark results right after #21070 is merged.

maropu · 2018-05-08T05:40:10Z

Also, I'll make a follow-up PR for pushdown benchmarks:
master...maropu:UpdateParquetBenchmark

maropu · 2018-05-08T05:40:37Z

@gatorsmile @dongjoon-hyun

SparkQA · 2018-05-08T05:44:48Z

Test build #90359 has finished for PR 21266 at commit 09b3920.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-05-08T16:09:25Z

...src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala

@@ -1,339 +0,0 @@
-/*


@maropu, since you are merging the ParquetReadBenchmark and OrcReadBenchmark benchmarks, let's remove OrcReadBenchmark.

I feel we still need OrcReadBenchmark to compare native ORC with Hive built-in ORC?

I see. Never mind. I deleted the previous comment. The scope is only testing native orc here.

ok, thanks!

SparkQA · 2018-05-09T02:26:43Z

Test build #90392 has finished for PR 21266 at commit 2813706.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-10T04:24:59Z

Test build #90437 has finished for PR 21266 at commit 8aedbf0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-10T04:42:38Z

Test build #90438 has finished for PR 21266 at commit 1d93d99.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-14T08:51:45Z

Test build #90572 has finished for PR 21266 at commit fc96adb.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-05-14T09:36:50Z

retest this please

SparkQA · 2018-05-14T12:48:41Z

Test build #90579 has finished for PR 21266 at commit fc96adb.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-05-15T06:46:40Z

sql/core/src/test/scala/org/apache/spark/sql/DataSourceReadBenchmark.scala

+/**
+ * Benchmark to measure data source read performance.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>


I think this type of comment should not be included in a Scala file in this directory.
If you want to keep this comment, the Scala file should be moved to sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/, where files are not translated to docs.

[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: this
[error] * spark-submit --class <this class> <spark sql test jar>
[error] ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: spark
[error] * spark-submit --class <this class> <spark sql test jar>
[error]
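
One way to read the suggestion above is to keep the usage comment but place the file in the `execution/benchmark` directory and package. The sketch below shows only that placement. The `main` body is a placeholder, and the sketch assumes, as the review comment states, that sources in that directory are not passed through the doc build.

```scala
// File: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
package org.apache.spark.sql.execution.benchmark

/**
 * Benchmark to measure data source read performance.
 * To run this:
 *  spark-submit --class <this class> <spark sql test jar>
 */
object DataSourceReadBenchmark {
  def main(args: Array[String]): Unit = {
    // Placeholder body for illustration; the real benchmark cases would go here.
    println("benchmark cases would run here")
  }
}
```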

SparkQA · 2018-05-21T04:12:35Z

Test build #90875 has finished for PR 21266 at commit 3b6f541.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-21T04:31:28Z

Test build #90877 has finished for PR 21266 at commit fad31b7.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-21T07:05:02Z

Test build #90879 has finished for PR 21266 at commit d8c308f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-05-21T07:31:59Z

retest this please

SparkQA · 2018-05-21T11:23:56Z

Test build #90884 has finished for PR 21266 at commit d8c308f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-05-21T17:36:34Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala

+        }
+
+        sqlBenchmark.addCase("SQL ORC Vectorized") { _ =>
+          spark.sql("SELECT sum(id) FROM orcTable").collect()


Let's explicitly set these confs. Here, we expect the perf number with ORC_COPY_BATCH_TO_SPARK set to false. Please also double-check the other benchmarks and add the related confs.

I checked that ORC_COPY_BATCH_TO_SPARK=false in the other tests (I didn't find performance differences after explicitly setting it to false in line 50):
https://github.com/apache/spark/pull/21266/files#diff-ae11b49db05c9e6829cad071b112a742R50
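
To make the request above concrete, here is a minimal sketch of pinning the ORC-related confs around a benchmark case so the measured configuration is explicit. The `withConfs` helper and the `OrcConfSketch` object are hypothetical names, not code from this PR. The conf keys are the string forms as they existed around the time of this PR (`spark.sql.orc.copyBatchToSpark` is the key behind `ORC_COPY_BATCH_TO_SPARK`); later Spark versions may differ.

```scala
import java.nio.file.Files

import scala.util.Try

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; the helper and object names are hypothetical.
object OrcConfSketch {

  // Set the given confs for the duration of `body`, then restore the old values.
  def withConfs(spark: SparkSession)(pairs: (String, String)*)(body: => Unit): Unit = {
    val previous = pairs.map { case (key, _) => key -> Try(spark.conf.get(key)).toOption }
    pairs.foreach { case (key, value) => spark.conf.set(key, value) }
    try body finally {
      previous.foreach {
        case (key, Some(value)) => spark.conf.set(key, value)
        case (key, None) => spark.conf.unset(key)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("OrcConfSketch")
      .getOrCreate()
    val path = Files.createTempDirectory("orc-conf").resolve("data").toString
    try {
      withConfs(spark)(
        "spark.sql.orc.impl" -> "native",
        "spark.sql.orc.enableVectorizedReader" -> "true",
        "spark.sql.orc.copyBatchToSpark" -> "false" // key behind ORC_COPY_BATCH_TO_SPARK
      ) {
        spark.range(1000).write.orc(path)
        spark.read.orc(path).createOrReplaceTempView("orcTable")
        spark.sql("SELECT sum(id) FROM orcTable").collect()
      }
    } finally {
      spark.stop()
    }
  }
}
```

Setting the values explicitly, even when they match the defaults, keeps the benchmark's meaning stable if a default changes later.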

gatorsmile · 2018-05-21T17:37:00Z

@maropu Great work! Thanks for helping with this!

maropu · 2018-05-23T02:37:31Z

I'll update the benchmark results soon.

SparkQA · 2018-05-23T05:23:47Z

Test build #91012 has finished for PR 21266 at commit 5eab1a5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-05-23T05:52:49Z

retest this please

SparkQA · 2018-05-23T07:05:01Z

Test build #91017 has finished for PR 21266 at commit 5eab1a5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-05-23T12:57:36Z

retest this please

SparkQA · 2018-05-23T15:46:42Z

Test build #91036 has finished for PR 21266 at commit 5eab1a5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-05-23T16:11:50Z

retest this please

SparkQA · 2018-05-23T19:54:49Z

Test build #91049 has finished for PR 21266 at commit 5eab1a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-05-23T20:01:31Z

LGTM

Thanks! Merged to master.

dongjoon-hyun reviewed May 8, 2018

maropu force-pushed the DataSourceReadBenchmark branch from 8aedbf0 to 171e89a Compare May 10, 2018 04:23

maropu force-pushed the DataSourceReadBenchmark branch from 171e89a to 1d93d99 Compare May 10, 2018 04:26

maropu force-pushed the DataSourceReadBenchmark branch from 1d93d99 to fc96adb Compare May 14, 2018 08:35

kiszk reviewed May 15, 2018

maropu force-pushed the DataSourceReadBenchmark branch from 3b6f541 to fad31b7 Compare May 21, 2018 04:03

maropu force-pushed the DataSourceReadBenchmark branch from fad31b7 to d8c308f Compare May 21, 2018 04:15

gatorsmile reviewed May 21, 2018

maropu added 5 commits May 23, 2018 11:37

Fix

7c9d1a6

Fix

ca78b84

Fix

78a8ff4

Fix

1ad9e37

Fix

5eab1a5

maropu force-pushed the DataSourceReadBenchmark branch from 274aa8a to 5eab1a5 Compare May 23, 2018 02:37

asfgit closed this in 84557bc May 23, 2018
