[SPARK-24206][SQL] Improve DataSource read benchmark code #21266
I'll add benchmark results right after #21070 is merged.
Also, I'll make a follow-up PR for the pushdown benchmarks.
Test build #90359 has finished for PR 21266 at commit
@@ -1,339 +0,0 @@
/*
@maropu, since you are merging the ParquetReadBenchmark and OrcReadBenchmark benchmarks, let's remove OrcReadBenchmark.
I feel we still need OrcReadBenchmark to compare native ORC with Hive built-in ORC?
I see. Never mind. I deleted the previous comment. The scope here is only testing native ORC.
ok, thanks!
Test build #90392 has finished for PR 21266 at commit
8aedbf0 to 171e89a
Test build #90437 has finished for PR 21266 at commit
171e89a to 1d93d99
Test build #90438 has finished for PR 21266 at commit
1d93d99 to fc96adb
Test build #90572 has finished for PR 21266 at commit
retest this please
Test build #90579 has finished for PR 21266 at commit
/**
 * Benchmark to measure data source read performance.
 * To run this:
 *  spark-submit --class <this class> <spark sql test jar>
I think this kind of comment should not be included in a Scala file in this directory.
If you want to keep the comment, the Scala file should be moved to sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/,
where files are not translated to docs. The doc build currently fails like this:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: this
[error] * spark-submit --class <this class> <spark sql test jar>
[error] ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: spark
[error] * spark-submit --class <this class> <spark sql test jar>
[error]
ok
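For reference, here is a minimal sketch of one way to keep the usage note without tripping the doc step. It assumes the doc generator only picks up Scaladoc comments (the ones opened with two asterisks), so a plain block comment is left alone. The object name and the usage helper are hypothetical and not part of this PR; moving the file under the test directory, as suggested above, avoids the problem as well.

```scala
/*
 * Benchmark to measure data source read performance.
 * To run this:
 *  spark-submit --class <this class> <spark sql test jar>
 *
 * This is deliberately a plain block comment rather than Scaladoc, on the
 * assumption that only Scaladoc comments are translated for the doc build,
 * so the angle-bracket placeholders are never parsed as unknown tags.
 */
object ReadBenchmarkUsageSketch {
  // Hypothetical helper: renders the same usage line at runtime.
  def usage(className: String, jar: String): String =
    s"spark-submit --class $className $jar"

  def main(args: Array[String]): Unit =
    println(usage("<this class>", "<spark sql test jar>"))
}
```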
3b6f541 to fad31b7
Test build #90875 has finished for PR 21266 at commit
fad31b7 to d8c308f
Test build #90877 has finished for PR 21266 at commit
Test build #90879 has finished for PR 21266 at commit
retest this please
Test build #90884 has finished for PR 21266 at commit
}

sqlBenchmark.addCase("SQL ORC Vectorized") { _ =>
  spark.sql("SELECT sum(id) FROM orcTable").collect()
Can we set these confs explicitly? Here, we expect the perf numbers for the case where ORC_COPY_BATCH_TO_SPARK is set to false. Please also double-check the other benchmarks and add the related confs there too.
ok
I checked ORC_COPY_BATCH_TO_SPARK=false in the other tests (I didn't find any performance difference after explicitly setting it to false at line 50):
https://github.com/apache/spark/pull/21266/files#diff-ae11b49db05c9e6829cad071b112a742R50
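To illustrate what explicitly setting a conf around a benchmark case can look like, here is a self-contained sketch of the set, run, restore pattern. The mutable map is only a stand-in for the session configuration, and withConf is a hypothetical helper written for this example. Spark's own test utilities offer a similar withSQLConf helper, and the key string used below is assumed to be the one behind ORC_COPY_BATCH_TO_SPARK.

```scala
import scala.collection.mutable

object ExplicitConfSketch {
  // Stand-in for a session's runtime configuration (hypothetical; not Spark's SQLConf).
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Set the given pairs, run the body, then restore whatever was there before.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None) => conf.remove(k)
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed key name for ORC_COPY_BATCH_TO_SPARK.
    val key = "spark.sql.orc.copyBatchToSpark"
    val seen = withConf(key -> "false") {
      // A benchmark case would run here with the conf pinned to a known value.
      conf(key)
    }
    println(s"inside: $seen, after: ${conf.get(key)}")
  }
}
```

Pinning the value inside the helper means the measured case does not depend on whatever default the session happens to have, and the previous setting is restored even if the case throws.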
@maropu Great work! Thanks for helping with this!
274aa8a to 5eab1a5
I'll update the benchmark results soon.
Test build #91012 has finished for PR 21266 at commit
retest this please
Test build #91017 has finished for PR 21266 at commit
retest this please
Test build #91036 has finished for PR 21266 at commit
retest this please
Test build #91049 has finished for PR 21266 at commit
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?
This PR adds the benchmark code DataSourceReadBenchmark for orc, parquet, csv, and json, based on the existing ParquetReadBenchmark and OrcReadBenchmark.
How was this patch tested?
N/A
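As a rough illustration of the one-benchmark, several-formats structure described above, here is a self-contained sketch of the addCase pattern. MiniBenchmark is a hypothetical stand-in written for this example, not Spark's Benchmark class, and the summation is only a placeholder for a query such as the SELECT sum(id) scan quoted in the review.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a benchmark harness; not Spark's Benchmark class.
class MiniBenchmark(name: String, iterations: Int) {
  private val cases = ArrayBuffer.empty[(String, Int => Unit)]

  // Register a named case; the function receives the iteration index.
  def addCase(caseName: String)(f: Int => Unit): Unit =
    cases += ((caseName, f))

  // Run every case `iterations` times and return the best wall-clock time (ms) per case.
  def run(): Seq[(String, Double)] = cases.toSeq.map { case (caseName, f) =>
    val best = (0 until iterations).map { i =>
      val start = System.nanoTime()
      f(i)
      (System.nanoTime() - start) / 1e6
    }.min
    println(f"$name / $caseName: best $best%.3f ms")
    caseName -> best
  }
}

object DataSourceReadSketch {
  def main(args: Array[String]): Unit = {
    val benchmark = new MiniBenchmark("read sketch", iterations = 3)
    // One case per format; the loop body is a placeholder workload, not a real scan.
    Seq("parquet", "orc", "csv", "json").foreach { format =>
      benchmark.addCase(s"SQL $format") { _ =>
        val total = (1L to 100000L).sum
        require(total > 0)
      }
    }
    benchmark.run()
  }
}
```

Registering the cases in a loop over the format names keeps the per-format code identical, which is the main point of merging the separate benchmarks into one.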