[SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method #22501
Conversation
Test build #96369 has finished for PR 22501 at commit
 * 1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
 * 2. build/sbt "sql/test:runMain <this class>"
 * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 * Results will be written to "benchmarks/WideSchemaBenchmark-results.txt".
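For readers unfamiliar with the pattern described in the header above, here is a minimal sketch of a main-method benchmark built on Spark's BenchmarkBase. It is a hypothetical example, not code from this PR, and it assumes the later BenchmarkBase signature in which runBenchmarkSuite takes mainArgs (the revision in this PR took no arguments):

```scala
package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}

/**
 * Hypothetical benchmark following the main-method pattern.
 * Run with SPARK_GENERATE_BENCHMARK_FILES=1 to write
 * benchmarks/ExampleBenchmark-results.txt instead of printing to stdout.
 */
object ExampleBenchmark extends BenchmarkBase {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("summation") {
      val N = 1000000
      // `output` is the Option[OutputStream] managed by BenchmarkBase.
      val benchmark = new Benchmark("sum of longs", N, output = output)
      benchmark.addCase("while loop") { _ =>
        var i = 0
        var sum = 0L
        while (i < N) { sum += i; i += 1 }
      }
      benchmark.run()
    }
  }
}
```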
Could you fix the doc generation failure?
Thanks @dongjoon-hyun. Actually I'm waiting for #22484. I want to move withTempDir() to RunBenchmarkWithCodegen.scala.
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideSchemaBenchmark.scala
Test build #97056 has finished for PR 22501 at commit
@@ -48,15 +48,11 @@ abstract class BenchmarkBase {
     if (!file.exists()) {
       file.createNewFile()
     }
-    output = Some(new FileOutputStream(file))
+    output = Option(new FileOutputStream(file))
This looks like an irrelevant piggyback change.
Changed here because of: #22443 (comment)
IIUC, @HyukjinKwon meant to do that when you need to touch this file.
I am worried that I will forget it after a long time, so I am changing it this time. Should I revert it?
Why do you replace Some with Option? Are you worried that new FileOutputStream(file) becomes null?
My point was that, from my cursory look, there's no point in checking null below. If there's no chance that it becomes null, we can leave it as Some and remove the null check below.
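To make the distinction concrete, a small self-contained Scala sketch (not from the PR): Option(x) maps null to None, while Some(null) is a perfectly legal value that still needs a null check later.

```scala
object SomeVsOption extends App {
  val stream: java.io.FileOutputStream = null // stand-in for a call that might return null

  println(Some(stream))   // prints "Some(null)" -- Some wraps the value blindly
  println(Option(stream)) // prints "None"       -- Option absorbs the null
}
```

Hence the review point: a Java constructor like new FileOutputStream(file) never returns null (it throws on failure), so Some is safe here and a separate null check downstream is redundant.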
Test build #97534 has finished for PR 22501 at commit
1 cols x 100000 rows (read in-mem)     22 /  25   4.6    219.4   1.0X
1 cols x 100000 rows (exec in-mem)     22 /  28   4.5    223.8   1.0X
1 cols x 100000 rows (read parquet)    45 /  49   2.2    449.6   0.5X
1 cols x 100000 rows (write parquet)  204 / 223   0.5   2044.4   0.1X
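For readers decoding these rows: per the header of Spark's standard Benchmark output (elided in this scrape), the columns are best/average time in ms, throughput in millions of rows per second, per-row time in ns, and the ratio of the first case's best time to this case's best time. Worked through approximately on the rows above:

```
Rate:     100000 rows / 22 ms                        ≈ 4.6 M rows/s
Per Row:  22 ms / 100000 rows                        ≈ 219 ns
Relative: 22 ms (read in-mem) / 45 ms (read parquet) ≈ 0.5X
```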
Given the difference in ratio, this might be a small regression in the Parquet writer since Spark 2.1.0 (SPARK-17335).
cc @cloud-fan, @gatorsmile, @rdblue
I have no idea how this happens. Can you create a JIRA ticket to investigate this regression?
It may be a Parquet issue. I found that the binary write performance is a little worse after upgrading to Parquet 1.10.0: apache/parquet-java#505. I will verify it later.
The following EC2 result shows a consistent ratio, like Spark 2.1.0. The result on Mac seemed to be unstable for some unknown reason; see #22501 (comment).
1 cols x 100000 rows (read parquet) 61 / 70 1.6 610.2 0.6X
1 cols x 100000 rows (write parquet) 209 / 233 0.5 2086.1 0.2X
@dongjoon-hyun, so you are saying that it doesn't appear that there is a performance regression, right?
2500 cols x 40 rows (read in-mem)    261 /  434   0.4   2607.3   0.1X
2500 cols x 40 rows (exec in-mem)    624 /  701   0.2   6240.5   0.0X
2500 cols x 40 rows (read parquet)   196 /  301   0.5   1963.4   0.1X
2500 cols x 40 rows (write parquet)  687 / 1049   0.1   6870.6   0.0X
The gap between best and average is too high in line 32 and line 33.
I'll try to run this on EC2, too.
FYI, this large gap disappeared in the EC2 result.
1 deep x 100000 rows (write parquet)   195 /  219   0.5    1945.1   0.1X
100 deep x 1000 rows (read in-mem)      39 /   57   2.5     393.1   0.5X
100 deep x 1000 rows (exec in-mem)     480 /  556   0.2    4795.7   0.0X
100 deep x 1000 rows (read parquet)   7943 / 7950   0.0   79427.5   0.0X
Hi, @wangyum. I ran the test on EC2.
Thank you guys for refreshing the benchmarks and results! It's very helpful. If possible, can we post the perf regressions we found in the umbrella JIRA? Then people can see if the perf regression is reasonable (if we have addressed it) or investigate how the regression was introduced. Thanks!
Test build #97627 has finished for PR 22501 at commit
Seems Jenkins is broken, cc @shaneknapp.
@cloud-fan After updating on EC2, most ratios and values look more stable and reasonable now. The following two are noticeable changes, but they look like Parquet writer improvements (instead of regressions).

1. The read/write ratio is reverted:

- 128 x 8 deep x 1000 rows (read parquet)     69 /  74   1.4    693.9   0.2X
- 128 x 8 deep x 1000 rows (write parquet)    78 /  83   1.3    777.7   0.2X
+ 128 x 8 deep x 1000 rows (read parquet)    351 / 379   0.3   3510.3   0.1X
+ 128 x 8 deep x 1000 rows (write parquet)   199 / 203   0.5   1988.3   0.2X

2. The read/write ratio changed noticeably:

- 1024 x 11 deep x 100 rows (read parquet)   426 /  433   0.2    4263.7   0.0X
- 1024 x 11 deep x 100 rows (write parquet)   91 /   98   1.1     913.5   0.1X
+ 1024 x 11 deep x 100 rows (read parquet)  2063 / 2078   0.0   20629.2   0.0X
+ 1024 x 11 deep x 100 rows (write parquet)  248 /  266   0.4    2475.1   0.1X

Since this is the first attempt to track this and the previous result is too old, there are some obvious limitations in the comparison. From Spark 2.4.0, we can get a consistent comparison instead.
Retest this please.
Test build #97642 has finished for PR 22501 at commit
Retest this please.
Test build #97644 has finished for PR 22501 at commit
@cloud-fan -- pip isn't broken... the actual error is found right above what you cut and pasted:

I won't be able to look any deeper into this until at least tomorrow at the earliest.
I guess it's related to pip packaging, though. It's from setup.py.
I am looking at each commit from the latest to the oldest at https://github.com/apache/spark/commits/master
Thanks. It might rather be related to external factors.
Thanks. When it was successful, this is a part of the log from this:

Now, we are seeing the following. Is this a problem related to py4j or pyspark on external sites?
Is this the oldest test failure related to this type of failure?
Yup, I made a fix: #22782
Thanks, I found
Retest this please.
Test build #97665 has finished for PR 22501 at commit
Merged to master.
Thank you, @wangyum and all!
## What changes were proposed in this pull request?

Refactor `WideSchemaBenchmark` to use main method.

1. Use `spark-submit`:
```console
bin/spark-submit --class org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
```

2. Generate benchmark result:
```console
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
```

## How was this patch tested?

Manual tests.

Closes #22501 from wangyum/SPARK-25492.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Any idea how much memory this benchmark requires? I set 32 GB but it crashes with OOM on master:
It sounds like a regression, @MaxGekk. This test suite was executed with the example in the code, without any additional information like
FYI, this was tested in this PR,
I have rechecked the benchmark on the recent master - it is still failing with OOM. I opened the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-30429
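For anyone reproducing this, one way to give the sbt-launched benchmark a larger heap is the launcher's -mem flag (in MB, which sets -Xmx); this is a sketch, and the 8192 value is an arbitrary assumption, not a number from this thread:

```console
# Assumption: build/sbt forwards -mem to the JVM heap settings, as the standard sbt launcher does.
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt -mem 8192 \
  "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
```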