-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-25299: Add rest of shuffle writer benchmarks #507
SPARK-25299: Add rest of shuffle writer benchmarks #507
Conversation
@yifeih can you target this to spark-25299? The downstream PR merged. |
output = output) | ||
|
||
addBenchmarkCase(benchmark, "without transferTo") { timer => | ||
val shuffleWriter = getWriter(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not closing the writer in this case.
I'm taking a closer look at this and wonder if we can model this similar to how JUnit models Parameterized Tests. Parameterized tests are useful when you have a configuration matrix that you want to try. In this case, the configuration matrix might be:
- Dataset size: small vs. large
- TransferTo: Enabled vs. disabled
- writer type: unsafe, sort, bypass-merge-sort
Can we perhaps encode this explicitly? I think something like this, though this is by no means the most elegant possible - there might be something better here:
sealed trait WriterType
case object UnsafeWriterType extends WriterType
case object SortWriterType extends WriterType
case object BypassMergeSortWriterType extends WriterType
def doBenchmark(writerType: WriterType, transferTo: Boolean, useLargeDataset: Boolean): Unit = {
val benchmarkName = s"shuffle writer, type $writerType, transferTo: $transferTo, datasetSize: ${if useLargeDataset "large" else "small"}"
val writer = writerType match {
case UnsafeWriterType => new UnsafeShuffleWriter(... transferTo)
case SortWriterType => new SortShuffleWriter(... transferTo)
case BypassMergeSortWriterType => new BypassMerge... (... transferTo)
default => throw...
}
val datasetSize = if (useLargeDataset )... else ...
// Do the benchmark stuff given your objects
}
// I think there's a much more Scala fluent esque way to do this by the way - look into product APIs?
Seq(UnsafeWriterType, SortWriterType, BypassMergeSSortWriterType).forEach { writerType =>
Seq(true, false).forEach { transferTo =>
Seq(true, false).forEach { useLargeDataset =>
doBenchmark(writerType, transferTo, useLargeDataset)
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I think the problem comes when the setup is slightly different. For example, the different writers need different mocks and objects setup because they're different in nature. But i think we can make this a little more parameterized, like passing in the writer, size, and some spill file assert number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a tricky balance because we don't want to make this a giant switch-case statement either. If the writer bootstrap is highly dependent on the type of writer it is, I would say the parameters can be just the size and enabling transferTo vs. not, and then we have separate methods each for bootstrapping their own writer against those parameters.
I don't have a great sense of what's best here; we're not using something like JUnit which has a canonical framework for this, so whatever we build has to come from first principles. I'm open to ideas, so feel free to propose something and we can evaluate it together.
private val DATA_SIZE_LARGE = | ||
PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES/2/DEFAULT_DATA_STRING_SIZE | ||
|
||
def getWriter(transferTo: Boolean): UnsafeShuffleWriter[String, String] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One idea I just thought of - why not make newWriter(transferTo: Boolean)
an abstract
method in ShuffleWRiterBenchmarkBase
, then pass transferTo
to addBenchmarkCase
but removes the need to pass the writer supplier? The superclass can just call such an abstract method in addBenchmarkCase
. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transferTo
flag only applies to BypassMergeSortShuffleWriter
and UnsafeShuffleWriter
, but the SortShuffleWriter
tests have other parameters, none of which are transferTo
@@ -50,12 +50,16 @@ done | |||
|
|||
echo "Running SPARK-25299 benchmarks" | |||
|
|||
SPARK_GENERATE_BENCHMARK_FILES=1 ./build/sbt "sql/test:runMain org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriterBenchmark" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for right now, but if this list grows beyond size 5, let's make a text file with all the classes that we need to benchmark, and then for-loop over it.
@@ -121,10 +121,23 @@ abstract class ShuffleWriterBenchmarkBase extends BenchmarkBase { | |||
blockManager) | |||
} | |||
|
|||
def addBenchmarkCase(benchmark: Benchmark, name: String)(func: Benchmark.Timer => Unit): Unit = { | |||
def addBenchmarkCase(benchmark: Benchmark, name: String, size: Int, | |||
writerSupplier: () => ShuffleWriter[String, String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Start args on this line, then 1 arg per line, with 4-space indentation from def
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a style change and then we're good to merge here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
================================================================================================
BypassMergeSortShuffleWriter write
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite without spill: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without disk spill 2 3 2 0.5 2048.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite with spill: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo 7369 7565 121 0.9 1098.1 1.0X
with transferTo 7515 7568 33 0.9 1119.8 1.0X
================================================================================================
SortShuffleWriter writer
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter without spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills 10 16 5 0.1 10192.2 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter with spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
no map side combine 14008 14157 133 0.5 2087.3 1.0X
with map side aggregation 13852 13946 75 0.5 2064.1 1.0X
with map side sort 13797 13984 142 0.5 2055.9 1.0X
================================================================================================
UnsafeShuffleWriter write
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter without spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills 20 23 3 0.1 19926.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter with spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo 15742 15791 40 0.9 1172.8 1.0X
with transferTo 15888 15994 67 0.8 1183.7 1.0X
No description provided.