Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-25299: Add rest of shuffle writer benchmarks #507

Merged
merged 80 commits into from
Mar 18, 2019

Conversation

yifeih
Copy link

@yifeih yifeih commented Mar 5, 2019

No description provided.

@mccheah
Copy link

mccheah commented Mar 14, 2019

@yifeih can you target this to spark-25299? The downstream PR merged.

@yifeih yifeih changed the base branch from yh/add-benchmarks-and-ci to spark-25299 March 14, 2019 20:47
output = output)

addBenchmarkCase(benchmark, "without transferTo") { timer =>
val shuffleWriter = getWriter(false)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not closing the writer in this case.

I'm taking a closer look at this and wonder if we can model this similar to how JUnit models Parameterized Tests. Parameterized tests are useful when you have a configuration matrix that you want to try. In this case, the configuration matrix might be:

  • Dataset size: small vs. large
  • TransferTo: Enabled vs. disabled
  • writer type: unsafe, sort, bypass-merge-sort

Can we perhaps encode this explicitly? I think something like this, though this is by no means the most elegant possible - there might be something better here:

sealed trait WriterType
case object UnsafeWriterType extends WriterType
case object SortWriterType extends WriterType
case object BypassMergeSortWriterType extends WriterType

def doBenchmark(writerType: WriterType, transferTo: Boolean, useLargeDataset: Boolean): Unit = {
  val benchmarkName = s"shuffle writer, type $writerType, transferTo: $transferTo, datasetSize: ${if useLargeDataset "large" else "small"}"
  val writer = writerType match {
    case UnsafeWriterType => new UnsafeShuffleWriter(... transferTo)
    case SortWriterType => new SortShuffleWriter(... transferTo)
    case BypassMergeSortWriterType => new BypassMerge... (... transferTo)
    default => throw...
  }
  val datasetSize = if (useLargeDataset )... else ...
  // Do the benchmark stuff given your objects
}

// I think there's a much more Scala fluent esque way to do this by the way - look into product APIs?
Seq(UnsafeWriterType, SortWriterType, BypassMergeSSortWriterType).forEach { writerType =>
  Seq(true, false).forEach { transferTo =>
    Seq(true, false).forEach { useLargeDataset =>
      doBenchmark(writerType, transferTo, useLargeDataset)
    }
  }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I think the problem comes when the setup is slightly different. For example, the different writers need different mocks and objects setup because they're different in nature. But i think we can make this a little more parameterized, like passing in the writer, size, and some spill file assert number

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tricky balance because we don't want to make this a giant switch-case statement either. If the writer bootstrap is highly dependent on the type of writer it is, I would say the parameters can be just the size and enabling transferTo vs. not, and then we have separate methods each for bootstrapping their own writer against those parameters.

I don't have a great sense of what's best here; we're not using something like JUnit which has a canonical framework for this, so whatever we build has to come from first principles. I'm open to ideas, so feel free to propose something and we can evaluate it together.

private val DATA_SIZE_LARGE =
PackedRecordPointer.MAXIMUM_PAGE_SIZE_BYTES/2/DEFAULT_DATA_STRING_SIZE

def getWriter(transferTo: Boolean): UnsafeShuffleWriter[String, String] = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea I just thought of - why not make newWriter(transferTo: Boolean) an abstract method in ShuffleWRiterBenchmarkBase, then pass transferTo to addBenchmarkCase but removes the need to pass the writer supplier? The superclass can just call such an abstract method in addBenchmarkCase. Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transferTo flag only applies to BypassMergeSortShuffleWriter and UnsafeShuffleWriter, but the SortShuffleWriter tests have other parameters, none of which are transferTo

@@ -50,12 +50,16 @@ done

echo "Running SPARK-25299 benchmarks"

SPARK_GENERATE_BENCHMARK_FILES=1 ./build/sbt "sql/test:runMain org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriterBenchmark"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for right now, but if this list grows beyond size 5, let's make a text file with all the classes that we need to benchmark, and then for-loop over it.

@@ -121,10 +121,23 @@ abstract class ShuffleWriterBenchmarkBase extends BenchmarkBase {
blockManager)
}

def addBenchmarkCase(benchmark: Benchmark, name: String)(func: Benchmark.Timer => Unit): Unit = {
def addBenchmarkCase(benchmark: Benchmark, name: String, size: Int,
writerSupplier: () => ShuffleWriter[String, String],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Start args on this line, then 1 arg per line, with 4-space indentation from def.

Copy link

@mccheah mccheah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a style change and then we're good to merge here.

@bulldozer-bot bulldozer-bot bot merged commit ddc9905 into spark-25299 Mar 18, 2019
@bulldozer-bot bulldozer-bot bot deleted the yh/add-benchmarks-and-ci-two-writers branch March 18, 2019 19:06
Copy link

@svc-spark-25299 svc-spark-25299 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

================================================================================================
BypassMergeSortShuffleWriter write
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite without spill:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without disk spill                      2              3           2          0.5        2048.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite with spill:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo                                 7369           7565         121          0.9        1098.1       1.0X
with transferTo                                    7515           7568          33          0.9        1119.8       1.0X

================================================================================================
SortShuffleWriter writer
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter without spills:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills                         10             16           5          0.1       10192.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter with spills:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
no map side combine                               14008          14157         133          0.5        2087.3       1.0X
with map side aggregation                         13852          13946          75          0.5        2064.1       1.0X
with map side sort                                13797          13984         142          0.5        2055.9       1.0X

================================================================================================
UnsafeShuffleWriter write
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter without spills:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills                         20             23           3          0.1       19926.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter with spills:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo                                15742          15791          40          0.9        1172.8       1.0X
with transferTo                                   15888          15994          67          0.8        1183.7       1.0X



Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants