[SPARK-7884] Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader #6423
Conversation
@massie mind associating this with a JIRA? It's not a huge code change, but I wouldn't really classify it as trivial.
@@ -105,7 +105,8 @@ private[spark] class FileShuffleBlockResolver(conf: SparkConf)
   * when the writers are closed successfully
   */
  def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
      writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
      writeMetrics: ShuffleWriteMetrics,
      getDiskWriter: (BlockId, File, SerializerInstance, Int, ShuffleWriteMetrics) => BlockObjectWriter = blockManager.getDiskWriter): ShuffleWriterGroup = {
This isn't a Java-friendly interface, which is going to be a problem for new shuffle code that I'm working on.
Thanks for pointing that out, @JoshRosen. I'll update this to be more Java-friendly tomorrow. In the meantime, feel free to make suggestions on how to customize the creation of a BlockObjectWriter when calling forMapTask().
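For illustration only, here is one direction the Java-friendliness fix could take: replace the Scala function-typed default parameter with a single-method factory. The trait name, method name, and parameters below are assumptions for the sketch, not code from this PR; a pure-abstract Scala trait compiles to a plain Java interface, so Java-based shuffle code could implement it directly.

import java.io.File
import org.apache.spark.executor.ShuffleWriteMetrics
import org.apache.spark.serializer.SerializerInstance
import org.apache.spark.storage.{BlockId, BlockObjectWriter}

// Hypothetical single-abstract-method factory for creating disk writers.
trait DiskWriterFactory {
  def createWriter(
      blockId: BlockId,
      file: File,
      serializer: SerializerInstance,
      bufferSize: Int,
      metrics: ShuffleWriteMetrics): BlockObjectWriter
}

forMapTask(...) could then accept a DiskWriterFactory (with a default implementation backed by blockManager.getDiskWriter) instead of a Scala function type with a default argument.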
Test build #33549 has finished for PR 6423 at commit
This is the minimal set of changes to the internal APIs that I needed to make in order to completely abstract the Parquet Shuffle Manager.
@@ -29,6 +30,16 @@ import org.apache.spark.serializer.{SerializerInstance, Serializer}
 import org.apache.spark.util.{CompletionIterator, Utils}

 /**
  * Factory class that creates iterators to read records from fetched blocks
  */
 trait BlockRecordIteratorFactory {
Is this class intended to be public? It might be okay as a developer API but I don't think that we should commit to making any of the shuffle internals as stable public interfaces.
Also, I think that any new shuffle interfaces should be implemented as Java interfaces, not Scala traits.
This was meant to be a developer API. I can update this to be a Java interface.
Our shuffle code is extremely hard to understand in its current form, and I'm hesitant to introduce new interfaces / extension points until we clean up and document the existing code, so I'd like to see a better standalone description of the changes in this patch before I review it.
I agree that the shuffle code is extremely hard to understand. I certainly don't want to make matters worse, which is why I tried to make the minimal changes necessary. I'm not exactly clear on what you mean when you say that we need to clean up and document the existing shuffle code before this can be reviewed. Can you point me to the PR or branch where that work is being done?
@@ -17,22 +17,21 @@

 package org.apache.spark.shuffle.hash

 import org.apache.spark.storage._
Style nit: import ordering (all of the Spark imports should be grouped and alphabetized).
Force-pushed from becdc81 to 3b32099
I updated this pull request to only address the read path in order to make it easier to review. There are no new interfaces added and I believe this approach will be cleaner and easier to reason about for others working on the shuffle code.
Test build #33665 has finished for PR 6423 at commit
I'm fixing the Scala style checks now.
Force-pushed from 3b32099 to 87323dc
      readMetrics.incRecordsRead(1)
      delegate.next()
    }
  }.asInstanceOf[Iterator[Nothing]]
This asInstanceOf is necessary but ugly. The alternative, I believe, would be to move this into a method with a generic type, e.g.:

def newInterruptibleIterator[T](context: ..., completionIter: ...) = {
  new InterruptibleIterator[T](context, completionIter) {
    val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
    override def next(): T = {
      readMetrics.incRecordsRead(1)
      delegate.next()
    }
  }
}
Let me know which approach you prefer.
This version is nice and short, but it does make it a bit hard to follow the types. What do you think of more explicit casting in each branch, to make it clearer what is going on? E.g.:
// Update read metrics for each record materialized
val iter = new InterruptibleIterator[(Any, Any)](context, recordIterator) {
  val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
  override def next(): (Any, Any) = {
    readMetrics.incRecordsRead(1)
    delegate.next()
  }
}

val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
  if (dep.mapSideCombine) {
    // we are reading values that are already combined
    val combinedKeyValuesIterator = iter.asInstanceOf[Iterator[(K, C)]]
    new InterruptibleIterator(context,
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context))
  } else {
    // we don't know the value type, but also don't care -- the dependency *should*
    // have made sure its compatible w/ this aggregator, which will convert the value
    // type to the combined type C
    val keyValuesIterator = iter.asInstanceOf[Iterator[(K, Nothing)]]
    new InterruptibleIterator(context,
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context))
  }
  ...
This is just an idea... I'm not entirely convinced myself.
The old BlockStoreShuffleFetcher.fetch(...) returned Iterator[Nothing], which is why the current casting is necessary -- but it's ugly.
I like your idea of casting in the branches to help make the types more explicit. I'll make the change now unless you'd like more time to think about it.
Yes, I completely realize that this is not a new problem you are introducing. I think I've been confused by the types in the old code a couple of times in the past. I was just thinking: as long as you are touching this, rather than making the bad code slightly worse, maybe we can make it slightly better :P
@squito I like the cut of your jib. Pushing the update now.
Test build #33666 has finished for PR 6423 at commit
The tail of the Jenkins console was...
It looks like all the *ShuffleSuite tests passed but there was an unrelated YARN error?
Jenkins, test this please.
Test build #33676 has finished for PR 6423 at commit
@JoshRosen I hope that my changes address all your concerns. There are no new interfaces or extension points, so this change shouldn't complicate the shuffle code you're working on. The only change here is to move deserialization into the ShuffleReader.
    // so we don't release it again in cleanup.
    // Once the single-element (is0) iterator is exhausted, release the buffer so that we
    // don't release it again in cleanup.
    CompletionIterator[InputStream, Iterator[InputStream]](Iterator(is0), {
I'm actually a bit confused about the Try[Iterator[InputStream]] here. In the old code, we had next() return a (BlockId, Try[Iterator[Any]]) which, if the fetch was successful, would contain an iterator of the elements in that individual block. Here, it looks like we're now returning a single-element iterator that contains an InputStream. I think that this is confusing for consumers of this class, since the public Try[Iterator[InputStream]] signature might lead them to believe that they have to handle the possibility of multiple input streams being returned. In fact, this is inconsistent with the class-level Scaladoc above, which says that this returns an iterator of "(BlockID, InputStream)".
It sounds like the motivation for returning an iterator here was to try to ensure proper release of the buffer. I'd like to understand if there's a cleaner way to do this, though.
Just to explore options, what if we returned buf (which is a ManagedBuffer) instead of returning an iterator from it? This would push the cleanup obligations to the caller, who might be in a better position to handle them.
On the other hand, I guess there aren't really any useful methods to call on ManagedBuffer besides createInputStream() and release(). If we have a method that returns ManagedBuffer, then that sort of implicitly takes the ManagedBuffer interface and makes it subject to this method's API stability guarantees (of which there aren't any yet, because this is private[spark]...).

What if we returned an InputStream that was wrapped such that calling close() on it would release the underlying buffer? This avoids exposing ManagedBuffer to the higher-level code and makes cleanup easier to reason about. If we take this approach, it might be good to add a comment somewhere to mention that the caller should ensure that the input stream is eventually closed.
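A minimal sketch of the wrapper being suggested here, assuming ManagedBuffer's retain()/release() reference counting; the class name is hypothetical and this is not the code that was ultimately committed.

import java.io.{FilterInputStream, InputStream}
import org.apache.spark.network.buffer.ManagedBuffer

// Hypothetical wrapper: closing the stream also releases the underlying ManagedBuffer,
// so higher-level code never has to see ManagedBuffer at all.
private class BufferReleasingInputStream(delegate: InputStream, buf: ManagedBuffer)
  extends FilterInputStream(delegate) {

  private var closed = false

  override def close(): Unit = {
    if (!closed) {        // idempotent close, so callers and cleanup hooks can both call it
      closed = true
      delegate.close()
      buf.release()       // release the buffer exactly once
    }
  }
}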
I like the idea of returning an InputStream that requires callers to call close(), and dropping the use of the single-element CompletionIterator. I'll make that change now.
Sorry, jumping in a little late here -- but I think the caller doesn't have to call close(), because it's already getting called in a task completion listener here, which is more robust. It doesn't hurt for the user to call it anyway (another reason why close() needs to be idempotent), but otherwise I think it's really important that this close gets called by the framework, or else it's really easy to have a leak.
(All that said, it looks like it's still fine after the patch in its current form.)
Force-pushed from 9187e17 to b12f912
@@ -313,6 +303,33 @@ final class ShuffleBlockFetcherIterator(
    }
  }

  // Helper class that ensures a ManagedBuffer is released on InputStream.close()
  private class WrappedInputStream(
      delegate: InputStream, buf: ManagedBuffer, var currentResult: FetchResult)
Style nit: the parameters here need to be wrapped differently; see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Why do we need to pass currentResult here? It's not obvious on first read, and I'm worried that someone might remove it by accident unless we add a comment explaining why we need to keep it.
Also, note that the original code would end up mutating the currentResult field in ShuffleBlockFetcherIterator, whereas the currentResult = null here only affects the field in this WrappedInputStream class and leaves the ShuffleBlockFetcherIterator's field untouched.
I looked at the style guide and used the recommendation for def parameters (which are aligned with the definition). I didn't see that I should use the Scala Style Guide; I'll fix this. Thanks for pointing it out.
Test build #33777 has finished for PR 6423 at commit
      currentResult = null
      buf.release()
    })
    Try(buf.createInputStream()).map { inputStream =>
We might also treat this patch as an opportunity to revisit why we're using Try here. It might be fine to keep Try as the return type, but I'm not necessarily convinced that we should be calling Try.apply() here, since I think it obscures whether we'll need to perform any cleanup after errors (for instance, do we need to free buf? Is buf guaranteed to be non-null, or could this fail with an NPE on the buf.createInputStream() call?). I feel that the Try.apply() makes it easy to overlook these concerns.
The ShuffleBlockFetcherIterator has no information about server statuses from the map output tracker, shuffle IDs, etc. Using Try allows the BlockStoreShuffleFetcher to reformat exceptions as a FetchFailedException, which is the right exception to return to the scheduler.
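To make the point concrete, a rough sketch of the pattern described above: the caller that knows the shuffle topology unwraps the Try and converts failures. The helper name and the FetchFailedException constructor arguments are assumptions (and FetchFailedException is private[spark], so something like this only compiles inside Spark itself); this is not the actual BlockStoreShuffleFetcher code.

import java.io.InputStream
import scala.util.{Failure, Success, Try}
import org.apache.spark.shuffle.FetchFailedException
import org.apache.spark.storage.{BlockManagerId, ShuffleBlockId}

// Hypothetical helper: unwrap the fetcher's Try, turning failures into the
// FetchFailedException that the scheduler uses to trigger a map-stage retry.
def unwrapFetchResult(
    blockId: ShuffleBlockId,
    address: BlockManagerId,
    result: Try[InputStream]): InputStream = result match {
  case Success(stream) => stream
  case Failure(e) =>
    throw new FetchFailedException(
      address, blockId.shuffleId, blockId.mapId, blockId.reduceId, e)
}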
This is technically not your change -- but do you know what happens if the stream is not consumed in full in a task? Does that lead to memory leaks because close on the stream is never called?
We don't need to worry about a memory leak when the task exits with success or failure since there is a cleanup method registered with the task context, e.g.
// Add a task completion callback (called in both success case and failure case) to cleanup.
context.addTaskCompletionListener(_ => cleanup())
However, you're correct that there would be a memory (and file handle) leak if the InputStream isn't closed in the ShuffleReader. This PR prevents that since serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator returns a NextIterator, which closes the stream when the last record is read.
To be more defensive and potentially simplify the code, it might make sense to have a call to ShuffleBlockFetcherIterator.next() not just return the next InputStream but also close() the last one. This would prevent callers from having more than one InputStream open at a time, but I don't think we want that anyway?
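For what that suggested follow-up might look like, a self-contained sketch of an iterator that closes the previously returned stream each time next() is called; the class name is hypothetical and this is not part of this PR.

import java.io.InputStream

// Hypothetical wrapper: at most one InputStream handed out by this iterator is ever
// open at a time, because next() closes the stream it returned previously.
class SingleOpenStreamIterator[K](underlying: Iterator[(K, InputStream)])
  extends Iterator[(K, InputStream)] {

  private var lastStream: InputStream = null

  override def hasNext: Boolean = underlying.hasNext

  override def next(): (K, InputStream) = {
    if (lastStream != null) {
      lastStream.close()   // close the previously returned stream before handing out a new one
    }
    val elem = underlying.next()
    lastStream = elem._2
    elem
  }
}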
@rxin Let me know if you'd like this change to be made to ShuffleBlockFetcherIterator.
We might consider deferring this change to a followup PR; want to file a JIRA issue so that we don't forget to eventually follow up?
Will do.
Test build #35095 has finished for PR 6423 at commit
@massie thanks for updating the tests. It's still a little concerning to me that we don't explicitly check that the iterator returned from HashShuffleReader releases all memory. In theory, someone could change HashShuffleReader to not use asKeyValueIterator (and not use a NextIterator) such that the stream didn't get closed, and we wouldn't have any unit tests for that (so we could end up with a memory leak). I realize it's annoying to add a test for that, though, and it sounds like @rxin and @JoshRosen are fine with this as-is; given that, I'm fine to sign off as well.
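One way such a check could be written: a sketch of a test double along the lines of the RecordingManagedBuffer mentioned later in the thread. The field names and the exact assertions here are assumptions, not the actual test code.

import java.io.{ByteArrayInputStream, InputStream}
import java.nio.ByteBuffer
import org.apache.spark.network.buffer.ManagedBuffer

// Hypothetical test double that counts retain()/release() calls; a test can drain the
// reader's iterator and then assert that release() was called on every buffer it
// handed to the fetcher.
class RecordingManagedBuffer(data: Array[Byte]) extends ManagedBuffer {
  var retainCount = 0
  var releaseCount = 0

  override def size(): Long = data.length
  override def nioByteBuffer(): ByteBuffer = ByteBuffer.wrap(data)
  override def createInputStream(): InputStream = new ByteArrayInputStream(data)
  override def retain(): ManagedBuffer = { retainCount += 1; this }
  override def release(): ManagedBuffer = { releaseCount += 1; this }
  override def convertToNetty(): AnyRef = ByteBuffer.wrap(data)
}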
@kayousterhout, thanks for reviewing this PR. I agree that we should be more defensive moving forward. I think the right next step (no pun intended) is to update ShuffleBlockFetcherIterator.next to close the previous stream when returning a new one. That would prevent any ShuffleReader from leaking memory or file handles. @rxin I just pushed an update that I believe addresses all your comments. Let me know if I missed anything.
Test build #35176 has finished for PR 6423 at commit
Jenkins, retest this please.
Test build #35185 has finished for PR 6423 at commit
I talked to @kayousterhout a bit more offline. I think it is actually pretty important to have the same level of test coverage, since this code is very important and it would be easy to introduce bugs in the future. I don't think it should be super hard -- just need some mocking.
I'll make the updates first thing next week.
@kayousterhout @rxin I just pushed an update which adds a test specifically to validate that the underlying buffers are released.
Test build #35492 has finished for PR 6423 at commit
@kayousterhout @rxin All the tests passed. Let me know if you'd like any more changes made and I'll get to them immediately.
This reverts commit f98a1b9.
Thanks. @kayousterhout is looking at this and will merge it.
Proposal for different unit test
Test build #35608 has finished for PR 6423 at commit
Derp, this is my fault... I just forgot to mark RecordingManagedBuffer as private[spark]!
No worries. I'll fix it.
Test build #35609 has finished for PR 6423 at commit
I will do a final pass on this and then merge tomorrow!
Thanks, @kayousterhout.
@kayousterhout Is this still on track for merging today? Let me know if you see anything else that needs to be done.
@kayousterhout I'm happy to squash the commits and rebase them on master.
Rebasing or squashing shouldn't be necessary. Sent from my phone.
Thanks for all of your work on this @massie! This is now merged into master.
This commit updates the shuffle read path to give ShuffleReader implementations more control over the deserialization process.
The BlockStoreShuffleFetcher.fetch() method has been renamed to BlockStoreShuffleFetcher.fetchBlockStreams(). Previously, this method returned a record iterator; now, it returns an iterator of (BlockId, InputStream). Deserialization of records is now handled in the ShuffleReader.read() method.
This change creates a cleaner separation of concerns and allows implementations of ShuffleReader more flexibility in how records are retrieved.
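As a rough illustration of the flow described above (the helper name and parameters are assumptions, not the committed code), a reader can now consume the (BlockId, InputStream) pairs and deserialize them itself:

import java.io.InputStream
import org.apache.spark.serializer.SerializerInstance
import org.apache.spark.storage.BlockId

// Hypothetical helper: deserialize the fetched streams lazily; the key/value iterator
// produced by asKeyValueIterator closes each stream once it is exhausted.
def deserializeBlocks(
    blockStreams: Iterator[(BlockId, InputStream)],
    serializerInstance: SerializerInstance): Iterator[(Any, Any)] = {
  blockStreams.flatMap { case (_, inputStream) =>
    serializerInstance.deserializeStream(inputStream).asKeyValueIterator
  }
}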