[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received data either from BlockManager or WAL in HDFS #2931
Conversation
@JoshRosen Can you take a look?
Test build #22152 has started for PR 2931 at commit
Test build #22152 timed out for PR 2931 at commit
Test FAILed.
Jenkins, test this.
Test build #420 has started for PR 2931 at commit
Test build #420 has finished for PR 2931 at commit
The HdfsBackedRDDSuite is passing - not sure why there are some other failures. Maybe we are missing some cleanup?
val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
val locations = getBlockIdLocations()
locations.getOrElse(partition.blockId,
  HdfsUtils.getBlockLocations(partition.segment.path, hadoopConfiguration))
Can you explain how this code gets the block locations of the segment of the file that the partition needs? The offsets don't seem to be passed on to HdfsUtils.getBlockLocations.
Fixed this one in the PR sent to your repo.
Fixed.
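[Editor's note] For context, a minimal sketch of what the fixed lookup might look like, using Hadoop's FileSystem API to restrict locality to the HDFS blocks overlapping the segment's byte range. Object and parameter names mirror the snippets quoted in this thread; treat the exact shape as illustrative, not the merged code.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object HdfsUtils {
  // Return the hosts holding the HDFS blocks that overlap the byte range
  // [offset, offset + length) of the file, so locality is computed for the
  // segment the partition actually reads, not for the whole file.
  def getFileSegmentLocations(
      path: String,
      offset: Long,
      length: Long,
      conf: Configuration): Seq[String] = {
    val hdfsPath = new Path(path)
    val fs = hdfsPath.getFileSystem(conf)
    val fileStatus = fs.getFileStatus(hdfsPath)
    val blockLocations = Option(fs.getFileBlockLocations(fileStatus, offset, length))
    blockLocations.map(_.flatMap(_.getHosts).toSeq).getOrElse(Seq.empty)
  }
}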
// Hadoop Configuration is not serializable, so broadcast it wrapped in a SerializableWritable.
val broadcastedHadoopConf = sc.broadcast(new SerializableWritable(hadoopConfiguration))
Does it make sense to take the SerializableWritable as the argument in the constructor (as is being done in #2935), or should we just take the hadoopConf and wrap it in the SerializableWritable once that is merged? We don't want to change the interface later.
For now I am leaving this as is. Let's revisit this later if needed.
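[Editor's note] For reference, a minimal sketch of the pattern under discussion (the enclosing class here is hypothetical). Hadoop's Configuration is not java.io.Serializable, so it cannot be captured directly by task closures; wrapping it in Spark's SerializableWritable, which serializes via Hadoop's Writable machinery, and broadcasting it ships one copy per executor instead of one per task.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}

class WalRddSketch(sc: SparkContext, hadoopConfiguration: Configuration) {
  // Broadcast the wrapped Configuration once from the driver.
  val broadcastedHadoopConf =
    sc.broadcast(new SerializableWritable(hadoopConfiguration))

  // On the executor side, unwrap twice: the broadcast value, then the Writable.
  def hadoopConfOnExecutor: Configuration = broadcastedHadoopConf.value.value
}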
I left a pass of fairly shallow style comments; I'll loop back later to offer more substantive feedback and to actually check that I understand this logic.
Make sure getBlockLocations uses offset and length to find the blocks on...
Test build #22189 has started for PR 2931 at commit
Test build #22189 has finished for PR 2931 at commit
Test FAILed.
Shutdown spark context after tests. Formatting/minor fixes
Test build #22300 has started for PR 2931 at commit
Test build #22300 has finished for PR 2931 at commit
This looks good to me.
@transient override val blockIds: Array[BlockId],
@transient val segments: Array[WriteAheadLogFileSegment],
val storeInBlockManager: Boolean,
val storageLevel: StorageLevel
nitpick: the common style in Spark is
val storageLevel: StorageLevel)
  extends BlockRDD[T](sc, blockIds) {
@harishreedharan / @tdas I made a few more comments. Most are just nits that I've left earlier.
Thanks @rxin. Updates coming soon.
@rxin I updated. The only part I am not in agreement on is the preferred-location logic.
Test build #22521 has started for PR 2931 at commit
Apart from the readability, does one have a performance benefit over the other?
Test build #22521 has finished for PR 2931 at commit
Test PASSed.
@harishreedharan I don't think so. The block location lookup is called only once in both, and the HDFS location lookup is called only once and only if required. I don't think there is any performance difference between these two implementations.
def blockLocations = getBlockIdLocations().get(partition.blockId)
def segmentLocations = HdfsUtils.getFileSegmentLocations(
  partition.segment.path, partition.segment.offset, partition.segment.length, hadoopConfig)
blockLocations.orElse(segmentLocations).getOrElse(Seq.empty)
Yeah, it's not ideal; I think the easiest to understand is something like

if (blockLocations.isDefined) {
  blockLocations.get
} else if (segmentLocations.isDefined) {
  segmentLocations.get
} else {
  Seq.empty
}

but this isn't too bad if the above isn't possible.
Actually, this discussion is moot because we should just let getFileSegmentLocations return Seq[String] rather than Option[Seq[String]]; then this should only consist of two branches, accomplishable with a single getOrElse.
This is the final version I am going with, then:

val blockLocations = getBlockIdLocations().get(partition.blockId)
def segmentLocations = HdfsUtils.getFileSegmentLocations(...)
blockLocations.getOrElse(segmentLocations)
Correct. Once we make that change, I think both the getOrElse and the if..else solutions are equivalent - one is the Scala way of doing things, and the other is the "traditional" way. The one using def/lazy val is really the more Scala way of doing it.
I have no preference for any one method, but I would generally consider the overhead and performance incurred by each, and I am not enough of a Scala expert to know.
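[Editor's note] Putting the thread together, a sketch of the agreed-upon form, with identifiers taken from the snippets above and getFileSegmentLocations now returning Seq[String]. The def makes segmentLocations lazy, which is why the performance concern raised above does not apply: the HDFS call happens only when the fallback is actually taken.

override def getPreferredLocations(split: Partition): Seq[String] = {
  val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
  // Eager: a single BlockManager lookup, always needed.
  val blockLocations = getBlockIdLocations().get(partition.blockId)
  // Lazy: evaluated only if getOrElse needs the fallback, so the HDFS
  // round trip is skipped when the block is still in the BlockManager.
  def segmentLocations = HdfsUtils.getFileSegmentLocations(
    partition.segment.path, partition.segment.offset,
    partition.segment.length, hadoopConfig)
  blockLocations.getOrElse(segmentLocations)
}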
@tdas you also missed one other comment ...
@rxin, crap, I missed that. Personally I find the parenthesis ending on the next line more logical, as braces always end on the next line. But I will do it in the interest of consistency.
Test build #22537 has started for PR 2931 at commit
Test build #22537 has finished for PR 2931 at commit
Test PASSed.
Test build #22562 has started for PR 2931 at commit
Alright! I think we have converged on the best solution here. I am going to wait for the tests to pass and then merge. Thanks @rxin and @JoshRosen for all the feedback!
Test build #22562 has finished for PR 2931 at commit
Test PASSed.
As part of the initiative to prevent data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can either read data from Spark's BlockManager, or read the data from file segments in the write-ahead log in HDFS.
Most of this code has been written by @harishreedharan.
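[Editor's note] A condensed sketch of that read path. Helper names such as WriteAheadLogRandomReader follow the Spark 1.x streaming internals, but the signatures here are approximations for illustration, not the merged code.

override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
  val blockManager = SparkEnv.get.blockManager
  blockManager.get(partition.blockId) match {
    case Some(block) =>
      // Fast path: the receiver's block is still available in the BlockManager.
      block.data.asInstanceOf[Iterator[T]]
    case None =>
      // Fallback: re-read the serialized block from its WAL segment in HDFS.
      val reader = new WriteAheadLogRandomReader(partition.segment.path, hadoopConf)
      val bytes = try reader.read(partition.segment) finally reader.close()
      if (storeInBlockManager) {
        // Optionally cache the recovered block for later reads; duplicate the
        // buffer so deserialization below still sees the original position.
        blockManager.putBytes(partition.blockId, bytes.duplicate(), storageLevel)
      }
      blockManager.dataDeserialize(partition.blockId, bytes).asInstanceOf[Iterator[T]]
  }
}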