
[SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the functionality of storage of received data #2940

Closed
wants to merge 7 commits

Conversation

@tdas (Contributor) commented Oct 25, 2014

As part of the initiative to prevent data loss on streaming driver failure, this JIRA tracks the subtask of implementing a ReceivedBlockHandler, that abstracts the functionality of storage of received data blocks. The default implementation will maintain the current behavior of storing the data into BlockManager. The optional implementation will store the data to both BlockManager as well as a write ahead log.
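A rough sketch of the interface this description implies (illustrative only: the type names below follow the PR discussion, but the signatures and helper types are simplified stand-ins, not the actual patch):

```scala
// Illustrative sketch of the ReceivedBlockHandler abstraction described above.
// StreamBlockId and ReceivedBlock are stand-ins for the real Spark types.
case class StreamBlockId(streamId: Int, uniqueId: Long)
trait ReceivedBlock

// Abstracts how a received block is persisted.
trait ReceivedBlockHandler {
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): Unit
}

// Default: keep the current behavior of storing only into the BlockManager.
class BlockManagerBasedBlockHandler(
    store: (StreamBlockId, ReceivedBlock) => Unit)
    extends ReceivedBlockHandler {
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): Unit =
    store(blockId, block)
}

// Optional: store into the BlockManager *and* a write-ahead log.
class WriteAheadLogBasedBlockHandler(
    store: (StreamBlockId, ReceivedBlock) => Unit,
    writeToLog: ReceivedBlock => Unit)
    extends ReceivedBlockHandler {
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): Unit = {
    store(blockId, block)
    writeToLog(block)
  }
}
```

The receiver-side code only sees the trait, so either implementation can be plugged in without callers changing.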

@tdas (Contributor, Author) commented Oct 25, 2014

@JoshRosen Please take a look. This is still not polished, and there might be comments missing, style issues, etc. I am putting this up to get early feedback.

One thing to note: ReceivedBlockInfo is being moved from streaming.scheduler to streaming.receiver. It is used in the developer API but is itself not exposed (it ideally should be). This is a point of discussion.

@tdas (Contributor, Author) commented Oct 25, 2014

@harishreedharan Please take a look as well.

@SparkQA commented Oct 25, 2014

Test build #22191 has started for PR 2940 at commit 95a4987.

  • This patch merges cleanly.

@SparkQA commented Oct 25, 2014

Test build #22191 has finished for PR 2940 at commit 95a4987.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22191/
Test FAILed.

@SparkQA commented Oct 27, 2014

Test build #22311 has started for PR 2940 at commit 18aec1e.

  • This patch merges cleanly.

@SparkQA commented Oct 27, 2014

Test build #22311 has finished for PR 2940 at commit 18aec1e.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22311/
Test FAILed.

@jerryshao (Contributor)

Hi TD, are you going to expose some store() API in Receiver that will directly use WriteAheadLogBasedBlockHandler to store blocks? It seems that, as it is now, the choice between the two implementations of ReceivedBlockHandler is made through configuration.

@tdas (Contributor, Author) commented Oct 28, 2014

@jerryshao No, there will be no new API in the Receiver. With the configuration change, the existing store() API will go through the new WriteAheadLogBasedBlockHandler.
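In other words, the flag alone decides which handler backs the existing store() path. A minimal sketch of that selection (here `conf` is a plain map standing in for SparkConf; only the configuration key is taken from the PR):

```scala
// Sketch: pick the handler implementation from a configuration flag.
// The key matches the one used in this PR; `conf` stands in for SparkConf.
def chooseHandler(conf: Map[String, String]): String = {
  val walEnabled =
    conf.getOrElse("spark.streaming.receiver.writeAheadLog.enable", "false").toBoolean
  if (walEnabled) "WriteAheadLogBasedBlockHandler"
  else "BlockManagerBasedBlockHandler"
}
```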

if (env.conf.getBoolean("spark.streaming.receiver.writeAheadLog.enable", false)) {
  if (checkpointDirOption.isEmpty) {
    throw new SparkException(
      "Cannot enable receiver write-ahead log without checkpoint directory set. " +
Contributor

A bit off topic (and we can deal with this later), but should we make the checkpoint directory a SparkConf setting? That way we could do this type of validation earlier on. Right now, unfortunately, we can't distinguish here whether the user didn't call checkpoint or whether there was just a bug somewhere in Spark code.
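As an illustration of the suggestion, with a purely hypothetical `spark.streaming.checkpoint.dir` key (not an existing Spark setting), the validation could run as soon as the configuration is read:

```scala
// Hypothetical: if the checkpoint directory lived in the conf (it does not today),
// validation could happen up front instead of failing deep inside the receiver path.
def validateCheckpointDir(conf: Map[String, String]): Either[String, String] =
  conf.get("spark.streaming.checkpoint.dir") match { // hypothetical key
    case Some(dir) if dir.nonEmpty => Right(dir)
    case _ =>
      Left("Cannot enable receiver write-ahead log without checkpoint directory set.")
  }
```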

Contributor Author

Good point. This is something that requires changes at both the Spark and Spark Streaming levels, and it probably needs further discussion, so it is deferred to the next release.

@SparkQA commented Oct 28, 2014

Test build #22397 has started for PR 2940 at commit 2f025b3.

  • This patch merges cleanly.

@SparkQA commented Oct 29, 2014

Test build #22397 has finished for PR 2940 at commit 2f025b3.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22397/
Test FAILed.

@tdas (Contributor, Author) commented Oct 29, 2014

@JoshRosen @pwendell Ready for another round of reviews


// For processing futures used in parallel block storing into block manager and write ahead log
implicit private val executionContext = ExecutionContext.fromExecutorService(
  Utils.newDaemonFixedThreadPool(2, this.getClass.getSimpleName))
Contributor

As I mentioned earlier, this might actually end up being a bottleneck. Since you could write using multiple threads in the same receiver, we are basically blocking more than one write from happening at any point in time. Since the BlockManager can handle more writes in parallel, we should probably use a much higher value than 2.

That said, the WAL writer would still be a bottleneck, since writes to the WAL have to be synchronized. So I am not entirely sure that having more than 2 threads helps a whole lot.

Contributor Author

There is good scope for future optimization here. We can always create multiple WALManagers and write in parallel to multiple WALs. That would improve performance, depending on where the bottleneck is.
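The pattern the quoted snippet sets up — storing to the block manager and the write-ahead log as two parallel futures on a two-thread pool — can be sketched standalone like this (the two effect functions are placeholders, not the real Spark calls):

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

// Two-thread pool, mirroring the snippet above; shut it down when done.
implicit val ec: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

// Kick off both writes concurrently and return only when both have completed.
def storeBoth[A, B](storeInBlockManager: () => A, writeToLog: () => B): (A, B) = {
  val inBlockManager = Future(storeInBlockManager())
  val inWriteAheadLog = Future(writeToLog())
  Await.result(inBlockManager.zip(inWriteAheadLog), 10.seconds)
}
```

With only two threads, one block's pair of writes occupies the whole pool, which is the serialization concern raised in the comment above.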

@SparkQA commented Oct 29, 2014

Test build #22399 has finished for PR 2940 at commit 33c30c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22399/
Test FAILed.

@tdas (Contributor, Author) commented Oct 29, 2014

Jenkins, test this.

@pwendell (Contributor)

I took another pass. The main thing blocking this for me is cleaning up the type signature so that it does not have Option[Any]. I made a proposal earlier and I stand by it. Do you see an issue with it?

@pwendell (Contributor)

The way this is now, we have to do runtime type checking in a bunch of places... I think it could be avoided with a fairly simple change.

@tdas (Contributor, Author) commented Oct 29, 2014

I mentioned this earlier in the original thread. This is a tradeoff between generality and type checking. I want the code in ReceiverSupervisorImpl and ReceivedBlockInfo to be agnostic to the implementation of the ReceivedBlockHandler that is in use. Otherwise there is no point in making this a pluggable interface, where we could plug in an implementation like CassandraBasedBlockHandler in the future.

To achieve that, these classes have to be agnostic to the exact return type of the ReceivedBlockHandler. In that case, there are three options, as far as I can see.

  1. Refer to the handler as ReceivedBlockHandler[Any]
  2. Refer to the handler as ReceivedBlockHandler[_]
  3. Give the return type a trait name (say ReceivedBlockHandlerStoreResult) and refer to the handler as ReceivedBlockHandler[ReceivedBlockHandlerStoreResult]

I don't see much advantage in 2 over 1. I thought of 3 as an option, but it gets weird for the default BlockManagerBasedBlockHandler: it always returns None and so does not really need the type, yet 3 would force a dummy type (one that extends ReceivedBlockHandlerStoreResult) to be defined.

Regarding runtime type checking, I actually removed it in the last update. There is no runtime type check in this PR; the ReceivedBlockInfo simply stores whatever is returned as a black box (called persistenceInfoOption). Later (in the next PR) a WriteAheadLogBackedBlockRDD will be created with this persistenceInfoOption, where it will be cast from Option[Any] to Option[WriteAheadLogFileSegment]. That's the only place where runtime type checking / casting will be used.

Thoughts?
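The deferred cast described above might look like the following sketch (WriteAheadLogFileSegment is modeled here as a stub with illustrative fields):

```scala
// Stub of the WAL segment metadata; the fields are illustrative.
case class WriteAheadLogFileSegment(path: String, offset: Long, length: Int)

// The black-box metadata travels as Option[Any]; only the consumer that knows
// which handler produced it narrows it back to the concrete segment type.
def toSegment(persistenceInfoOption: Option[Any]): Option[WriteAheadLogFileSegment] =
  persistenceInfoOption match {
    case Some(segment: WriteAheadLogFileSegment) => Some(segment)
    case _ => None
  }
```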

/** Trait that represents a class that handles the storage of blocks received by receiver */
private[streaming] trait ReceivedBlockHandler {

/** Store a received block with the given block id */
Contributor

Given that the Option[Any] has provoked so much discussion, maybe we should document the return type in this scaladoc (e.g. say that it's arbitrary metadata or something); currently, it's not clear what's being returned.

@SparkQA commented Oct 30, 2014

Test build #22516 has started for PR 2940 at commit df5f320.

  • This patch merges cleanly.

@tdas (Contributor, Author) commented Oct 30, 2014

For reference: I spoke with @pwendell and @JoshRosen offline, and we decided that a slightly modified version of suggestion 3 (in my earlier comment) is the best middle ground that addresses all the concerns. What I have done is add a trait, ReceivedBlockStoreResult. ReceivedBlockHandler.storeBlock returns a ReceivedBlockStoreResult object; its contents are of no concern to ReceiverSupervisorImpl and are simply passed on. Implementations of ReceivedBlockHandler all return ReceivedBlockStoreResult, so there is no generic typing. This keeps the complexity low while keeping the ReceiverSupervisorImpl code generic, and it addresses Patrick's concern that Option[Any] is non-intuitive.
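A simplified sketch of that design (the concrete result classes and their fields here are illustrative stand-ins, not the exact classes in the patch):

```scala
// Marker trait: opaque to ReceiverSupervisorImpl, which just passes it along.
trait ReceivedBlockStoreResult

// BlockManager-only storage has no extra metadata to report.
case class BlockManagerBasedStoreResult(blockId: String)
    extends ReceivedBlockStoreResult

// WAL-backed storage additionally records where the block landed in the log.
case class WriteAheadLogBasedStoreResult(blockId: String, walPath: String, offset: Long)
    extends ReceivedBlockStoreResult

// Every handler returns the common trait, so no generic type parameter is needed.
trait ReceivedBlockHandler {
  def storeBlock(blockId: String, data: Seq[Array[Byte]]): ReceivedBlockStoreResult
}
```

Compared with a generic ReceivedBlockHandler[T] or Option[Any], the common trait keeps call sites monomorphic while still letting each implementation attach its own metadata.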

@SparkQA commented Oct 30, 2014

Test build #22520 has started for PR 2940 at commit f192f47.

  • This patch merges cleanly.

@SparkQA commented Oct 30, 2014

Test build #22516 has finished for PR 2940 at commit df5f320.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22516/
Test FAILed.

@SparkQA commented Oct 30, 2014

Test build #22520 has finished for PR 2940 at commit f192f47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22520/
Test FAILed.

@tdas (Contributor, Author) commented Oct 30, 2014

Jenkins, test this please.

@SparkQA commented Oct 30, 2014

Test build #22529 has started for PR 2940 at commit f192f47.

  • This patch merges cleanly.

@SparkQA commented Oct 30, 2014

Test build #22529 has finished for PR 2940 at commit f192f47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22529/
Test FAILed.

@SparkQA commented Oct 30, 2014

Test build #22558 has started for PR 2940 at commit 78a4aaa.

  • This patch merges cleanly.

@tdas (Contributor, Author) commented Oct 30, 2014

@pwendell Please take a look, hopefully this change addresses your concerns.

@pwendell (Contributor)

LGTM - the new approach looks good.

@tdas (Contributor, Author) commented Oct 30, 2014

Thanks @pwendell and @JoshRosen for all the feedback. I am merging this.
Err, after the tests pass.

@SparkQA commented Oct 30, 2014

Test build #22558 has finished for PR 2940 at commit 78a4aaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22558/
Test PASSed.

@asfgit asfgit closed this in 234de92 Oct 30, 2014