Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem #3785

Closed
wants to merge 11 commits into from

Conversation

YanTangZhai
Copy link
Contributor

Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24765 has started for PR 3785 at commit 9ca6541.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24765 has finished for PR 3785 at commit 9ca6541.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24765/
Test PASSed.

@JoshRosen
Copy link
Contributor

On the surface, this seems like an okay change. I wonder whether this retry logic could have unexpected consequences. Let me try to reason it out:

  • askTracker is only called with GetMapOutputStatuses.
  • In the master actor, it calls getSerializedMapOutputStatuses. This method never throws exceptions: if a shuffle is missing, then it just stores an empty array and serializes it.
  • It's possible that the serialized map statuses could exceed the Akka frame size (although extremely unlikely and perhaps impossible with the new output status compression techniques). In this case, though, the master would throw an exception and fail to send a reply back to the asker. In this case, with this patch we'd end up performing a bunch of retries for an operation that will ultimately fail, so we'll take longer to detect a failure.

In the common cases, though, this seems fine, even if the map output statuses are missing (since it won't introduce a bunch of futile retries). Therefore, I think we should pull this in; I don't know if this fixes an actual bug, but it seems like it could make things more robust.

@JoshRosen
Copy link
Contributor

Alright, I'm going to merge this into master (1.3.0). Thanks!

@asfgit asfgit closed this in 815de54 Dec 29, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants