
[SPARK-10722] RDDBlockId not found in driver-heartbeater #13222

Closed

Conversation

simonjscott

What changes were proposed in this pull request?

Ensure that the deserialization of TaskMetrics uses a ClassLoader that knows about RDDBlockIds. The problem occurs only very rarely, since it depends on which thread of the thread pool is used for the heartbeat.

I observe that the code in question has been largely rewritten for v2.0.0 of Spark, and the problem no longer manifests there. However, it seems reasonable to fix this for those users who need to continue with the 1.6 version for some time yet, hence I have created a fix for the 1.6 code branch.
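
For illustration, a minimal sketch of the kind of change involved, written as a fragment in the context of Spark's own heartbeat code (so the Spark-internal `Utils` helpers are in scope); the exact code in the 1.6 branch may differ, and `metrics` here simply stands for the TaskMetrics instance being copied:

```scala
// Spark-internal context: the local-mode heartbeat copies TaskMetrics by
// round-tripping it through Java serialization (Utils is private[spark]).
import org.apache.spark.executor.TaskMetrics
import org.apache.spark.util.Utils

// Before: the single-argument deserialize leaves class resolution to whatever
// loader the JVM picks for the calling thread, which may not see RDDBlockId.
val copiedBefore = Utils.deserialize[TaskMetrics](Utils.serialize(metrics))

// After: resolve classes through a loader that is known to be able to load
// Spark classes such as RDDBlockId.
val copiedAfter = Utils.deserialize[TaskMetrics](
  Utils.serialize(metrics), Utils.getContextOrSparkClassLoader)
```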

How was this patch tested?

Due to the nature of the problem, a reproducible test case is difficult to produce. This problem was causing our application's nightly integration tests to fail randomly. Since applying the fix, the tests have not failed due to this problem for nearly six weeks.

@srowen
Member

srowen commented May 20, 2016

Jenkins test this please

@SparkQA

SparkQA commented May 20, 2016

Test build #58981 has finished for PR 13222 at commit 9ea4dca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@simonjscott
Author

Apologies. Please bear with me, as I am new around here, and of course the change built just fine for me. Looking at the changed file above, it really does seem innocent.

I am unsure how to proceed. The only error I can see in the Jenkins output is this:
[error] (streaming-flume-sink/*:update) java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = io.netty#netty;3.4.0.Final

which I'm afraid doesn't help me understand what I should have done better. Thanks

@srowen
Member

srowen commented May 20, 2016

That looks like an unrelated failure.

@srowen
Member

srowen commented May 20, 2016

Jenkins add to whitelist

@srowen
Member

srowen commented May 20, 2016

Jenkins retest this please

@SparkQA

SparkQA commented May 20, 2016

Test build #58985 has finished for PR 13222 at commit 9ea4dca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@simonjscott have you verified that before this fix we still see the exception, and after this fix we no longer do?

@simonjscott
Author

@andrewor14 Yes, I believe so. I first encountered this exception when we deployed a new nightly integration test for our application. The test failed with this exception on six of the first seven nights. After investigating and deploying this fix in a locally built Spark library, the exception has not recurred at all for the past six weeks.

@srowen
Member

srowen commented May 22, 2016

@andrewor14 it seems pretty surgical and could plausibly fix this. Tests pass and @simonjscott is reasonably sure it's the right fix. Are you OK merging?

@andrewor14
Contributor

andrewor14 commented May 23, 2016

I can merge it, but I don't really understand why this fixes the problem (or what caused the problem in the first place).

@simonjscott
Author

@andrewor14 Allow me to try to clarify my reasoning…

We have reports from at least three users of an exception occurring unpredictably (and not reproducible on demand) while running Spark applications with a “local” master URL.

Inspection of the stack trace shows that the class RDDBlockId cannot be loaded while deserializing TaskMetrics. The deserialization is part of a serialize/deserialize round trip used (in “local” mode only) to make a copy of the TaskMetrics. Since everything is running in the same JVM, we can reasonably assume that the classpath is properly configured; otherwise, how could the RDDBlockId have been created and placed in the TaskMetrics in the first place?

Yet the stack trace tells us that the classloader assigned to the heartbeater thread is not able to load RDDBlockId.

So my fix is to tell deserialize() to use a specific classloader (obtained from Utils.getContextOrSparkClassLoader) that is able to load RDDBlockId. A nice property of the fix is that it is entirely localised to the call to deserialize(): there can be no side effects outside of that call.
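
To make the mechanism concrete, this is roughly what a loader-aware deserialize looks like; it is a sketch of the general Java-serialization pattern (the object and method names here are illustrative, not Spark's), not necessarily Spark's exact implementation. Every class name read from the stream is resolved through the supplied loader:

```scala
import java.io.{ByteArrayInputStream, ObjectInputStream, ObjectStreamClass}

object LoaderAwareDeserialize {
  // Resolve every class named in the serialized stream through the given loader,
  // so a class like RDDBlockId is found even when the calling thread's default
  // resolution path cannot see it.
  def deserializeWith[T](bytes: Array[Byte], loader: ClassLoader): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
      override def resolveClass(desc: ObjectStreamClass): Class[_] =
        Class.forName(desc.getName, false, loader)
    }
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```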

Why is the exception so unpredictable? My original conjecture was that the “heartbeat” thread is drawn from a pool, and so may be a different actual thread instance on each tick of the heartbeat. Everything works until, by chance, the thread for a given tick is not configured with the right class loader. Another possibility is that RDDBlockIds are rare, so most of the time the heartbeat thread simply does not encounter them.
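
As a small illustration of that conjecture (a generic sketch, not Spark code, and the object name is made up): the context ClassLoader is a per-thread attribute, so worker threads drawn from a pool each carry their own loader, and nothing forces every worker to be configured the same way.

```scala
import java.util.concurrent.Executors

object ContextLoaderPerThread {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    for (_ <- 1 to 4) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          val t = Thread.currentThread()
          // Each worker reports whatever context loader it happens to carry.
          println(s"${t.getName} -> ${t.getContextClassLoader}")
        }
      })
    }
    pool.shutdown()
  }
}
```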

I hope that this is helpful.

@srowen
Member

srowen commented May 26, 2016

Merged to 1.6

asfgit pushed a commit that referenced this pull request May 26, 2016

Author: Simon Scott <[email protected]>

Closes #13222 from simonjscott/fix-10722.
@simonjscott simonjscott reopened this May 26, 2016
@srowen
Member

srowen commented May 26, 2016

Thanks @simonjscott, but yes, you will actually have to close this manually; the bot won't do it and we can't.

@simonjscott
Author

@srowen Yes, but when I closed it there was a message about unmerged commits, so I re-opened it. Should I close it anyway?

@srowen
Member

srowen commented May 26, 2016

It is merged (see 5cc1e2c referenced above). GitHub doesn't quite know that; it always says there are unmerged commits.

@simonjscott
Author

Thank you for the reassurance. Closing...

zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 26, 2016

Author: Simon Scott <[email protected]>

Closes apache#13222 from simonjscott/fix-10722.

(cherry picked from commit 5cc1e2c)