[SPARK-10722] RDDBlockId not found in driver-heartbeater #13222
Conversation
Jenkins test this please
Test build #58981 has finished for PR 13222 at commit
Apologies. Please accept that I am new around here, and of course the change built just fine for me. Looking at the changed file above, it really does seem innocent. I am unsure how to proceed. The only error I can see in the Jenkins output is this: which I'm afraid doesn't help me understand what I should have done better. Thanks
That looks like an unrelated failure.
Jenkins add to whitelist
Jenkins retest this please |
Test build #58985 has finished for PR 13222 at commit
@simonjscott have you verified that before this fix we still see the exception, and after this fix we no longer do?
@andrewor14 yes I believe so. I first encountered this exception when we deployed a new nightly integration test for our application. The test failed with this exception for six of the first seven nights. After investigation and deploying this fix in a locally-built spark lib, the exception has not recurred at all for the past six weeks.
@andrewor14 it seems pretty surgical and could plausibly fix this. Tests pass and @simonjscott is reasonably sure it's the right fix. Are you OK merging?
I can merge it, but I just don't really understand why this fixes the problem (or what the problem was caused by in the first place) |
@andrewor14 Allow me to try to clarify my reasoning…

We have reports from at least three users of an exception occurring unpredictably (and not reproducible on demand) while running Spark applications using a “local” master url. Inspection of the stack trace shows that the class RDDBlockId cannot be loaded while deserializing TaskMetrics. The deserialization is part of a serialize/deserialize combo used (in “local” mode only) to make a copy of the TaskMetrics.

Now since everything is running in the same JVM we can reasonably assume that the classpath is properly configured, otherwise how could the RDDBlockId have been created and present in the TaskMetrics in the first place? Yet the stack trace tells us that the classloader assigned to the heartbeater thread is not able to load RDDBlockId. So my fix is to tell the deserialize() to use a specific classloader (defined by Utils.getContextOrSparkClassLoader) that is able to load RDDBlockId. A nice property of the fix is that it is entirely localised to the call to deserialize() – there can be no side-effects outside of that call.

Why is the exception so unpredictable? My original conjecture was that the “heartbeat” thread is drawn from a pool and so may be a different actual thread instance on each tick of the heartbeat. And so everything will work until by some chance the actual thread for a given tick is not configured with the right class loader. Another possibility is that RDDBlockIds may be rare and most of the time the heartbeat thread doesn’t encounter them.

I hope that this is helpful.
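The pooled-thread conjecture above can be illustrated with a small, hypothetical plain-Java sketch (not Spark code): a pool worker captures the context class loader of its creating thread at the moment the worker is spawned, so a loader installed afterwards is never visible to it.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextLoaderDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Force the pool to spawn its single worker thread now; the worker
        // inherits the context class loader of the creating thread at this
        // point in time.
        pool.submit(() -> {}).get();

        // Later, the application installs a richer class loader on the main
        // thread (loosely analogous to a loader that can see user classes
        // such as RDDBlockId).
        ClassLoader appLoader = new URLClassLoader(new URL[0],
                ContextLoaderDemo.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(appLoader);

        // Ask the already-created pool thread what loader it sees.
        ClassLoader seenByPool = pool.submit(
                () -> Thread.currentThread().getContextClassLoader()).get();

        // The worker still carries the loader it inherited at creation,
        // not the one installed afterwards.
        System.out.println(seenByPool == appLoader);
        pool.shutdown();
    }
}
```

This is why a heartbeat tick served by an unluckily-configured pool thread can fail to resolve a class that every other thread loads without trouble.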
Merged to 1.6 |
## What changes were proposed in this pull request?

To ensure that the deserialization of TaskMetrics uses a ClassLoader that knows about RDDBlockIds. The problem occurs only very rarely since it depends on which thread of the thread pool is used for the heartbeat.

I observe that the code in question has been largely rewritten for v2.0.0 of Spark and the problem no longer manifests. However it would seem reasonable to fix this for those users who need to continue with the 1.6 version for some time yet. Hence I have created a fix for the 1.6 code branch.

## How was this patch tested?

Due to the nature of the problem a reproducible testcase is difficult to produce. This problem was causing our application's nightly integration tests to fail randomly. Since applying the fix the tests have not failed due to this problem, for nearly six weeks now.

Author: Simon Scott <[email protected]>

Closes #13222 from simonjscott/fix-10722.
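The shape of the fix, deserializing against an explicitly supplied class loader rather than whatever loader happens to be attached to the calling thread, can be sketched in plain Java (a hypothetical helper, not the Spark 1.6 code; the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;

public class LoaderAwareDeserializer {
    // Deserialize bytes, resolving every class through the given loader
    // instead of the (possibly stale) thread context class loader.
    public static Object deserialize(byte[] bytes, ClassLoader loader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws ClassNotFoundException {
                return Class.forName(desc.getName(), false, loader);
            }
        }) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Round-trip a value, pinning resolution to this class's own loader.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject("heartbeat");
        }
        Object copy = deserialize(buf.toByteArray(),
                LoaderAwareDeserializer.class.getClassLoader());
        System.out.println(copy);
    }
}
```

Passing the loader into the deserialization call keeps the change local to that one call site, which matches the "no side-effects outside the call" property described in the discussion above.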
Thanks @simonjscott but yes you will actually have to close this manually -- the bot won't do it and we can't. |
@srowen Yes however when I closed it there was a message about unmerged commits, so I re-opened it. Should I close it anyway? |
It is merged (see 5cc1e2c referenced above). Github doesn't quite know that. It always says there are unmerged commits. |
Thank you for the reassurance. Closing...
(cherry picked from commit 5cc1e2c)