[SPARK-10722] RDDBlockId not found in driver-heartbeater #13222
Conversation
Jenkins test this please
Test build #58981 has finished for PR 13222 at commit
Apologies. Please accept that I am new around here, and of course the change built just fine for me. Looking at the changed file above, it really does seem innocent. I am unsure how to proceed. The only error I can see in the Jenkins output is this: which I'm afraid doesn't help me understand what I should have done better. Thanks
That looks like an unrelated failure.
Jenkins add to whitelist
Jenkins retest this please |
Test build #58985 has finished for PR 13222 at commit
@simonjscott have you verified that before this fix we still see the exception, and after this fix we no longer do?
@andrewor14 yes I believe so. I first encountered this exception when we deployed a new nightly integration test for our application. The test failed with this exception for six of the first seven nights. After investigation and deploying this fix in a locally-built spark lib, the exception has not recurred at all for the past six weeks.
@andrewor14 it seems pretty surgical and could plausibly fix this. Tests pass and @simonjscott is reasonably sure it's the right fix. Are you OK merging?
I can merge it, but I just don't really understand why this fixes the problem (or what the problem was caused by in the first place) |
@andrewor14 Allow me to try to clarify my reasoning…

We have reports from at least three users of an exception occurring unpredictably (and not reproducible on demand) while running Spark applications using a “local” master url. Inspection of the stack trace shows that the class RDDBlockId cannot be loaded while deserializing TaskMetrics. The deserialization is part of a serialize/deserialize combo used (in “local” mode only) to make a copy of the TaskMetrics.

Now since everything is running in the same JVM we can reasonably assume that the classpath is properly configured, otherwise how could the RDDBlockId have been created and present in the TaskMetrics in the first place? Yet the stack trace tells us that the classloader assigned to the heartbeater thread is not able to load RDDBlockId. So my fix is to tell the deserialize() to use a specific classloader (defined by Utils.getContextOrSparkClassLoader) that is able to load RDDBlockId. A nice property of the fix is that it is entirely localised to the call to deserialize() – there can be no side-effects outside of that call.

Why is the exception so unpredictable? My original conjecture was that the “heartbeat” thread is drawn from a pool and so may be a different actual thread instance on each tick of the heartbeat. And so everything will work until by some chance the actual thread for a given tick is not configured with the right class loader. Another possibility is that RDDBlockIds may be rare and most of the time the heartbeat thread doesn’t encounter them.

I hope that this is helpful.
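The pooled-thread conjecture above can be illustrated with a small, hypothetical plain-Java sketch (not Spark code): a pool worker captures the context class loader of its creating thread at the moment the worker is spawned, so a loader installed afterwards is never visible to it.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextLoaderDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Force the pool to spawn its single worker thread now; the worker
        // inherits the context class loader of the creating thread at this
        // point in time.
        pool.submit(() -> {}).get();

        // Later, the application installs a richer class loader on the main
        // thread (loosely analogous to a loader that can see user classes
        // such as RDDBlockId).
        ClassLoader appLoader = new URLClassLoader(new URL[0],
                ContextLoaderDemo.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(appLoader);

        // Ask the already-created pool thread what loader it sees.
        ClassLoader seenByPool = pool.submit(
                () -> Thread.currentThread().getContextClassLoader()).get();

        // The worker still carries the loader it inherited at creation,
        // not the one installed afterwards.
        System.out.println(seenByPool == appLoader);
        pool.shutdown();
    }
}
```

This is why a heartbeat tick served by an unluckily-configured pool thread can fail to resolve a class that every other thread loads without trouble.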
Merged to 1.6 |
## What changes were proposed in this pull request?

To ensure that the deserialization of TaskMetrics uses a ClassLoader that knows about RDDBlockIds. The problem occurs only very rarely since it depends on which thread of the thread pool is used for the heartbeat.

I observe that the code in question has been largely rewritten for v2.0.0 of Spark and the problem no longer manifests. However it would seem reasonable to fix this for those users who need to continue with the 1.6 version for some time yet. Hence I have created a fix for the 1.6 code branch.

## How was this patch tested?

Due to the nature of the problem a reproducible testcase is difficult to produce. This problem was causing our application's nightly integration tests to fail randomly. Since applying the fix the tests have not failed due to this problem, for nearly six weeks now.

Author: Simon Scott <[email protected]>

Closes #13222 from simonjscott/fix-10722.
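The shape of the fix, deserializing against an explicitly supplied class loader rather than whatever loader happens to be attached to the calling thread, can be sketched in plain Java (a hypothetical helper, not the Spark 1.6 code; the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;

public class LoaderAwareDeserializer {
    // Deserialize bytes, resolving every class through the given loader
    // instead of the (possibly stale) thread context class loader.
    public static Object deserialize(byte[] bytes, ClassLoader loader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws ClassNotFoundException {
                return Class.forName(desc.getName(), false, loader);
            }
        }) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Round-trip a value, pinning resolution to this class's own loader.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject("heartbeat");
        }
        Object copy = deserialize(buf.toByteArray(),
                LoaderAwareDeserializer.class.getClassLoader());
        System.out.println(copy);
    }
}
```

Passing the loader into the deserialization call keeps the change local to that one call site, which matches the "no side-effects outside the call" property described in the discussion above.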
Thanks @simonjscott but yes you will actually have to close this manually -- the bot won't do it and we can't. |
@srowen Yes however when I closed it there was a message about unmerged commits, so I re-opened it. Should I close it anyway? |
It is merged (see 5cc1e2c referenced above). Github doesn't quite know that. It always says there are unmerged commits. |
Thank you for the reassurance. Closing...
(cherry picked from commit 5cc1e2c)