[SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport) #2684
Conversation
QA tests have started for PR 2684 at commit

QA tests have finished for PR 2684 at commit

Test FAILed.

QA tests have started for PR 2684 at commit
```scala
HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
  val newJobConf = new JobConf(conf)
```
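In isolation, the guarded-clone pattern in that diff looks like the sketch below. This is a minimal, Hadoop-free illustration: the `Conf` class, its copy constructor, and the lock are toy stand-ins for `JobConf` and `HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK`, not Spark's actual code.

```scala
object GuardedCloneSketch {
  // Toy stand-in for Hadoop's mutable, non-thread-safe configuration object.
  class Conf(initial: Map[String, String]) {
    @volatile var props: Map[String, String] = initial
    def this(other: Conf) = this(other.props) // copy constructor snapshots the entries
  }

  // Older Hadoop Configuration constructors touch shared static state, so even
  // instantiation is serialized behind a single JVM-wide lock.
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object

  // Each task gets its own clone instead of all tasks mutating one shared conf.
  def cloneForTask(broadcastConf: Conf): Conf =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new Conf(broadcastConf)
    }
}
```

The key property is that every call returns a distinct object, so per-task mutations can no longer race on a shared map.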
Does this actually clone the internal map, or does it just create pointers to the supplied conf? If it just creates pointers, it seems like it might end up having the same synchronization issues.
JobConf seems to implement this constructor by calling the superclass's constructor. Take a look at the git blame for Configuration: it looks like this constructor performs proper copying and has done so for a while (since 2009 or 2010).
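The distinction being debated here can be illustrated with a toy, Hadoop-free example: a wrapper that merely holds a reference to the original map still sees every later mutation, while a wrapper that copies the entries at construction time (as Configuration's copy constructor is described to do) is isolated from them. Both classes below are hypothetical stand-ins, not Hadoop types.

```scala
import scala.collection.mutable

object CopyVsPointerDemo {
  // "Pointer" wrapper: shares the underlying mutable map with the original.
  class SharedView(val props: mutable.Map[String, String])

  // Copying wrapper: snapshots the entries when constructed.
  class CopiedConf(source: mutable.Map[String, String]) {
    val props: mutable.Map[String, String] = source.clone()
  }

  def main(args: Array[String]): Unit = {
    val original = mutable.Map("k" -> "v")
    val view = new SharedView(original)
    val copy = new CopiedConf(original)
    original += ("added" -> "later")       // mutate after construction
    println(view.props.contains("added"))  // true: still coupled to the original
    println(copy.props.contains("added"))  // false: isolated copy
  }
}
```

Only the copying behavior actually removes the shared mutable state that causes the synchronization issues.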
QA tests have finished for PR 2684 at commit
By the way, I checked and this patch cleanly cherry-picks into
```scala
if (conf.isInstanceOf[JobConf]) {
  // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
  conf.asInstanceOf[JobConf]
} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
```
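The branching being edited above can be sketched end to end with toy types: if the user broadcast a `JobConf` directly, use it as-is; otherwise build the task's `JobConf` under the instantiation lock. This is an illustration of the control flow only; the classes below are simplified stand-ins, not Spark's actual `getJobConf`.

```scala
object GetJobConfSketch {
  // Toy stand-ins for Hadoop's Configuration/JobConf hierarchy.
  class Conf(val props: Map[String, String])
  class JobConf(props: Map[String, String]) extends Conf(props) {
    def this(other: Conf) = this(other.props) // copy constructor
  }

  private val CONFIGURATION_INSTANTIATION_LOCK = new Object

  def getJobConf(broadcast: Conf): JobConf = broadcast match {
    // A user-broadcasted JobConf was provided, so always use it.
    case jc: JobConf => jc
    // Otherwise construct a fresh JobConf under the instantiation lock.
    case c =>
      CONFIGURATION_INSTANTIATION_LOCK.synchronized {
        new JobConf(c)
      }
  }
}
```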
jobConfCacheKey doesn't seem to be used anymore. Should that be removed?
You're right; good catch. I'll remove it.
Test FAILed.
Looks like Jenkins is being flaky today, since we've been seeing a lot of these "git fetch failed" errors.

Jenkins, retest this please (testing the new Jenkins).
QA tests have started for PR 2684 at commit

QA tests have finished for PR 2684 at commit

Test PASSed.
Does anyone have additional feedback on this? I have a test at https://gist.github.com/JoshRosen/287630864ac9803fe59f that demonstrates a (different) set of Configuration thread-safety symptoms that this patch fixes. When merging this, please also cherry-pick into
We were able to verify this fix on 1.0.2 by running a test benchmark job 6 times before and after the patch. We verified the results by checking the number of output part files for each job: after patching, none of the part files were missing for our benchmark jobs. During benchmarking, we noticed an 8.69% decrease in performance as measured by the average job time from 5 runs, which is at an acceptable level for us. Let me know if you need any more details. Thanks Josh!
QA tests have started for PR 2684 at commit
@frydawg524 Thanks for testing this out! I'm glad to hear that it solves the bug. I just pushed a new commit which adds a configuration option (`spark.hadoop.cloneConf`) to control the cloning. I also plan to open an upstream ticket with Hadoop. That won't solve the problem for Spark users who might be stuck using older Hadoop versions (so we still need our own workaround), but it would be nice to see this eventually get fixed upstream.
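Because cloning is opt-in, the flag defaults to off and only jobs that actually hit the Configuration thread-safety bug need to pay the cloning cost. A minimal sketch of such an opt-in lookup, using a plain `Map` as a toy stand-in for Spark's configuration object:

```scala
object CloneConfFlag {
  // spark.hadoop.cloneConf gates the per-task cloning path; absent or "false"
  // means the old shared-JobConf behavior, so existing jobs see no overhead.
  def shouldCloneConf(settings: Map[String, String]): Boolean =
    settings.getOrElse("spark.hadoop.cloneConf", "false").toBoolean
}
```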
@JoshRosen, Zach
More flavor on the perf numbers: we ran 6 jobs in a row before and after (starting up a new driver for each job), discarded the first run, and took the average of the remaining five. Pre-patch the times were ~1m50s; post-patch they were ~2m1s.
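As a sanity check, those rounded times (~110s pre-patch, ~121s post-patch) imply roughly a 9–10% slowdown depending on which baseline you divide by, which is in the same ballpark as the reported 8.69% once you allow for rounding of the per-run averages:

```scala
object SlowdownCheck {
  val preSeconds  = 110.0 // ~1m50s average, pre-patch
  val postSeconds = 121.0 // ~2m1s average, post-patch
  // Slowdown relative to post-patch time (throughput-style): ~9.1%
  val vsPost = (postSeconds - preSeconds) / postSeconds
  // Slowdown relative to pre-patch time (latency-style): 10.0%
  val vsPre = (postSeconds - preSeconds) / preSeconds
}
```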
QA tests have finished for PR 2684 at commit

Test PASSed.
I'm going to merge this and cherry-pick it into all maintenance branches. We'll probably turn on cloning by default in 1.2 and we'll be sure to clearly document this configuration option in the 1.0.3 and 1.1.1 release notes. Thanks to everyone who helped test this! |
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`. The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops. The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks. Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.

Author: Josh Rosen <[email protected]>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`. The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops. The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks. Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.

Author: Josh Rosen <[email protected]>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.

(cherry picked from commit 2cd40db)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	docs/configuration.md