
[SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport) #2684

Closed
wants to merge 3 commits

Conversation

JoshRosen
Contributor

This patch attempts to fix SPARK-2546 in branch-1.0 and branch-1.1. The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops. The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks. Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (spark.hadoop.cloneConf) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.
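The option's semantics can be sketched as follows; the helper name and plain settings map are hypothetical, but the default-off behavior matches the description above:

```scala
object CloneConfFlagSketch {
  // Hypothetical helper: reads the spark.hadoop.cloneConf flag from a plain
  // settings map; cloning is disabled unless the flag is explicitly "true".
  def shouldCloneConf(settings: Map[String, String]): Boolean =
    settings.get("spark.hadoop.cloneConf").exists(_.trim.equalsIgnoreCase("true"))

  def main(args: Array[String]): Unit = {
    assert(!shouldCloneConf(Map.empty)) // disabled by default
    assert(shouldCloneConf(Map("spark.hadoop.cloneConf" -> "true")))
    println("ok")
  }
}
```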

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2684 at commit dd25697.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2684 at commit dd25697.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21352/

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2684 at commit dd25697.

  • This patch merges cleanly.

HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
  val newJobConf = new JobConf(conf)
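The snippet above guards the `new JobConf(conf)` call with a global lock, since constructing Configuration objects concurrently is itself unsafe. A minimal sketch of the same copy-under-lock pattern, using a toy config class rather than Spark's actual code:

```scala
object GuardedCloneSketch {
  // Global lock, analogous to HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object

  // Toy stand-in for JobConf; it only illustrates the pattern.
  final class ToyConf(val settings: Map[String, String]) {
    def this(other: ToyConf) = this(other.settings) // copy constructor
  }

  // Each task clones its own copy while holding the lock.
  def cloneForTask(shared: ToyConf): ToyConf =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new ToyConf(shared)
    }

  def main(args: Array[String]): Unit = {
    val shared = new ToyConf(Map("k" -> "v"))
    val clone = cloneForTask(shared)
    assert(clone ne shared)                   // distinct instance per task
    assert(clone.settings == shared.settings) // same contents
    println("ok")
  }
}
```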
Contributor


Does this actually clone the internal map? Or does it just create pointers to the supplied conf? If it just creates pointers it seems like it might end up having the same synchronization issues.

Contributor Author


JobConf seems to implement this constructor by calling the superclass's constructor.

Take a look at the git blame for Configuration:

https://github.com/apache/hadoop/blame/release-2.5.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L662

It looks like this constructor performs proper copying and has done so for a while (since 2009 or 2010).
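The distinction the reviewer is asking about can be illustrated with a toy config whose copy constructor copies values rather than holding a pointer to the original, analogous to what the `Configuration(Configuration other)` constructor is described as doing above; this is illustrative, not Hadoop's actual code:

```scala
object CopyConstructorDemo {
  // Toy config: the auxiliary constructor copies the map by value,
  // so later mutation of the original does not leak into the copy.
  final class ToyConf(initial: Map[String, String]) {
    private var props: Map[String, String] = initial
    def this(other: ToyConf) = this(other.snapshot) // value copy, not a pointer
    def set(k: String, v: String): Unit = props += (k -> v)
    def get(k: String): Option[String] = props.get(k)
    def snapshot: Map[String, String] = props
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyConf(Map.empty)
    original.set("key", "before")
    val copy = new ToyConf(original)
    original.set("key", "after")               // mutate the original afterwards...
    assert(copy.get("key").contains("before")) // ...the copy is unaffected
    println("ok")
  }
}
```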

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2684 at commit dd25697.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

By the way, I checked and this patch cleanly cherry-picks into branch-1.0.

if (conf.isInstanceOf[JobConf]) {
  // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
  conf.asInstanceOf[JobConf]
} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
Contributor


jobConfCacheKey doesn't seem to be used anymore. Should that be removed?

Contributor Author


You're right; good catch. I'll remove it.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21419/

@JoshRosen
Contributor Author

Looks like Jenkins is being flaky today, since we've been seeing a lot of these "git fetch failed" errors.

@JoshRosen
Contributor Author

Jenkins, retest this please (testing the new Jenkins).

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2684 at commit b562451.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2684 at commit b562451.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21545/

@JoshRosen
Contributor Author

Does anyone have additional feedback on this? I have a test at https://gist.github.com/JoshRosen/287630864ac9803fe59f that demonstrates a (different) set of Configuration thread-safety symptoms that this patch fixes.

When merging this, please also cherry-pick into master and branch-1.0.0 (I opened this against branch-1.1 because I originally thought that we might explore a different solution for Spark 1.2+).

@fryz

fryz commented Oct 17, 2014

We were able to verify this fix on 1.0.2 by running a test benchmark job 6 times before and after the patch.
3/6 tests failed pre-patch and 0/6 failed post-patch.

We verified by checking the number of output part files for each job.
For jobs that failed, when we hit the deadlock, we saw speculation kill and re-attempt the task.
After doing this N times, the task failed and threw java.io.IOException: Failed to save output of task
Ultimately, this led to the job missing some indeterminate number of the output part files (the ones that failed to commit).

After patching, we verified that for our benchmark jobs none of the part files were missing.

During benchmarking, we noticed an 8.69% decrease in performance, measured as the average job time over 5 runs, which is at an acceptable level for us.

Let me know if you need any more details.

Thanks Josh!

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have started for PR 2684 at commit f14f259.

  • This patch merges cleanly.

@JoshRosen
Contributor Author

@frydawg524 Thanks for testing this out! I'm glad to hear that it solves the bug.

I just pushed a new commit which adds a configuration option (spark.hadoop.cloneConf) for controlling whether to clone the configuration (as in the patch you tested) or share a single configuration object across all tasks (the old code). The reasoning for this is that releasing 1.1.1 and 1.0.3 patches that cause measurable performance regressions will upset users who weren't affected by this issue. In 1.2, we may revisit this by seeing if we can find ways to make the cloning process faster.

I also plan to open an upstream ticket with Hadoop. That won't solve the problem for Spark users who might be stuck using older Hadoop versions (so we still need our own workaround), but it would be nice to see this eventually get fixed upstream.
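The clone-versus-share decision described above can be sketched as follows; the names and toy config are hypothetical, not the actual patch:

```scala
object CloneOrShareSketch {
  private val LOCK = new Object

  // Toy stand-in for JobConf with a copy constructor.
  final class ToyConf(val settings: Map[String, String]) {
    def this(other: ToyConf) = this(other.settings)
  }

  // If cloning is enabled, return a fresh copy per task, created under a
  // lock; otherwise fall back to the old behavior of sharing one instance.
  def getJobConf(shared: ToyConf, cloneConf: Boolean): ToyConf =
    if (cloneConf) LOCK.synchronized(new ToyConf(shared))
    else shared

  def main(args: Array[String]): Unit = {
    val shared = new ToyConf(Map("k" -> "v"))
    assert(getJobConf(shared, cloneConf = false) eq shared) // old path: shared
    assert(getJobConf(shared, cloneConf = true) ne shared)  // new path: per-task clone
    println("ok")
  }
}
```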

@fryz

fryz commented Oct 17, 2014

@JoshRosen,
Awesome! Thanks for helping out with this. I'll make sure that this gets broadcasted to my team.

Zach

@ash211
Contributor

ash211 commented Oct 17, 2014

More detail on the perf numbers: we ran 6 jobs in a row before and after (starting a new driver for each job), discarded the first run, and took the average of the remaining five.

Pre-patch the times were ~1m50s, post-patch they were ~2m1s.
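As a rough sanity check on those numbers (approximate, since the quoted times are rounded):

```scala
object SlowdownCheck {
  def main(args: Array[String]): Unit = {
    val preSeconds = 110.0  // ~1m50s
    val postSeconds = 121.0 // ~2m1s
    val slowdownPct = (postSeconds - preSeconds) / preSeconds * 100.0
    // ~10% with these rounded inputs, in the same ballpark as the
    // reported 8.69% average over five runs.
    assert(slowdownPct > 5.0 && slowdownPct < 15.0)
    println(f"approx. slowdown: $slowdownPct%.1f%%")
  }
}
```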

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have finished for PR 2684 at commit f14f259.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21866/

@JoshRosen
Contributor Author

I'm going to merge this and cherry-pick it into all maintenance branches. We'll probably turn on cloning by default in 1.2 and we'll be sure to clearly document this configuration option in the 1.0.3 and 1.1.1 release notes. Thanks to everyone who helped test this!

asfgit pushed a commit that referenced this pull request Oct 19, 2014
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.

Author: Josh Rosen <[email protected]>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.
asfgit pushed a commit that referenced this pull request Oct 19, 2014

(cherry picked from commit 2cd40db)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	docs/configuration.md
asfgit closed this in 7e63bb4 on Oct 19, 2014