
[SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport) #2684

Closed
wants to merge 3 commits

Conversation

JoshRosen
Contributor

This patch attempts to fix SPARK-2546 in branch-1.0 and branch-1.1. The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops. The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks. Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (spark.hadoop.cloneConf) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.
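The option's semantics can be sketched as follows; the helper name and plain settings map are hypothetical, but the default-off behavior matches the description above:

```scala
object CloneConfFlagSketch {
  // Hypothetical helper: reads the spark.hadoop.cloneConf flag from a plain
  // settings map; cloning is disabled unless the flag is explicitly "true".
  def shouldCloneConf(settings: Map[String, String]): Boolean =
    settings.get("spark.hadoop.cloneConf").exists(_.trim.equalsIgnoreCase("true"))

  def main(args: Array[String]): Unit = {
    assert(!shouldCloneConf(Map.empty)) // disabled by default
    assert(shouldCloneConf(Map("spark.hadoop.cloneConf" -> "true")))
    println("ok")
  }
}
```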

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2684 at commit dd25697.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2684 at commit dd25697.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21352/

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2684 at commit dd25697.

  • This patch merges cleanly.

HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
  val newJobConf = new JobConf(conf)
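The snippet above guards the `new JobConf(conf)` call with a global lock, since constructing Configuration objects concurrently is itself unsafe. A minimal sketch of the same copy-under-lock pattern, using a toy config class rather than Spark's actual code:

```scala
object GuardedCloneSketch {
  // Global lock, analogous to HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object

  // Toy stand-in for JobConf; it only illustrates the pattern.
  final class ToyConf(val settings: Map[String, String]) {
    def this(other: ToyConf) = this(other.settings) // copy constructor
  }

  // Each task clones its own copy while holding the lock.
  def cloneForTask(shared: ToyConf): ToyConf =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new ToyConf(shared)
    }

  def main(args: Array[String]): Unit = {
    val shared = new ToyConf(Map("k" -> "v"))
    val clone = cloneForTask(shared)
    assert(clone ne shared)                   // distinct instance per task
    assert(clone.settings == shared.settings) // same contents
    println("ok")
  }
}
```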
Contributor


Does this actually clone the internal map? Or does it just create pointers to the supplied conf? If it just creates pointers it seems like it might end up having the same synchronization issues.

Contributor Author


JobConf seems to implement this constructor by calling the superclass's constructor.

Take a look at the git blame for Configuration:

https://github.com/apache/hadoop/blame/release-2.5.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L662

It looks like this constructor performs proper copying and has done so for a while (since 2009 or 2010).
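The distinction the reviewer is asking about can be illustrated with a toy config whose copy constructor copies values rather than holding a pointer to the original, analogous to what the `Configuration(Configuration other)` constructor is described as doing above; this is illustrative, not Hadoop's actual code:

```scala
object CopyConstructorDemo {
  // Toy config: the auxiliary constructor copies the map by value,
  // so later mutation of the original does not leak into the copy.
  final class ToyConf(initial: Map[String, String]) {
    private var props: Map[String, String] = initial
    def this(other: ToyConf) = this(other.snapshot) // value copy, not a pointer
    def set(k: String, v: String): Unit = props += (k -> v)
    def get(k: String): Option[String] = props.get(k)
    def snapshot: Map[String, String] = props
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyConf(Map.empty)
    original.set("key", "before")
    val copy = new ToyConf(original)
    original.set("key", "after")               // mutate the original afterwards...
    assert(copy.get("key").contains("before")) // ...the copy is unaffected
    println("ok")
  }
}
```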

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2684 at commit dd25697.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

By the way, I checked and this patch cleanly cherry-picks into branch-1.0.

if (conf.isInstanceOf[JobConf]) {
  // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
  conf.asInstanceOf[JobConf]
} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
Contributor


jobConfCacheKey doesn't seem to be used anymore. Should that be removed?

Contributor Author


You're right; good catch. I'll remove it.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21419/

@JoshRosen
Contributor Author

Looks like Jenkins is being flaky today, since we've been seeing a lot of these "git fetch failed" errors.

@JoshRosen
Contributor Author

Jenkins, retest this please (testing the new Jenkins).

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2684 at commit b562451.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2684 at commit b562451.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21545/

@JoshRosen
Contributor Author

Does anyone have additional feedback on this? I have a test at https://gist.github.com/JoshRosen/287630864ac9803fe59f that demonstrates a (different) set of Configuration thread-safety symptoms that this patch fixes.

When merging this, please also cherry-pick into master and branch-1.0.0 (I opened this against branch-1.1 because I originally thought that we might explore a different solution for Spark 1.2+).

@fryz

fryz commented Oct 17, 2014

We were able to verify this fix on 1.0.2 by running a test benchmark job 6 times before and after the patch.
3/6 tests failed pre-patch and 0/6 failed post-patch.

We verified by checking the number of output part files for each job.
For jobs that failed, when we hit the deadlock, we saw speculation kill and re-attempt the task.
After doing this N times, the task failed and threw java.io.IOException: Failed to save output of task
Ultimately, this led to the job missing some indeterminate number of the output part files (the ones that failed to commit).

After patching, we verified that for our benchmark jobs none of the part files were missing.

During benchmarking, we noticed an 8.69% decrease in performance, measured as the average job time over 5 runs, which is at an acceptable level for us.

Let me know if you need any more details.

Thanks Josh!

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have started for PR 2684 at commit f14f259.

  • This patch merges cleanly.

@JoshRosen
Contributor Author

@frydawg524 Thanks for testing this out! I'm glad to hear that it solves the bug.

I just pushed a new commit which adds a configuration option (spark.hadoop.cloneConf) for controlling whether to clone the configuration (as in the patch you tested) or share a single configuration object across all tasks (the old code). The reasoning for this is that releasing 1.1.1 and 1.0.3 patches that cause measurable performance regressions will upset users who weren't affected by this issue. In 1.2, we may revisit this by seeing if we can find ways to make the cloning process faster.

I also plan to open an upstream ticket with Hadoop. That won't solve the problem for Spark users who might be stuck using older Hadoop versions (so we still need our own workaround), but it would be nice to see this eventually get fixed upstream.
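The clone-versus-share decision described above can be sketched as follows; the names and toy config are hypothetical, not the actual patch:

```scala
object CloneOrShareSketch {
  private val LOCK = new Object

  // Toy stand-in for JobConf with a copy constructor.
  final class ToyConf(val settings: Map[String, String]) {
    def this(other: ToyConf) = this(other.settings)
  }

  // If cloning is enabled, return a fresh copy per task, created under a
  // lock; otherwise fall back to the old behavior of sharing one instance.
  def getJobConf(shared: ToyConf, cloneConf: Boolean): ToyConf =
    if (cloneConf) LOCK.synchronized(new ToyConf(shared))
    else shared

  def main(args: Array[String]): Unit = {
    val shared = new ToyConf(Map("k" -> "v"))
    assert(getJobConf(shared, cloneConf = false) eq shared) // old path: shared
    assert(getJobConf(shared, cloneConf = true) ne shared)  // new path: per-task clone
    println("ok")
  }
}
```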

@fryz

fryz commented Oct 17, 2014

@JoshRosen,
Awesome! Thanks for helping out with this. I'll make sure that this gets broadcasted to my team.

Zach

@ash211
Contributor

ash211 commented Oct 17, 2014

More detail on the perf numbers: we ran 6 jobs in a row before and after (starting a new driver for each job), discarded the first run, and took the average of the remaining five.

Pre-patch the times were ~1m50s, post-patch they were ~2m1s.
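As a rough sanity check on those numbers (approximate, since the quoted times are rounded):

```scala
object SlowdownCheck {
  def main(args: Array[String]): Unit = {
    val preSeconds = 110.0  // ~1m50s
    val postSeconds = 121.0 // ~2m1s
    val slowdownPct = (postSeconds - preSeconds) / preSeconds * 100.0
    // ~10% with these rounded inputs, in the same ballpark as the
    // reported 8.69% average over five runs.
    assert(slowdownPct > 5.0 && slowdownPct < 15.0)
    println(f"approx. slowdown: $slowdownPct%.1f%%")
  }
}
```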

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have finished for PR 2684 at commit f14f259.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21866/

@JoshRosen
Contributor Author

I'm going to merge this and cherry-pick it into all maintenance branches. We'll probably turn on cloning by default in 1.2 and we'll be sure to clearly document this configuration option in the 1.0.3 and 1.1.1 release notes. Thanks to everyone who helped test this!

asfgit pushed a commit that referenced this pull request Oct 19, 2014
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.

Author: Josh Rosen <[email protected]>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.
asfgit pushed a commit that referenced this pull request Oct 19, 2014

(cherry picked from commit 2cd40db)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	docs/configuration.md
asfgit closed this in 7e63bb4 on Oct 19, 2014