Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2585] Remove special handling of Hadoop JobConf #2683

Closed
wants to merge 3 commits into from

Conversation

JoshRosen
Copy link
Contributor

Previously we broadcast JobConf for HadoopRDD because it is large. Now we always broadcast RDDs and task closures so it should no longer be necessary to broadcast the JobConf anymore.

This is a resubmission of @rxin's #1648 (closes #1648).

rxin and others added 3 commits October 6, 2014 11:11
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have started for PR 2683 at commit 1d67d9d.

  • This patch merges cleanly.

// local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
// The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
// Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456).
HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin I'm not sure that it's safe to remove this synchronization around the JobConf constructor. This synchronization was originally added to address HADOOP-10456, which affects a number of Hadoop versions that we still want to be compatible with.

@SparkQA
Copy link

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2683 at commit 1d67d9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21350/Test PASSed.

@JoshRosen
Copy link
Contributor Author

Due to the CONFIGURATION_INSTANTIATION_LOCK thread-safety issue, I think that we'll still end up having to serialize the Configuration separately. If we didn't, then we'd have to hold CONFIGURATION_INSTANTIATION_LOCK while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long).

Therefore, I don't think that we'll be able to safely remove this complexity, so I'm going to recommend closing this PR in favor of applying the small #2684 fix in 1.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants