[SPARK-2585] Remove special handling of Hadoop JobConf #2683

JoshRosen · 2014-10-06T22:54:11Z

Previously, we broadcast the JobConf for HadoopRDD because it is large. Now that we always broadcast RDDs and task closures, it should no longer be necessary to broadcast the JobConf.

This is a resubmission of @rxin's #1648 (closes #1648).

Conflicts:
    core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
    sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

SparkQA · 2014-10-06T22:59:54Z

QA tests have started for PR 2683 at commit 1d67d9d.

This patch merges cleanly.

JoshRosen · 2014-10-06T23:11:41Z

core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala

-      // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
-      // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
-      // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456).
-      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {


@rxin I'm not sure that it's safe to remove this synchronization around the JobConf constructor. This synchronization was originally added to address HADOOP-10456, which affects a number of Hadoop versions that we still want to be compatible with.
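
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of guarding object construction with one process-wide lock, as the removed lines do around the JobConf constructor. It does not use Hadoop: `FakeConf` and `ConfLock` are hypothetical stand-ins invented for illustration, and the real `HadoopRDD` code may differ in detail.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable, non-thread-safe configuration object.
// This is NOT the Hadoop Configuration or JobConf class.
final class FakeConf(initial: Map[String, String] = Map.empty) {
  private val props = mutable.HashMap[String, String](initial.toSeq: _*)
  def set(key: String, value: String): Unit = props(key) = value
  def get(key: String): Option[String] = props.get(key)
  def snapshot: Map[String, String] = props.toMap
}

object ConfLock {
  // One lock shared by the whole process, playing the role that
  // CONFIGURATION_INSTANTIATION_LOCK plays in the diff above.
  val lock = new Object

  // Copy-construct under the lock so that two threads never build a
  // copy from the same base configuration at the same time.
  def copyOf(base: FakeConf): FakeConf = lock.synchronized {
    new FakeConf(base.snapshot)
  }
}
```

The point of the sketch is only that every caller that constructs a copy goes through the same lock; removing the `synchronized` wrapper is what the comment above is questioning.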

SparkQA · 2014-10-07T00:45:33Z

QA tests have finished for PR 2683 at commit 1d67d9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-07T00:45:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21350/
Test PASSed.

JoshRosen · 2014-10-07T21:54:07Z

Due to the CONFIGURATION_INSTANTIATION_LOCK thread-safety issue, I think that we'll still end up having to serialize the Configuration separately. If we didn't, then we'd have to hold CONFIGURATION_INSTANTIATION_LOCK while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long).

Therefore, I don't think that we'll be able to safely remove this complexity, so I'm going to recommend closing this PR in favor of applying the small #2684 fix in 1.2.
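
The trade-off described in this comment can be sketched as follows: if the configuration travels as its own payload, the lock only needs to be held while that payload is decoded, and the rest of the task is decoded without it. This is an illustrative sketch under stated assumptions, not Spark's implementation; `SeparateConfSketch`, `encodeConf`, `decodeConf`, and `decodeTask` are hypothetical names, and a plain `Map[String, String]` stands in for the Hadoop Configuration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object SeparateConfSketch {
  // Hypothetical process-wide lock, standing in for CONFIGURATION_INSTANTIATION_LOCK.
  private val confLock = new Object

  // Encode the configuration as its own byte payload, separate from the task.
  def encodeConf(props: Map[String, String]): Array[Byte] = {
    val buf = new ByteArrayOutputStream()
    val out = new DataOutputStream(buf)
    out.writeInt(props.size)
    props.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
    out.flush()
    buf.toByteArray
  }

  // Only this short step runs under the lock.
  def decodeConf(bytes: Array[Byte]): Map[String, String] = confLock.synchronized {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val n = in.readInt()
    (0 until n).map { _ =>
      val k = in.readUTF()
      val v = in.readUTF()
      k -> v
    }.toMap
  }

  // The task payload (here just a string) is decoded without taking the lock,
  // so concurrent tasks do not serialize on it.
  def decodeTask(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}
```

If the configuration were instead embedded in the task payload, the whole of `decodeTask` would have to run inside `confLock.synchronized`, which is the performance penalty the comment is concerned about.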

rxin and others added 3 commits October 6, 2014 11:11

[SPARK-2585] Remove special handling of Hadoop JobConf.

1980f5e

Conflicts:
    core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
    sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

Remove JobConf broadcast comment.

d16102d

Add comment to address Aaron's review comment in apache#1648.

1d67d9d

JoshRosen reviewed Oct 6, 2014

JoshRosen closed this Oct 16, 2014

JoshRosen mentioned this pull request Oct 24, 2014

[SPARK-2585] remove unnecessary broadcast for conf #2935

Closed
