[SPARK-2585] remove unnecessary broadcast for conf #2935

Closed
wants to merge 11 commits into from


davies
Contributor

@davies davies commented Oct 24, 2014

We already broadcast the task (RDD and closure) itself, so small data used in the RDD or closure no longer needs to be broadcast explicitly.

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22167 has started for PR 2935 at commit b4cd73e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22167 has finished for PR 2935 at commit b4cd73e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22167/
Test FAILed.

@JoshRosen
Contributor

Due to a Hadoop thread-safety issue in Configuration's constructor, we need to hold a lock in any code that might call new Configuration() on the executor. We don't want to hold the lock for the entire task deserialization because that won't allow tasks to be launched in parallel. Instead, though, we could make our own wrapper that holds a Configuration and grabs CONFIGURATION_INSTANTIATION_LOCK in its readObject or readExternal method.

For some context on this issue, see #2683 and https://issues.apache.org/jira/browse/SPARK-2585
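
A minimal sketch of the wrapper idea described above, under stated assumptions: `StandInConf` is a hypothetical stand-in for Hadoop's `Configuration` (so the example runs without Hadoop on the classpath), and `ConfLock` and `SerializableConf` are illustrative names, not code from this PR. Only the construct-and-read step in `readObject` takes the lock, so the rest of task deserialization is not serialized behind it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataOutput,
  ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
// Like the real class, it is not java.io.Serializable and reads itself from a DataInput.
class StandInConf {
  private val props = mutable.Map.empty[String, String]
  def set(key: String, value: String): Unit = props.update(key, value)
  def get(key: String): Option[String] = props.get(key)
  def write(out: DataOutput): Unit = {
    out.writeInt(props.size)
    for ((k, v) <- props) { out.writeUTF(k); out.writeUTF(v) }
  }
  def readFields(in: DataInput): Unit = {
    props.clear()
    val n = in.readInt()
    for (_ <- 0 until n) {
      val k = in.readUTF()
      val v = in.readUTF()
      props.update(k, v)
    }
  }
}

object ConfLock {
  // Plays the role of CONFIGURATION_INSTANTIATION_LOCK in the discussion (assumed name and location).
  val CONFIGURATION_INSTANTIATION_LOCK = new Object()
}

// Serializable wrapper: the conf itself is transient and written/read by hand.
class SerializableConf(@transient var value: StandInConf) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // Hold the lock only while constructing and populating the configuration.
    ConfLock.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      value = new StandInConf()
      value.readFields(in)
    }
  }
}

object WrapperDemo {
  // Serialize and deserialize the wrapper through Java serialization.
  def roundTrip(w: SerializableConf): SerializableConf = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(w)
    oos.close()
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try ois.readObject().asInstanceOf[SerializableConf] finally ois.close()
  }
}
```

Each call to `WrapperDemo.roundTrip` yields a wrapper holding its own freshly constructed copy of the configuration, which is the property the wrapper approach relies on.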

@JoshRosen
Contributor

BTW, maybe you meant to link to https://issues.apache.org/jira/browse/SPARK-4083? I think this is a duplicate of https://issues.apache.org/jira/browse/SPARK-2585

@JoshRosen
Contributor

Also, adding our own synchronizing wrapper will let us roll back some of the complexity introduced by #2684 for ensuring thread-safety, since each task will get its own deserialized copy of the configuration.

I suppose that this could have a small performance penalty because we'll always construct a new Configuration (which might be expensive), but I think it should be pretty minimal (we can try measuring it) and is probably offset by other performance improvements in 1.2.

By the way, CONFIGURATION_INSTANTIATION_LOCK should probably be moved to SparkHadoopUtil so that it's accessible from more places that might create Configurations.

@davies davies changed the title from "[SPARK-4082] remove unnecessary broadcast for conf" to "[SPARK-2585] remove unnecessary broadcast for conf" Oct 24, 2014
@JoshRosen
Contributor

Oh, one more comment: could you add a newConfiguration() method to SparkHadoopUtil that synchronizes on the lock and returns new Configuration()? This would help to simplify some of the code in the streaming WAL patch which needs to create new configurations.
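
A rough sketch of what such a helper could look like, assuming a hypothetical `CountingConf` stand-in for Hadoop's `Configuration` and an illustrative `HadoopUtilSketch` object in place of `SparkHadoopUtil`. The stand-in's constructor updates unsynchronized shared state to mimic a constructor that is not thread-safe; routing every construction through `newConfiguration()` keeps that state consistent.

```scala
// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
object CountingConf {
  // Deliberately unsynchronized shared state touched by the constructor.
  var created: Int = 0
}

class CountingConf {
  CountingConf.created += 1
}

// Illustrative placement only; the thread proposes SparkHadoopUtil for the real thing.
object HadoopUtilSketch {
  val CONFIGURATION_INSTANTIATION_LOCK = new Object()

  // Every caller that needs a fresh configuration goes through the lock.
  def newConfiguration(): CountingConf =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new CountingConf()
    }
}

object NewConfigurationDemo {
  // Create configurations from several threads and return how many were constructed.
  def createConcurrently(threads: Int, perThread: Int): Int = {
    val before = CountingConf.created
    val workers = (1 to threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var i = 0
          while (i < perThread) {
            HadoopUtilSketch.newConfiguration()
            i += 1
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    CountingConf.created - before
  }
}
```

Because all constructions happen under the same lock, the counter ends up matching the number of calls; without the lock the unsynchronized increment could lose updates.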

@davies
Contributor Author

davies commented Oct 25, 2014

But inside readFields(), it may call new Configuration(), so we still need to synchronize it here.

@JoshRosen
Contributor

Yeah, we'll still need the synchronization there, but I was hoping to add the newConfiguration() to address the use-cases in Streaming, where they construct new configurations.

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have started for PR 2935 at commit bc46dda.

  • This patch does not merge cleanly.

ow.readFields(in)
SparkHadoopUtil.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
ow.setConf(SparkHadoopUtil.newConfiguration())
ow.readFields(in) // not thread safe
Contributor Author

maybe

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22194 has started for PR 2935 at commit 8694cb3.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Oct 25, 2014

@JoshRosen this PR is ready to review, thanks!

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have started for PR 2935 at commit bc46dda.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22195 has started for PR 2935 at commit 32bd815.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22193/
Test FAILed.

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have finished for PR 2935 at commit bc46dda.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22192/
Test FAILed.

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have finished for PR 2935 at commit bc46dda.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22201 has started for PR 2935 at commit 1fd70df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22201 timed out for PR 2935 at commit 1fd70df after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22201/
Test FAILed.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #432 timed out for PR 2935 at commit 32bd815 after a configured wait of 120m.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #448 has started for PR 2935 at commit 1fd70df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #449 has started for PR 2935 at commit 1fd70df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #448 timed out for PR 2935 at commit 1fd70df after a configured wait of 120m.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #449 timed out for PR 2935 at commit 1fd70df after a configured wait of 120m.

@davies
Contributor Author

davies commented Oct 25, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22221 has started for PR 2935 at commit 1fd70df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22222 has started for PR 2935 at commit ae32e92.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22221 timed out for PR 2935 at commit 1fd70df after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22221/
Test FAILed.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22222 has finished for PR 2935 at commit ae32e92.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22222/
Test FAILed.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #452 has started for PR 2935 at commit ae32e92.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22240 has started for PR 2935 at commit 63f2972.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #452 timed out for PR 2935 at commit ae32e92 after a configured wait of 120m.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22240 timed out for PR 2935 at commit 63f2972 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22240/
Test FAILed.

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@SparkQA

SparkQA commented Oct 28, 2014

Test build #482 has started for PR 2935 at commit 63f2972.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22347 has started for PR 2935 at commit 5329091.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #482 timed out for PR 2935 at commit 63f2972 after a configured wait of 120m.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22347 timed out for PR 2935 at commit 5329091 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22347/
Test FAILed.

@davies
Contributor Author

davies commented Oct 28, 2014

The Jenkins test timeouts are caused by a performance regression introduced in this PR. Deserializing a Hadoop Configuration is pretty slow: after removing the broadcast for the configuration, each task may incur 10-50 ms of overhead for it. For Hive queries, each task may deserialize more than one configuration per query, so the regression is even larger.

So we cannot remove the broadcast for conf unless we can speed up deserialization of the Hadoop configuration.

@JoshRosen
Contributor

If we can't find a workaround to speed up the deserialization, do you mind closing this PR for now?

@davies davies closed this Oct 31, 2014

4 participants