[SPARK-2585] remove unnecessary broadcast for conf #2935
Conversation
Test build #22167 has started for PR 2935 at commit …
Test build #22167 has finished for PR 2935 at commit …
Test FAILed.
Due to a Hadoop thread-safety issue in `Configuration` … For some context on this issue, see #2683 and https://issues.apache.org/jira/browse/SPARK-2585.
BTW, maybe you meant to link to https://issues.apache.org/jira/browse/SPARK-4083? I think this is a duplicate of https://issues.apache.org/jira/browse/SPARK-2585.
Also, adding our own synchronizing wrapper will let us roll back some of the complexity introduced by #2684 for ensuring thread-safety, since each task will get its own deserialized copy of the configuration. I suppose that this could have a small performance penalty because we'll always construct a new `Configuration` (which might be expensive), but I think it should be pretty minimal (we can try measuring it) and is probably offset by other performance improvements in 1.2. By the way, `CONFIGURATION_INSTANTIATION_LOCK` should probably be moved to `SparkHadoopUtil` so that it's accessible from more places that might create Configurations.
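A minimal sketch of what that refactoring might look like, assuming a plain global lock object living in `SparkHadoopUtil` (the `newConfiguration()` helper here is illustrative, not the PR's actual code):

```scala
import org.apache.hadoop.conf.Configuration

object SparkHadoopUtil {
  // Global lock guarding Configuration construction: Hadoop's
  // Configuration constructor mutates shared static state and is not
  // thread-safe in some Hadoop versions.
  val CONFIGURATION_INSTANTIATION_LOCK = new Object()

  // Illustrative helper: construct every new Configuration under the
  // lock so all call sites get the same protection.
  def newConfiguration(): Configuration =
    CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new Configuration()
    }
}
```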
Oh, one more comment: could you add a …
But inside `readFields()`, it may call `new Configuration()`, so we still need to synchronize it here.
Yeah, we'll still need the synchronization there, but I was hoping to add the …
QA tests have started for PR 2935 at commit …
```diff
-    ow.readFields(in)
+    SparkHadoopUtil.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
+      ow.setConf(SparkHadoopUtil.newConfiguration())
+      ow.readFields(in) // not thread safe
+    }
```
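For orientation, a hedged sketch of the kind of deserialization hook this hunk appears to patch; it mirrors a serializable wrapper around a Hadoop `Writable`, but the surrounding class and both method bodies should be treated as assumptions, not the PR's exact code:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.io.{ObjectWritable, Writable}

// Sketch of a serializable wrapper around a Hadoop Writable. Both the
// Configuration construction and readFields (which may itself call
// new Configuration()) run under the global lock during deserialization.
class SerializableWritable[T <: Writable](@transient var t: T) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    // Serialize the wrapped Writable through Hadoop's ObjectWritable.
    new ObjectWritable(t).write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    val ow = new ObjectWritable()
    SparkHadoopUtil.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      ow.setConf(SparkHadoopUtil.newConfiguration())
      ow.readFields(in) // not thread safe
    }
    t = ow.get().asInstanceOf[T]
  }
}
```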
maybe …
Test build #22194 has started for PR 2935 at commit …
@JoshRosen this PR is ready to review, thanks!
QA tests have started for PR 2935 at commit …
Test build #22195 has started for PR 2935 at commit …
Test FAILed.
QA tests have finished for PR 2935 at commit …
Test FAILed.
QA tests have finished for PR 2935 at commit …
Test build #22201 has started for PR 2935 at commit …
Test build #22201 timed out for PR 2935 at commit …
Test FAILed.
Test build #432 timed out for PR 2935 at commit …
Test build #448 has started for PR 2935 at commit …
Test build #449 has started for PR 2935 at commit …
Test build #448 timed out for PR 2935 at commit …
Test build #449 timed out for PR 2935 at commit …
Jenkins, test this please.
Test build #22221 has started for PR 2935 at commit …
Test build #22222 has started for PR 2935 at commit …
Test build #22221 timed out for PR 2935 at commit …
Test FAILed.
Test build #22222 has finished for PR 2935 at commit …
Test FAILed.
Test build #452 has started for PR 2935 at commit …
Test build #22240 has started for PR 2935 at commit …
Test build #452 timed out for PR 2935 at commit …
Test build #22240 timed out for PR 2935 at commit …
Test FAILed.
Conflicts: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
Test build #482 has started for PR 2935 at commit …
Test build #22347 has started for PR 2935 at commit …
Test build #482 timed out for PR 2935 at commit …
Test build #22347 timed out for PR 2935 at commit …
Test FAILed.
The Jenkins test timeouts are caused by a performance regression introduced in this PR. Deserializing a Hadoop `Configuration` is pretty slow: after removing the broadcast for the configuration, each task may spend an extra 10–50 ms on it. For Hive queries the regression is even larger, since each task may deserialize more than one configuration per query. So we cannot remove the broadcast for the conf unless we can speed up deserialization of the Hadoop `Configuration`.
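A rough way to sanity-check that per-task cost is a standalone micro-benchmark like the sketch below; this is not from the PR, and the run count and timing approach are arbitrary:

```scala
import java.io._
import org.apache.hadoop.conf.Configuration

// Micro-benchmark sketch: round-trip a Hadoop Configuration through its
// Writable serialization and time the deserialization side only.
object ConfDeserBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val bos = new ByteArrayOutputStream()
    conf.write(new DataOutputStream(bos))
    val bytes = bos.toByteArray

    val runs = 100
    val start = System.nanoTime()
    for (_ <- 1 to runs) {
      val restored = new Configuration(false) // skip loading default resources
      restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)))
    }
    val perRunMs = (System.nanoTime() - start) / 1e6 / runs
    println(f"avg Configuration deserialization: $perRunMs%.2f ms")
  }
}
```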
If we can't find a workaround to speed up the deserialization, do you mind closing this PR for now?
We already broadcast the task (RDD and closure) itself, so small data used in the RDD or closure no longer needs to be broadcast explicitly.
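To illustrate the point, here is a hedged sketch (the lookup map, keys, and app name are made up): since the serialized task already travels to executors via a broadcast, a small object captured in the closure ships along with it, making an explicit broadcast of that object largely redundant.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureVsBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("demo").setMaster("local[2]"))
    val smallLookup = Map("a" -> 1, "b" -> 2) // small data used by tasks

    // Explicit broadcast: one more broadcast variable to manage.
    val bc = sc.broadcast(smallLookup)
    val viaBroadcast = sc.parallelize(Seq("a", "b")).map(k => bc.value(k)).collect()

    // Plain closure capture: smallLookup is serialized into the closure,
    // which is itself shipped through the (already broadcast) task binary,
    // so broadcasting it explicitly gains nothing for small data.
    val viaClosure = sc.parallelize(Seq("a", "b")).map(k => smallLookup(k)).collect()

    println(s"${viaBroadcast.toSeq} ${viaClosure.toSeq}")
    sc.stop()
  }
}
```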