SPARK-729: predictable closure capture #1322
Conversation
This method allows code that needs the currently-active ContextCleaner to access it via a DynamicVariable.
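A minimal sketch of the DynamicVariable pattern this commit describes, using a hypothetical `CleanerContext` holder (not Spark's actual internals; the real change wires this into SparkContext):

```scala
import scala.util.DynamicVariable

// Hypothetical holder for the currently-active cleaner. A DynamicVariable
// scopes the value to a block of execution, so deeply nested code can read
// it without having it threaded through every call.
object CleanerContext {
  private val active = new DynamicVariable[Option[AnyRef]](None)

  // Run `body` with `cleaner` visible as the currently-active cleaner.
  def withCleaner[T](cleaner: AnyRef)(body: => T): T =
    active.withValue(Some(cleaner))(body)

  // Code running inside withCleaner's dynamic extent sees Some(cleaner).
  def currentCleaner: Option[AnyRef] = active.value
}
```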
The two tests added to ClosureCleanerSuite ensure that variable values are captured at RDD definition time, not at job-execution time.
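Illustratively, the property these tests pin down has this shape (assuming `sc` is a live SparkContext; this is not the suite's literal code):

```scala
// A free variable mutated after the RDD is defined must not affect the
// result: `factor` is frozen when map() cleans its closure, not at job time.
var factor = 2
val doubled = sc.parallelize(1 to 3).map(_ * factor) // capture happens here
factor = 0                                           // later mutation is ignored
assert(doubled.collect().toSeq == Seq(2, 4, 6))
```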
The environments of serializable closures are now captured as part of closure cleaning. Since we already proactively check most closures for serializability, ClosureCleaner.clean now returns the result of deserializing the serialized version of the cleaned closure.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
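One way to read "returns the result of deserializing the serialized version": a serialization round-trip pins the closure's free variables to their current values. A hedged sketch of that idea (not ClosureCleaner's actual code):

```scala
import java.io._

// Freeze a closure's environment by round-tripping it through Java
// serialization; the returned copy holds the values as of this call.
def freeze[F <: AnyRef](f: F): F = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(f)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  in.readObject().asInstanceOf[F]
}
```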
There are two possible cases for runJob calls: either they are called by RDD action methods from inside Spark or they are called from client code. There's no need to proactively check the closure argument to runJob for serializability or force variable capture in either case:

1. if they are called by RDD actions, their closure arguments consist of mapping an already-serializable closure (with an already-frozen environment) over each partition of the RDD;
2. in both cases, the closure is about to execute, so the benefit of proactively checking for serializability (or ensuring immediate variable capture) is nonexistent.

(Note that ensuring capture via serializability on closure arguments to runJob also causes PySpark accumulators to fail to update.)

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
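For case 1, a sketch of what an action's runJob closure looks like (illustrative, not RDD.count()'s actual code): the per-partition function only iterates, and any user closure it touches was already cleaned and frozen by the transformation that created the RDD.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// The closure passed to runJob merely walks each partition; re-checking it
// for serializability here would duplicate work done at definition time.
def countElements(sc: SparkContext, rdd: RDD[Int]): Long =
  sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong).sum
```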
This splits the test identifying expected failures due to closure serializability into three cases.
Conflicts:
	core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16383/
Jenkins, add to whitelist and test this please
@rxin you may want to look at this with your broadcast change
QA tests have started for PR 1322. This patch DID NOT merge cleanly!
Matei was referring to #1498
Thanks, @rxin
Alright, just ping me (with @mateiz) when you think it's ready.
BTW @willb, if this is not ready, do you mind closing the PR and resending when it is? We'd like to minimize the number of open PRs that aren't actively being reviewed. |
@mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it. I'll reopen when I have a solution. |
Alright, feel free to describe this on the JIRA too if you'd like input. |
SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying `ClosureCleaner` to serialize and deserialize its argument.

This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0. In particular, it ensures that the cloned `Broadcast` objects produced by closure capture are registered with `ContextCleaner` so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away.

(See #189 for additional discussion and background.)
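To make the intended semantics concrete, a hedged illustration of the bug (assuming `sc` is a SparkContext; the values are made up): before this change, re-running an action could observe a later value of a free variable, because the environment was captured at serialization time rather than creation time.

```scala
var v = 1
val rdd = sc.parallelize(Seq(10)).map(_ + v)
val first = rdd.collect()  // Array(11) either way: v is still 1 here
v = 2
val second = rdd.collect() // old behavior could yield Array(12); with this PR: Array(11)
```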