
[SPARK-1860] More conservative app directory cleanup. #2609

Closed
wants to merge 4 commits into from

Conversation

@mccheah (Contributor) commented Oct 1, 2014

This is my first contribution to the project, so I apologize for any significant errors.

This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner.

Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks.

The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale.
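The staleness check described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `CleanupSketch`, `containsAnyNewFiles`, and the parameter names are all assumed for the example.

```scala
import java.io.File

// Sketch: a directory is considered "in use" if the directory itself, or any
// file/directory beneath it, was modified within the retention window.
object CleanupSketch {
  def containsAnyNewFiles(dir: File, cutoffSeconds: Long, nowMillis: Long): Boolean = {
    val cutoffMillis = nowMillis - cutoffSeconds * 1000
    def isNew(f: File): Boolean = f.lastModified > cutoffMillis
    // listFiles returns null for non-directories, hence the Option wrap
    def walk(f: File): Boolean =
      isNew(f) || Option(f.listFiles).getOrElse(Array.empty[File]).exists(walk)
    walk(dir)
  }
}
```

Under this scheme the worker would delete an app-* directory only when `containsAnyNewFiles` returns false for it.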

@AmplabJenkins

Can one of the admins verify this patch?

if (appDirs == null) {
throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter {
A Contributor commented on the diff:

The typical style here is to put the parameter on this line, i.e.

appDirs.filter { dir =>
  dir.isDirectory && !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
}.foreach(Utils.deleteRecursively)

@aarondav (Contributor) commented Oct 1, 2014

As mentioned in the JIRA, I think it would be very good to also check the appId to make sure the Executors are indeed terminated. It does not seem unreasonable to me that some Spark clusters might remain idle for a couple days before someone comes back to them, with the expectation that they still work.

I think we can achieve this in a pretty type-safe manner by changing the ExecutorRunner to take in the "executorWorkDir" instead of "workDir", and thus making Worker have control over the fact that app dirs are named with the app's ID.
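A hedged sketch of that suggested refactor (class and method names below are illustrative, not the real Spark signatures): the Worker derives the per-executor directory itself and passes it down, so the appId-based naming lives in exactly one place.

```scala
import java.io.File

// Sketch only: ExecutorRunnerSketch receives a ready-made executor directory
// and never sees workDir, so it cannot invent its own directory layout.
class ExecutorRunnerSketch(val executorDir: File)

object WorkerSketch {
  // The Worker alone decides that app dirs are named with the app's ID.
  def executorDirFor(workDir: File, appId: String, execId: Int): File =
    new File(new File(workDir, appId), execId.toString)

  def makeRunner(workDir: File, appId: String, execId: Int): ExecutorRunnerSketch =
    new ExecutorRunnerSketch(executorDirFor(workDir, appId, execId))
}
```

With this shape, a cleanup pass in the Worker can safely correlate app-* directory names against the set of applications it still knows to be running.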

@mccheah force-pushed the worker-better-app-dir-cleanup branch from 620e4a5 to 94891dd on October 1, 2014 22:29
// Create the executor's working directory
val executorDir = new File(workDir, appId + "/" + execId)
if (!executorDir.mkdirs()) {
throw new IOException("Failed to create directory " + executorDir)
A Contributor commented on the diff:

I just realized that the ExecutorStateChanged here does not give any indication of what happened. Would you mind setting the message to Some(e.toString)? This will include the Exception's class and message, but not the stack trace, which seems reasonable.

A Contributor commented on the diff:

To be clear, I'm referring to the catch clause of this try.

@aarondav (Contributor) commented Oct 2, 2014

Jenkins, ok to test.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21164/

@mccheah force-pushed the worker-better-app-dir-cleanup branch 2 times, most recently from a045620 to 7b7cae4 on October 2, 2014 01:24
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit a045620.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit a045620.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21167/

@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 7b7cae4.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 7b7cae4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21168/

Before, the app-* directory was cleaned up whenever its timestamp was
older than a given time. However, the timestamp on a directory may be
older than the timestamps of the files the directory contains. This
change only cleans up app-* directories if all of the directory's
contents are old.
@mccheah force-pushed the worker-better-app-dir-cleanup branch from 7b7cae4 to 77a9de0 on October 2, 2014 01:33
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 77a9de0.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 77a9de0.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21169/

if (appDirs == null) {
throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter { dir => {
A Contributor commented on the diff:

You do not need the extra brace after "dir =>"; the closure body uses the enclosing braces' scope.

FileUtils.listFiles from Apache Commons does not list directories. Use
FileUtils.listFilesAndDirs instead.

Also reorganizes a few imports and applies style changes from the pull-request review.
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit e0a1f2e.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit e0a1f2e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21211/

@@ -174,7 +168,7 @@ private[spark] class ExecutorRunner(
killProcess(None)
}
case e: Exception => {
logError("Error running executor", e)
logError(e.toString, e)
A Contributor commented on the diff:

I think this is fine as it was -- I was referring to this line:
https://github.com/apache/spark/pull/2609/files#diff-916ca56b663f178f302c265b7ef38499R271

val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
val cutoffTimeInMillis = (currentTimeMillis - (cutoff * 1000))
val newFiles = files.filter { file => file.lastModified > cutoffTimeInMillis }
(dir.lastModified > cutoffTimeInMillis) || (!newFiles.isEmpty)
A Contributor commented on the diff:

nit: newFiles.nonEmpty

A Contributor commented on the diff:

FileUtils.listFilesAndDirs() appears to include the top-level directory as well, so I don't think we need to special-case it.

@aarondav (Contributor) commented Oct 2, 2014

This change looks good to me. I tested it locally with a small cluster, and it behaves as expected. My main remaining comments are about the logging; it was pretty opaque as to when the feature was turned on and when it was actually deleting things.

@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 620f52f.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 620f52f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21223/

@mccheah force-pushed the worker-better-app-dir-cleanup branch from 620f52f to 802473e on October 2, 2014 22:03
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 802473e.

  • This patch merges cleanly.

@@ -242,7 +267,8 @@ private[spark] class Worker(
master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
} catch {
case e: Exception => {
logError("Failed to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
logError("Failed to launch executor %s/%d for %s. Caused by exception: %s"
A Contributor commented on the diff:

The second parameter to log* is an exception, which is printed with full stack trace, so we should use this instead. Additionally, this code was written when Spark was using 2.8 or 2.9, now we can just use string interpolation:

logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)

(The initial "s" activates the string interpolation.)

Additionally, looking at line 276, let's change it to

master ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(e.toString), None)

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 802473e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21224/

@SparkQA commented Oct 3, 2014

QA tests have started for PR 2609 at commit 87b5d03.

  • This patch merges cleanly.

@SparkQA commented Oct 3, 2014

QA tests have finished for PR 2609 at commit 87b5d03.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21255/

@aarondav (Contributor) commented Oct 3, 2014

LGTM, merging into master. Thanks!

@marmbrus (Contributor) commented Oct 6, 2014

FYI, this broke the build for some versions of Hadoop:

[INFO] Compiling 395 Scala sources and 29 Java sources to <https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/ws/core/target/scala-2.10/classes...>
[ERROR] <https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/ws/core/src/main/scala/org/apache/spark/util/Utils.scala>:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils
[ERROR]       val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
[ERROR]                             ^
[ERROR] one error found

This is being addressed by #2662.
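For reference, the failure is avoidable without Commons IO at all, since plain recursion over java.io.File needs no library method. This is only a sketch under assumed names (`ListAllSketch`); the actual fix is in #2662 and may take a different approach.

```scala
import java.io.File

// Sketch: recursively collect a directory and everything beneath it,
// mirroring listFilesAndDirs (the root directory itself is included),
// without depending on any particular Commons IO version.
object ListAllSketch {
  def filesAndDirs(dir: File): Seq[File] =
    dir +: Option(dir.listFiles).getOrElse(Array.empty[File]).toSeq.flatMap { f =>
      if (f.isDirectory) filesAndDirs(f) else Seq(f)
    }
}
```
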
