
[SPARK-1860] More conservative app directory cleanup. #2609

Closed
wants to merge 4 commits into from

Conversation

@mccheah (Contributor) commented Oct 1, 2014

This is my first contribution to the project, so I apologize for any significant errors.

This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner.

Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks.

The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale.
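The staleness check described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `CleanupSketch`, `containsAnyNewFiles`, and the parameter names are all assumed for the example.

```scala
import java.io.File

// Sketch: a directory is considered "in use" if the directory itself, or any
// file/directory beneath it, was modified within the retention window.
object CleanupSketch {
  def containsAnyNewFiles(dir: File, cutoffSeconds: Long, nowMillis: Long): Boolean = {
    val cutoffMillis = nowMillis - cutoffSeconds * 1000
    def isNew(f: File): Boolean = f.lastModified > cutoffMillis
    // listFiles returns null for non-directories, hence the Option wrap
    def walk(f: File): Boolean =
      isNew(f) || Option(f.listFiles).getOrElse(Array.empty[File]).exists(walk)
    walk(dir)
  }
}
```

Under this scheme the worker would delete an app-* directory only when `containsAnyNewFiles` returns false for it.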

@AmplabJenkins

Can one of the admins verify this patch?

if (appDirs == null) {
throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter {
A Contributor commented on the diff:

The typical style here is to put the parameter on this line, i.e.

appDirs.filter { dir =>
  dir.isDirectory && !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
}.foreach(Utils.deleteRecursively)

@aarondav (Contributor) commented Oct 1, 2014

As mentioned in the JIRA, I think it would be very good to also check the appId to make sure the Executors are indeed terminated. It does not seem unreasonable to me that some Spark clusters might remain idle for a couple days before someone comes back to them, with the expectation that they still work.

I think we can achieve this in a pretty type-safe manner by changing the ExecutorRunner to take in the "executorWorkDir" instead of "workDir", and thus making Worker have control over the fact that app dirs are named with the app's ID.
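A hedged sketch of that suggested refactor (class and method names below are illustrative, not the real Spark signatures): the Worker derives the per-executor directory itself and passes it down, so the appId-based naming lives in exactly one place.

```scala
import java.io.File

// Sketch only: ExecutorRunnerSketch receives a ready-made executor directory
// and never sees workDir, so it cannot invent its own directory layout.
class ExecutorRunnerSketch(val executorDir: File)

object WorkerSketch {
  // The Worker alone decides that app dirs are named with the app's ID.
  def executorDirFor(workDir: File, appId: String, execId: Int): File =
    new File(new File(workDir, appId), execId.toString)

  def makeRunner(workDir: File, appId: String, execId: Int): ExecutorRunnerSketch =
    new ExecutorRunnerSketch(executorDirFor(workDir, appId, execId))
}
```

With this shape, a cleanup pass in the Worker can safely correlate app-* directory names against the set of applications it still knows to be running.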

@mccheah force-pushed the worker-better-app-dir-cleanup branch from 620e4a5 to 94891dd on October 1, 2014 22:29
// Create the executor's working directory
val executorDir = new File(workDir, appId + "/" + execId)
if (!executorDir.mkdirs()) {
throw new IOException("Failed to create directory " + executorDir)
A Contributor commented on the diff:

I just realized that the ExecutorStateChanged here does not give any indication of what happened. Would you mind setting the message to Some(e.toString)? This will include the Exception's class and message, but not the stack trace, which seems reasonable.

A Contributor commented on the diff:

To be clear, I'm referring to the catch clause of this try.

@aarondav (Contributor) commented Oct 2, 2014

Jenkins, ok to test.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21164/

@mccheah force-pushed the worker-better-app-dir-cleanup branch 2 times, most recently from a045620 to 7b7cae4 on October 2, 2014 01:24
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit a045620.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit a045620.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21167/

@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 7b7cae4.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 7b7cae4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21168/

Before, the app-* directory was cleaned up whenever its timestamp was
older than a given time. However, the timestamp on a directory may be
older than the timestamps of the files the directory contains. This
change only cleans up app-* directories if all of the directory's
contents are old.
@mccheah force-pushed the worker-better-app-dir-cleanup branch from 7b7cae4 to 77a9de0 on October 2, 2014 01:33
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 77a9de0.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 77a9de0.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21169/

if (appDirs == null) {
throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter { dir => {
A Contributor commented on the diff:

You do not need the extra brace after "dir =>"; the closure body uses the enclosing braces' scope.

FileUtils.listFiles from Apache Commons does not list directories. Use
FileUtils.listFilesAndDirs instead.

Also reorganizes a few imports and applies style changes from the pull-request review.
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit e0a1f2e.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit e0a1f2e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21211/

@@ -174,7 +168,7 @@ private[spark] class ExecutorRunner(
killProcess(None)
}
case e: Exception => {
logError("Error running executor", e)
logError(e.toString, e)
A Contributor commented on the diff:

I think this is fine as it was -- I was referring to this line:
https://github.com/apache/spark/pull/2609/files#diff-916ca56b663f178f302c265b7ef38499R271

val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
val cutoffTimeInMillis = (currentTimeMillis - (cutoff * 1000))
val newFiles = files.filter { file => file.lastModified > cutoffTimeInMillis }
(dir.lastModified > cutoffTimeInMillis) || (!newFiles.isEmpty)
A Contributor commented on the diff:

nit: newFiles.nonEmpty

A Contributor commented on the diff:

FileUtils.listFilesAndDirs() appears to include the top-level directory as well, so I don't think we need to special-case it.

@aarondav (Contributor) commented Oct 2, 2014

This change looks good to me. I tested it locally with a small cluster, and it behaves as expected. My main remaining comments are about the logging; it was pretty opaque as to when the feature was turned on and when it was actually deleting things.

@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 620f52f.

  • This patch merges cleanly.

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 620f52f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21223/

@mccheah force-pushed the worker-better-app-dir-cleanup branch from 620f52f to 802473e on October 2, 2014 22:03
@SparkQA commented Oct 2, 2014

QA tests have started for PR 2609 at commit 802473e.

  • This patch merges cleanly.

@@ -242,7 +267,8 @@ private[spark] class Worker(
master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
} catch {
case e: Exception => {
logError("Failed to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
logError("Failed to launch executor %s/%d for %s. Caused by exception: %s"
A Contributor commented on the diff:

The second parameter to log* is an exception, which is printed with full stack trace, so we should use this instead. Additionally, this code was written when Spark was using 2.8 or 2.9, now we can just use string interpolation:

logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)

(The initial "s" activates the string interpolation.)

Additionally, looking at line 276, let's change it to

master ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(e.toString), None)

@SparkQA commented Oct 2, 2014

QA tests have finished for PR 2609 at commit 802473e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21224/

@SparkQA commented Oct 3, 2014

QA tests have started for PR 2609 at commit 87b5d03.

  • This patch merges cleanly.

@SparkQA commented Oct 3, 2014

QA tests have finished for PR 2609 at commit 87b5d03.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21255/

@aarondav (Contributor) commented Oct 3, 2014

LGTM, merging into master. Thanks!

@marmbrus (Contributor) commented Oct 6, 2014

FYI, this broke the build for some versions of Hadoop:

[INFO] Compiling 395 Scala sources and 29 Java sources to <https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/ws/core/target/scala-2.10/classes...>
[ERROR] <https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/ws/core/src/main/scala/org/apache/spark/util/Utils.scala>:720: value listFilesAndDirs is not a member of object org.apache.commons.io.FileUtils
[ERROR]       val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
[ERROR]                             ^
[ERROR] one error found

This is being addressed by #2662.
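For reference, the failure is avoidable without Commons IO at all, since plain recursion over java.io.File needs no library method. This is only a sketch under assumed names (`ListAllSketch`); the actual fix is in #2662 and may take a different approach.

```scala
import java.io.File

// Sketch: recursively collect a directory and everything beneath it,
// mirroring listFilesAndDirs (the root directory itself is included),
// without depending on any particular Commons IO version.
object ListAllSketch {
  def filesAndDirs(dir: File): Seq[File] =
    dir +: Option(dir.listFiles).getOrElse(Array.empty[File]).toSeq.flatMap { f =>
      if (f.isDirectory) filesAndDirs(f) else Seq(f)
    }
}
```
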
