[SPARK-1860] More conservative app directory cleanup. #2609
Conversation
Can one of the admins verify this patch?
if (appDirs == null) {
  throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter {
The typical style here is to put the parameter on this line, i.e.
appDirs.filter { dir =>
dir.isDirectory && !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
}.foreach(Utils.deleteRecursively)
As mentioned in the JIRA, I think it would be very good to also check the appId to make sure the Executors are indeed terminated. It does not seem unreasonable to me that some Spark clusters might remain idle for a couple days before someone comes back to them, with the expectation that they still work. I think we can achieve this in a pretty type-safe manner by changing the
Force-pushed from 620e4a5 to 94891dd.
// Create the executor's working directory
val executorDir = new File(workDir, appId + "/" + execId)
if (!executorDir.mkdirs()) {
  throw new IOException("Failed to create directory " + executorDir)
I just realized that the ExecutorStateChanged here does not give any indication of what happened. Would you mind setting the message to Some(e.toString)? This will include the exception's class and message, but not the stack trace, which seems reasonable.
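For reference, `Throwable.toString` yields just the exception's class name and message, with no stack trace, which is what makes `Some(e.toString)` a compact status message. A quick sketch (the path in the message is made up):

```scala
// e.toString gives "<fully-qualified class>: <message>" and nothing else,
// so it can be shipped in a status message without dragging along a stack
// trace. The directory path below is purely illustrative.
val e = new java.io.IOException("Failed to create directory /tmp/app-123/0")
val msg = e.toString
// msg == "java.io.IOException: Failed to create directory /tmp/app-123/0"
println(msg)
```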
To be clear, I'm referring to the catch clause of this try.
Jenkins, ok to test.
Test FAILed.
Force-pushed from a045620 to 7b7cae4.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test FAILed.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test FAILed.
Before, the app-* directory was cleaned up whenever its timestamp was older than a given time. However, the timestamp on a directory may be older than the timestamps of the files the directory contains. This change only cleans up app-* directories if all of the directory's contents are old.
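The check described in this commit message can be sketched as a recursive scan: treat a directory as still in use if the directory itself, or anything beneath it, was modified within the retention window. The helper below is an illustrative stand-in, not the actual `Utils.doesDirectoryContainAnyNewFiles` implementation:

```scala
import java.io.File

// Hypothetical helper mirroring the commit message: return true if `dir`
// or any file/subdirectory under it is newer than the cutoff. A null from
// listFiles (I/O error) is treated as "no children" here for brevity.
def containsAnyNewFiles(dir: File, retentionSecs: Long): Boolean = {
  val cutoffMillis = System.currentTimeMillis() - retentionSecs * 1000L
  def isNew(f: File): Boolean =
    f.lastModified > cutoffMillis ||
      (f.isDirectory &&
        Option(f.listFiles()).getOrElse(Array.empty[File]).exists(isNew))
  isNew(dir)
}
```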
Force-pushed from 7b7cae4 to 77a9de0.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test FAILed.
if (appDirs == null) {
  throw new IOException("ERROR: Failed to list files in " + appDirs)
}
appDirs.filter { dir => {
You do not need the extra bracket after the "dir =>". We use the enclosing bracket's scope.
FileUtils.listFiles from apache commons does not list directories. Use FileUtils.listFilesAndDirs instead. Also reorganized a few imports and applied style changes from the pull request review.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test PASSed.
@@ -174,7 +168,7 @@ private[spark] class ExecutorRunner(
        killProcess(None)
      }
    case e: Exception => {
-     logError("Error running executor", e)
+     logError(e.toString, e)
I think this is fine as it was -- I was referring to this line:
https://github.com/apache/spark/pull/2609/files#diff-916ca56b663f178f302c265b7ef38499R271
val files = FileUtils.listFilesAndDirs(dir, TrueFileFilter.TRUE, TrueFileFilter.TRUE)
val cutoffTimeInMillis = (currentTimeMillis - (cutoff * 1000))
val newFiles = files.filter { file => file.lastModified > cutoffTimeInMillis }
(dir.lastModified > cutoffTimeInMillis) || (!newFiles.isEmpty)
nit: newFiles.nonEmpty
FileUtils.listFilesAndDirs() appears to include the top-level directory as well, so I don't think we need to special-case it.
This change looks good to me. I tested it locally with a small cluster, and it behaves as expected. My main remaining comments are about the logging, as it was pretty opaque when the feature was turned on and when it was actually deleting things.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test FAILed.
Force-pushed from 620f52f to 802473e.
QA tests have started for PR 2609 at commit
@@ -242,7 +267,8 @@ private[spark] class Worker(
      master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
    } catch {
      case e: Exception => {
-       logError("Failed to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
+       logError("Failed to launch executor %s/%d for %s. Caused by exception: %s"
The second parameter to log* is an exception, which is printed with a full stack trace, so we should use that instead. Additionally, this code was written when Spark was using Scala 2.8 or 2.9; now we can just use string interpolation:
logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
(The initial "s" activates the string interpolation.)
Additionally, looking on line 276, let's change it to
master ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(e.toString), None)
QA tests have finished for PR 2609 at commit
Test PASSed.
QA tests have started for PR 2609 at commit
QA tests have finished for PR 2609 at commit
Test PASSed.
LGTM, merging into master. Thanks!
FYI, this broke the build for some versions of Hadoop:
This is being addressed by #2662. |
This is my first contribution to the project, so I apologize for any significant errors.
This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner.
Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks.
The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale.
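Under those rules, the resulting cleanup pass might look like the sketch below. `cleanupOldAppDirs`, the inline `isNew` check, and `deleteRecursively` are illustrative stand-ins for the Worker's cleanup code and the `Utils` helpers in the actual patch:

```scala
import java.io.File

// Sketch of the conservative cleanup described above: list the worker's
// app-* directories and delete only those whose entire contents (the
// directory itself and everything under it) are older than the cutoff.
def cleanupOldAppDirs(workDir: File, retentionSecs: Long): Unit = {
  val cutoffMillis = System.currentTimeMillis() - retentionSecs * 1000L

  // A directory counts as "new" if it, or anything beneath it, was
  // modified after the cutoff.
  def isNew(f: File): Boolean =
    f.lastModified > cutoffMillis ||
      (f.isDirectory &&
        Option(f.listFiles()).getOrElse(Array.empty[File]).exists(isNew))

  // Minimal recursive delete; the real code uses Utils.deleteRecursively.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory)
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  val appDirs = workDir.listFiles()
  if (appDirs == null) {
    throw new java.io.IOException("Failed to list files in " + workDir)
  }
  appDirs
    .filter(dir => dir.isDirectory && !isNew(dir))
    .foreach(deleteRecursively)
}
```

A directory that was idle but whose files were touched recently survives the pass, which is exactly the behavior the old timestamp-only check got wrong.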