[SPARK-20664][core] Delete stale application data from SHS. #20138
Conversation
Detect the deletion of event log files from storage, and remove data about the related application attempt in the SHS. Also contains code to fix SPARK-21571 based on code by ericvandenbergfb.
@ericvandenbergfb
There are some previous comments on this code at: vanzin#40
Test build #85611 has finished for PR 20138 at commit
Overall LGTM, since I already reviewed this in vanzin#40.
      .asScala
      .toList
    stale.foreach { log =>
      if (!log.appId.isDefined) {
Nit: use log.appId.isEmpty instead of !log.appId.isDefined.
        try {
          fs.delete(log, true)
        } catch {
          case e: AccessControlException =>
Nit: e is not used.
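Both nits applied together, as a minimal sketch; the logDebug message is illustrative, not the merged code:

    if (log.appId.isEmpty) {                // instead of !log.appId.isDefined
      try {
        fs.delete(log, true)
      } catch {
        case _: AccessControlException =>   // wildcard pattern, since e was unused
          // Illustrative handling: note the failure and keep going.
          logDebug(s"No permission to delete $log, ignoring.")
      }
    }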
Test build #85686 has finished for PR 20138 at commit
Still looking, but in the reviews of @ericvandenbergfb's changes, it seemed like @ajbozarth and @jiangxb1987 were opposed to the more aggressive cleaning by default. I don't see the argument against it, but I want to make sure they are aware of that change here.
      })
    } catch {
      // let the iteration over logInfos break, since an exception on
you've renamed logInfos to updated
And actually you've moved the try/catch, so this is no longer true; you'll continue to submit all tasks even if one throws an exception. (I guess I'm not really sure why the old code did it that way...)
Maybe we should handle RejectedExecutionException explicitly: under this exception, we can log an error message and stop submitting the remaining tasks.
Actually, RejectedExecutionException shouldn't ever be thrown here. The executor doesn't have a bounded queue, and it's very unlikely you'll ever submit Integer.MAX_VALUE tasks here.
The code didn't use to catch any exception here (it was added along with the comment in a531fe1). Catching the exception doesn't do any harm, I just don't think this code will ever trigger.
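For illustration, a minimal sketch of the explicit handling suggested above, assuming the renamed updated collection plus the provider's replayExecutor and mergeApplicationListing from the surrounding code; this is not the merged implementation:

    import java.util.concurrent.RejectedExecutionException

    try {
      updated.foreach { entry =>
        replayExecutor.submit(new Runnable {
          override def run(): Unit = mergeApplicationListing(entry)
        })
      }
    } catch {
      case e: RejectedExecutionException =>
        // Log once and stop submitting the remaining replay tasks.
        logError("Replay executor rejected a task; stopping submission.", e)
    }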
I haven't had a chance to read through your code, but as @squito said, I am against any default feature that deletes files from the eventLog dir. Many users, such as myself, use one log dir for both the event log as well as their Spark logs. I believe it is a great feature for most use cases and should be available as an option defaulted to off.
Well, perhaps I misrepresented this: you still need to turn event log cleaning on explicitly with the old option, "spark.history.fs.cleaner.enabled". This just doesn't include the "aggressive" option that was originally proposed by @ericvandenbergfb.
Ok, no problems here on that front then. If I have time later to do a proper review and this hasn't been merged yet, I'll take a better look at the whole PR.
          (Some(app.info.id), app.attempts.head.info.attemptId)

        case _ =>
          (None, None)
I think a comment here explaining that writing an entry with no appId will mark this log file as eligible for automatic cleanup, if it's still in that state after max_log_age, would help. (If I understood correctly.)
@@ -834,6 +906,9 @@ private[history] case class FsHistoryProviderMetadata(

private[history] case class LogInfo(
    @KVIndexParam logPath: String,
    @KVIndexParam("lastProcessed") lastProcessed: Long,
    appId: Option[String],
Also a comment here explaining why appId is an Option, as that is unexpected.
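A sketch of the kind of comment being asked for, assuming the remaining fields of LogInfo in this PR (attemptId, fileSize):

    private[history] case class LogInfo(
        @KVIndexParam logPath: String,
        @KVIndexParam("lastProcessed") lastProcessed: Long,
        // None until the log has been parsed far enough to find an
        // application start event; entries that still have no appId
        // after the max log age are treated as invalid and cleaned up.
        appId: Option[String],
        attemptId: Option[String],
        fileSize: Long)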
I was actually suggesting having the "aggressive" option turned on by default, and I'm also fine with not having that config at all. I will take a closer look at this later; thank you for pinging me @squito!
  }

  test("SPARK-21571: clean up removes invalid history files") {
    val clock = new ManualClock(TimeUnit.DAYS.toMillis(120))
just curious, why start at 120 days? (not that it matters ...)
This line:

    val maxTime = clock.getTimeMillis() - conf.get(MAX_LOG_AGE_S) * 1000

Without that, maxTime would be negative, and that seems to be triggering a bug somewhere else. I need to take a look at exactly what's happening there, but it seems unrelated to this change.
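To make the arithmetic concrete, a small sketch assuming the default spark.history.fs.cleaner.maxAge of 7 days:

    import java.util.concurrent.TimeUnit

    val nowMs    = TimeUnit.DAYS.toMillis(120)  // test clock start
    val maxAgeMs = TimeUnit.DAYS.toMillis(7)    // assumed cleaner max age
    val maxTime  = nowMs - maxAgeMs             // positive, as required
    // With a clock starting at 0, maxTime would be -604,800,000 ms.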
FYI #20284 fixes the underlying bug.
one small suggestion for an additional test, otherwise lgtm
    clock.advance(TimeUnit.DAYS.toMillis(2))
    provider.checkForLogs()
    provider.cleanLogs()
    assert(new File(testDir.toURI).listFiles().size === 0)
I think you should add a case where one file starts out empty, say even for one full day, but then becomes valid before the expiration time, and make sure it does not get cleaned up.
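A hedged sketch of that test, borrowing helper names (newLogFile, writeFile, createTestConf, testDir) assumed from the surrounding FsHistoryProviderSuite; treat the signatures as approximate:

    test("empty log that becomes valid before max age is not cleaned up") {
      val clock = new ManualClock(TimeUnit.DAYS.toMillis(120))
      val provider = new FsHistoryProvider(createTestConf(), clock)

      // Start with an empty, in-progress (and thus invalid) log file.
      val log = newLogFile("emptyThenValid", None, inProgress = true)
      log.createNewFile()
      clock.advance(TimeUnit.DAYS.toMillis(1))
      provider.checkForLogs()
      provider.cleanLogs()

      // The log becomes valid well before the max age elapses.
      writeFile(log, true, None,
        SparkListenerApplicationStart("app", Some("emptyThenValid"),
          clock.getTimeMillis(), "test", None))
      clock.advance(TimeUnit.DAYS.toMillis(1))
      provider.checkForLogs()
      provider.cleanLogs()

      // The now-valid log must survive cleanup.
      assert(new File(testDir.toURI).listFiles().size === 1)
    }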
lgtm
LGTM |
Test build #86199 has finished for PR 20138 at commit
This PR introduces a more thoughtful event log cleanup method, if users have
LGTM and it looks safe.
As RC1 failed and RC2 is going to be cut soon, I'm going to merge this to master & 2.3.
Detect the deletion of event log files from storage, and remove data
about the related application attempt in the SHS. Also contains code
to fix SPARK-21571 based on code by ericvandenbergfb.

Author: Marcelo Vanzin <[email protected]>

Closes #20138 from vanzin/SPARK-20664.

(cherry picked from commit fed2139)
Signed-off-by: Imran Rashid <[email protected]>