[SPARK-9193] Avoid assigning tasks to "lost" executor(s) #7528
Conversation
Test build #37814 has finished for PR 7528 at commit
Hi @GraceH thanks for finding this bug and posting the PR. The changes look completely reasonable, but would it be possible to add a unit test? I know this is a big ask, since it doesn't look like there are any tests right now that exercise anything like this. I believe the changes are correct, so the test isn't so much about verifying this fix as it is about preventing future regressions. It would also really help anyone following along in the code in the future -- e.g., it took me a while to figure out how anything was ever removed from executorsPendingToRemove.
@squito Thanks for the confirmation. If you do think it is necessary to add a unit test for this, please let me know.
// Filter out executors under killing
.filterKeys(!executorsPendingToRemove.contains(_))
.map { case (id, executorData) =>
  new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
This is a little hard to read. Could you rewrite this as follows:
private def makeOffers() {
  val activeExecutors = executorDataMap.filterKeys(!executorsPendingToRemove)
  val workerOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, ...)
  }
  launchTasks(scheduler.resourceOffers(workerOffers))
}
Good point. Will re-organize the code snippet.
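For reference, a filled-in version of that suggestion might look roughly like the sketch below, with the elided WorkerOffer arguments taken from the diff shown earlier. This assumes the surrounding CoarseGrainedSchedulerBackend members (executorDataMap, executorsPendingToRemove, scheduler, launchTasks) as they appear in this PR; the merged code may differ in small details such as the trailing .toSeq.

private def makeOffers() {
  // Offer resources only from executors that are not pending removal
  val activeExecutors = executorDataMap.filterKeys(!executorsPendingToRemove.contains(_))
  val workerOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workerOffers))
}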
Hi @GraceH this looks great. I left a few minor wording suggestions to improve readability. As for unit tests, I do agree that we should have them, but it seems outside the scope of this fix since properly testing all of this logic would require a non-trivial refactor. In this case, I am more inclined to merge in this fix than to delay it indefinitely.
Thanks @andrewor14. That makes sense. Will have a revised version with more readable lines.
Test build #37953 has finished for PR 7528 at commit
yeah, I don't love the idea of adding things w/out tests, but in this case I suppose it's best left for the future. lgtm pending the tests passing
Thanks @squito.
Test build #37954 has finished for PR 7528 at commit
When executors are killed by dynamic allocation, tasks can still be assigned to those "lost" executors, which causes task failures, and even job failure if the error repeats 4 times. The root cause is that killExecutors does not remove the executors being killed from the active list right away; it relies on the later OnDisassociated event to refresh the list, and that delay depends on cluster status (from several milliseconds to sub-minute). Any task scheduled during that window can be assigned to an executor that is "active" but "under killing", and the task then fails with "executor lost". The fix is to exclude executors pending removal in makeOffers(), so that no task is offered to an executor that is about to be lost.

Author: Grace <[email protected]>

Closes #7528 from GraceH/AssignToLostExecutor and squashes the following commits:

ecc1da6 [Grace] scala style fix
6e2ed96 [Grace] Re-word makeOffers by more readable lines
b5546ce [Grace] Add comments about the fix
30a9ad0 [Grace] Avoid assigning tasks to lost executors

(cherry picked from commit 6592a60)
Signed-off-by: Imran Rashid <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
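To make the race described above concrete, here is a small, self-contained sketch of the mechanism. The names and types are illustrative stand-ins for the scheduler backend's state, not Spark's actual classes:

import scala.collection.mutable

// Illustrative model only: it mimics the interaction between killExecutors,
// the disassociation event, and makeOffers described in this PR, using
// plain collections instead of Spark's real scheduler backend.
object LostExecutorModel {
  case class Offer(executorId: String, host: String, freeCores: Int)

  // Executors the backend still considers registered: id -> (host, free cores)
  private val executors = mutable.HashMap(
    "exec-1" -> ("host-a", 4),
    "exec-2" -> ("host-b", 4))

  // Executors that have been asked to die but whose disassociation event
  // has not arrived yet (this can take milliseconds up to sub-minute).
  private val executorsPendingToRemove = mutable.HashSet[String]()

  // Dynamic allocation decides to kill an executor: it is only marked here;
  // the entry stays in `executors` until disassociation.
  def killExecutor(id: String): Unit = {
    executorsPendingToRemove += id
  }

  // The disassociation event finally arrives and the executor is dropped.
  def onDisassociated(id: String): Unit = {
    executors -= id
    executorsPendingToRemove -= id
  }

  // The fix: exclude executors pending removal when building offers, so no
  // task can be scheduled onto an executor that is about to be lost.
  def makeOffers(): Seq[Offer] =
    executors.toSeq.collect {
      case (id, (host, cores)) if !executorsPendingToRemove.contains(id) =>
        Offer(id, host, cores)
    }

  def main(args: Array[String]): Unit = {
    killExecutor("exec-2")                                  // marked, not yet disassociated
    assert(makeOffers().forall(_.executorId != "exec-2"))   // no offer on the dying executor
    onDisassociated("exec-2")                               // now it is really gone
    println(makeOffers())                                   // only exec-1 remains
  }
}

Without the fix, makeOffers would ignore executorsPendingToRemove, so during the window between killExecutor and onDisassociated an offer for exec-2 would still be produced and a task could land on an executor that is about to disappear.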