
[SPARK-9193] Avoid assigning tasks to "lost" executor(s) #7528

Closed
GraceH wants to merge 4 commits from GraceH/AssignToLostExecutor

Conversation

GraceH (Contributor) commented Jul 20, 2015

Currently, when dynamic allocation kills executors, tasks are sometimes mis-assigned to those lost executors. Such mis-assignment causes task failures, or even job failure if the same task fails 4 times.

The root cause is that _killExecutors_ does not immediately remove the executors it is killing; the active executor list is only refreshed later, when the _OnDisassociated_ event arrives. That delay depends on cluster conditions and ranges from a few milliseconds to close to a minute. Any task scheduled during that window can be assigned to an executor that is still listed as "active" but is already being killed, and the task then fails with "executor lost". The better approach is to exclude executors pending removal in makeOffers(), so that tasks are never offered to executors that are about to be lost.
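A minimal sketch of that filtering, assuming the names used in the review snippet later in this thread (executorDataMap, executorsPendingToRemove, WorkerOffer, launchTasks); it illustrates the idea rather than reproducing the exact patch:

private def makeOffers() {
  // Offer resources only from executors that have not already been marked
  // for removal by killExecutors; executors pending removal produce no offer.
  val activeExecutors = executorDataMap.filterKeys(!executorsPendingToRemove.contains(_))
  val workerOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workerOffers))
}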

SparkQA commented Jul 20, 2015

Test build #37814 has finished for PR 7528 at commit b5546ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IllegalArgumentException(Exception):
    • case class DayOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class ConcatWs(children: Seq[Expression])

GraceH changed the title from Avoid assigning tasks to "lost" executor(s) to [SPARK-9193] Avoid assigning tasks to "lost" executor(s) on Jul 20, 2015
squito (Contributor) commented Jul 20, 2015

Hi @GraceH, thanks for finding this bug and posting the PR. The changes look completely reasonable, but would it be possible to add a unit test? I know this is a big ask, since it doesn't look like there are any tests right now that exercise anything like this for CoarseGrainedSchedulerBackend, so it will probably take some work to do.

I believe these changes are correct, so the test isn't so much about verifying this fix as it is about preventing future regressions. It would also really help anyone following the code in the future -- e.g., it took me a while to figure out how anything was ever removed from executorsPendingToRemove.
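For reference, a regression check along these lines might look roughly like the self-contained sketch below. It exercises only the "skip executors pending removal" logic through hypothetical stand-ins (FakeWorkerOffer, selectActiveOffers), not the real CoarseGrainedSchedulerBackend, which would need a full scheduler setup:

// Hypothetical, self-contained sketch of a regression check for the filtering logic.
// FakeWorkerOffer and selectActiveOffers are stand-ins, not Spark classes.
case class FakeWorkerOffer(executorId: String, host: String, freeCores: Int)

object LostExecutorFilteringCheck {

  // Mirrors the filtering done in makeOffers(): executors pending removal
  // must not produce any offers.
  def selectActiveOffers(
      executors: Map[String, (String, Int)],   // id -> (host, freeCores)
      pendingToRemove: Set[String]): Seq[FakeWorkerOffer] = {
    executors
      .filter { case (id, _) => !pendingToRemove.contains(id) }
      .map { case (id, (host, cores)) => FakeWorkerOffer(id, host, cores) }
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val executors = Map("exec-1" -> ("host-a", 4), "exec-2" -> ("host-b", 4))
    val pending = Set("exec-2")  // exec-2 has been asked to die but is still registered

    val offers = selectActiveOffers(executors, pending)

    // No offer should ever name an executor that is pending removal.
    assert(offers.map(_.executorId).toSet == Set("exec-1"))
    println("ok: no offers were made for executors pending removal")
  }
}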

GraceH (Contributor, Author) commented Jul 21, 2015

@squito Thanks for the confirmation. If you think it is necessary to add unit tests for CoarseGrainedSchedulerBackend, a separate JIRA can be filed for that, since such tests would need to cover more than what this PR changes. What do you think?

// Filter out executors under killing
.filterKeys(!executorsPendingToRemove.contains(_))
.map { case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
A reviewer (Contributor) commented:

This is a little hard to read. Could you rewrite this as follows:

private def makeOffers() {
  val activeExecutors = executorDataMap.filterKeys(!executorsPendingToRemove.contains(_))
  val workerOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, ...)
  }
  launchTasks(scheduler.resourceOffers(workerOffers))
}

GraceH (Contributor, Author) replied:

Good point. Will re-organize the code snippet.

andrewor14 (Contributor) commented:

Hi @GraceH, this looks great. I left a few minor wording suggestions to improve readability.

As for unit tests, I do agree that we should have them, but it seems outside the scope of this fix since properly testing all of this logic would require a non-trivial refactor. In this case, I am more inclined to merge in this fix than to delay it indefinitely.

GraceH (Contributor, Author) commented Jul 21, 2015

Thanks @andrewor14. That makes sense. Will have a revised version with more readable lines.

SparkQA commented Jul 21, 2015

Test build #37953 has finished for PR 7528 at commit 6e2ed96.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito (Contributor) commented Jul 21, 2015

yeah, I don't love the idea of adding things w/out tests, but in this case I suppose it's best left for the future. lgtm pending the tests passing

GraceH (Contributor, Author) commented Jul 21, 2015

Thanks @squito.

SparkQA commented Jul 21, 2015

Test build #37954 has finished for PR 7528 at commit ecc1da6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 6592a60 on Jul 21, 2015.
asfgit pushed a commit that referenced this pull request on Jul 21, 2015:
Author: Grace <[email protected]>

Closes #7528 from GraceH/AssignToLostExecutor and squashes the following commits:

ecc1da6 [Grace] scala style fix
6e2ed96 [Grace] Re-word makeOffers by more readable lines
b5546ce [Grace] Add comments about the fix
30a9ad0 [Grace] Avoid assigning tasks to lost executors

(cherry picked from commit 6592a60)
Signed-off-by: Imran Rashid <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala