[SPARK-1397] Notify SparkListeners when stages fail or are cancelled. #309
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13707/
Jenkins, test this please
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Just a nomenclature note: The division between StageCompleted and StageFailed or StageEnded is not really consistent with Akka/Scala Futures, where Completed doesn't imply success, but rather completed futures are instances of either Success or Failure -- http://docs.scala-lang.org/overviews/core/futures.html. It wouldn't be the worst thing if we adopted our own semantics, but on the other hand, it would be less confusing to be consistent across Futures and Stage completion.
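For reference, a minimal sketch of the Futures convention being referred to here, where completion covers both outcomes and the outcome itself is a Success or a Failure:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val f: Future[Int] = Future { 84 / 2 }

// onComplete fires once the future is completed, regardless of outcome;
// the completed result is then either a Success or a Failure.
f.onComplete {
  case Success(value) => println(s"completed successfully with $value")
  case Failure(error) => println(s"completed with failure: $error")
}
```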
* Fails a job and all stages that are only used by that job, and cleans up relevant state.
*
* @param resultStage The result stage for the job, if known. Used to cleanup state for the job
*   slightly more efficiently than when not specified.
Not implemented
Which part of this did you think should say not implemented? resultStage is an optional parameter, so I don't think it makes sense to talk about it not being implemented.
Sorry, too brief. I just meant that resultStage isn't being used in failJobAndIndependentStages, so the commented optimization isn't implemented.
Ah, good catch, thanks for pointing that out!!
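To make the point above concrete, here is a hypothetical, simplified sketch of the optimization the doc comment promises but that failJobAndIndependentStages doesn't yet implement. The types and helper names are illustrative stand-ins, not Spark's actual DAGScheduler internals:

```scala
// Hypothetical, simplified stand-ins for DAGScheduler internals.
case class Stage(id: Int, parents: Seq[Stage] = Nil)
case class ActiveJob(jobId: Int, stages: Seq[Stage])

object FailJobSketch {
  // If the result stage is known, cleanup can start from it and walk its parents
  // rather than scanning every stage registered for the job.
  def stagesToFail(job: ActiveJob, resultStage: Option[Stage]): Seq[Stage] =
    resultStage match {
      case Some(stage) => stage +: stage.parents
      case None        => job.stages
    }
}
```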
Aside from the naming issue, the few nits I noted, and the in-progress elements, 305 + 309 is looking pretty good to me.
Cool, thanks @markhamstra! I'll make the changes you suggested. Given what you said about the naming, and also the pain of changing the name (both for me to do and for others who have written Spark Listeners), it sounds like it makes sense to keep SparkListenerStageCompleted as the event for both when a stage ends successfully and when it fails. Does this seem reasonable to you, @pwendell?
Sounds good to me.
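For illustration, a listener under the agreed approach might look roughly like the following. This is a sketch assuming the shape the listener API took in later Spark releases (a SparkListenerStageCompleted carrying a StageInfo with an Option-valued failureReason), so exact names may differ in the code under review:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Distinguish successful from failed/cancelled stages using the single
// stage-completed event, by inspecting the attached StageInfo.
class StageOutcomeListener extends SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    info.failureReason match {
      case Some(reason) => println(s"Stage ${info.stageId} failed or was cancelled: $reason")
      case None         => println(s"Stage ${info.stageId} completed successfully")
    }
  }
}
```

Such a listener would be registered via SparkContext.addSparkListener; the point is that a single callback sees both successful and unsuccessful stage endings.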
Merged build triggered.
Merged build started.
OK, this is now ready -- fixed the things you commented on @markhamstra, rebased on master (so this no longer includes PR 305, which has been merged). Will merge later today unless anyone has any further comments! Thanks for the review!
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13902/
Jenkins, retest this please
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
@kayousterhout merging this per our offline discussion.
Author: Kay Ousterhout <[email protected]>

Closes apache#309 from kayousterhout/stage_cancellation and squashes the following commits:

5533ecd [Kay Ousterhout] Fixes in response to Mark's review
320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
[I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, #305. You can look at just the second commit: https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac to see just the changes relevant to this PR]
Previously, when stages fail or get cancelled, the SparkListener is only notified
indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
stage that failed. This worked before job cancellation, because jobs would only fail
due to a single stage failure. However, with job cancellation, multiple running stages
can fail when a job gets cancelled. Right now, this is not handled correctly, which
results in stages that get stuck in the “Running Stages” window in the UI even
though they’re dead.
This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
event, and uses this event to tell SparkListeners when stages fail in addition to when
they complete successfully. This change is NOT publicly backward compatible for two
reasons. First, it changes the SparkListener interface. We could alternately add a new event,
SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted. However,
this is less consistent with the listener events for tasks / jobs ending, and will result in some
code duplication for listeners (because failed and completed stages are handled in similar
ways). Note that I haven’t finished updating the JSON code to correctly handle the new event
because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).
It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed()
method to no longer include a stage that caused the failure. I think this change should definitely
stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
single stage.
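To sketch the trade-off described above, here are simplified, hypothetical event and listener shapes (not the actual Spark API) contrasting a single stage-ended event with separate completed/failed events:

```scala
// Hypothetical shapes only, for comparing the two designs discussed above.
sealed trait StageEvent
case class StageEnded(stageId: Int, failureReason: Option[String]) extends StageEvent

// Single-event design: one callback sees both outcomes.
trait EndedListener {
  def onStageEnded(event: StageEnded): Unit = {}
}

// Split-event alternative: separate callbacks for completed and failed stages
// tend to share most of their handling logic.
trait SplitListener {
  def onStageCompleted(stageId: Int): Unit = {}
  def onStageFailed(stageId: Int, reason: String): Unit = {}
}
```

With the split design, most listeners end up routing both callbacks to the same handling code, which is the duplication the description points to.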