
[SPARK-10666][SPARK-6880][CORE] Use properties from ActiveJob associated with a Stage #6291

Closed
wants to merge 10 commits into apache:master from markhamstra:SPARK-6880

Conversation

markhamstra
Contributor

This issue was addressed in #5494, but the fix in that PR, while safe in the sense that it will prevent the SparkContext from shutting down, misses the actual bug. The intent of submitMissingTasks should be understood as "submit the Tasks that are missing for the Stage, and run them as part of the ActiveJob identified by jobId". Because of a long-standing bug, the jobId parameter was never being used. Instead, we were trying to use the jobId with which the Stage was created -- which may no longer exist as an ActiveJob, hence the crash reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since it has existed since 1.0.

@kayousterhout @pankajarora12

@SparkQA

SparkQA commented May 20, 2015

Test build #33165 has finished for PR 6291 at commit 71ea2a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(JavaModelWrapper):
    • class SimpleFunctionRegistry(val conf: CatalystConf) extends FunctionRegistry
    • class GroupedData protected[sql](

mbautin pushed a commit to mbautin/spark that referenced this pull request May 21, 2015
mbautin added a commit to mbautin/spark that referenced this pull request May 21, 2015
@squito
Contributor

squito commented May 22, 2015

Can you add a test case? That seems especially important given that a previous fix didn't actually catch the bug. I don't understand what's going on well enough to know what that test should look like [actually, that's part of the reason I'd like to see a reproduction :)]. Maybe this is a case where a very narrow unit test doesn't really make sense, and instead we need something that just stresses the DAGScheduler with a bunch of jobs that have a high likelihood of triggering this, run many times. (Which might mean it needs to be run outside of the PR builder ...)

@markhamstra
Contributor Author

@squito I'm not sure what you want to test. The change is actually very straightforward. In essence, we started with what should have been obviously broken code:

getMissingTasks(aStage, theActiveJobIdForTheStage)
...
def getMissingTasks(stage, jobId) = {
  // ignore jobId
  ...
  val properties = jobIdToActiveJob(wrongAndSometimesNotThereJobId).properties
  ...
}

...to code that covers up the problem:

  val properties = jobIdToActiveJob.get(wrongAndSometimesNotThereJobId).map(_.properties).orNull

...to code that just uses theActiveJobIdForTheStage as it should have all along.
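In other words, a minimal sketch of the corrected lookup (schematic, with the helper names from the pseudocode above; not the exact DAGScheduler source):

def submitMissingTasks(stage: Stage, jobId: Int): Unit = {
  // Use the ActiveJob we were asked to run under (jobId), which is guaranteed to exist
  // at the call sites, instead of the job the stage happened to be created under.
  val properties = jobIdToActiveJob(jobId).properties
  // ... build the TaskSet for the missing partitions and submit it with these properties ...
}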

@squito
Contributor

squito commented May 26, 2015

@markhamstra oh I'm not saying that your change is bad or questionable at all. But I am wondering, what actually went wrong before this change? Are we sure this change fixes it? Can we protect against future regression? My point is that given that the previous attempted fix #5494 didn't solve it, and nobody has noticed the problem in all this time, it seems worth putting in a test which reproduces the exception without your change, and passes after your change.

The jira doesn't have enough info for me to suggest what that reproduction would be, but it seems like you understand it better than me.

@kayousterhout
Contributor

The change looks good but +1 to @squito's request for a test. I'm still a little confused about how we could get into a situation where the Stage's jobId doesn't exist anymore, and a test would help clarify that and make sure the bug doesn't resurface in the future.

Is there a bigger issue lurking here, that the jobId associated with a Stage object is not generally safe for use?

@markhamstra
Contributor Author

Ok, so what was happening is that a stage would get created as part of a particular job. When we'd getMissingTasks for that stage, we were always using the jobId under which the stage was created. That's the right thing to do as long as that job is an ActiveJob, but if that job completes and a subsequent job needs to recalculate the stage's results, trying to get the ActiveJob for the already completed job is going to fail.
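A rough illustration of how a stage can outlive the job it was created under (hypothetical user code, not taken from this PR or its test):

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))
val shuffled = pairs.reduceByKey(_ + _)   // introduces a shuffle dependency

shuffled.count()     // job 1: the ShuffleMapStage is created with job 1's id, runs, and job 1 completes
// ... some shuffle output is later lost, e.g. when an executor dies ...
shuffled.collect()   // job 2: reuses that same stage, whose stage.jobId still names the finished job 1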

We're trying to get an ActiveJob for the needed stage because we want to use that job's properties. In the sense that no exception will be thrown and the TaskSet will be submitted with default properties, it will work to use null properties when the lookup of the no-longer-ActiveJob fails, but that's not really what we want.

Before calling submitMissingTasks, we always get the ActiveJob under which we want those tasks to run, so we should just use that ActiveJob instead of ignoring it. In fact, we may also want to change this part of submitMissingTasks:

taskScheduler.submitTasks(
        new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))

What that says now is to use the jobId under which the stage was created as the priority. If the ActiveJob the stage is actually running under is no longer that at-stage-creation job, that gives the TaskSet a more urgent FIFO priority than it should have. To be consistent with using the properties from the ActiveJob, we should also be using jobId here instead of stage.jobId.
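That is, something along these lines (a sketch of the proposed change, not the final diff):

taskScheduler.submitTasks(
  new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), jobId, properties))
// jobId -- the ActiveJob we are actually running under -- replaces stage.jobId as the FIFO priority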

@kayousterhout
Contributor

Ok thanks for the explanation, and it sounds like writing a unit test for that case should be straightforward?

I was going to suggest that we just pass the properties into this method to sidestep the problem altogether, but I agree with your assessment that we should also change the submitTasks() call to use the newer jobId!

@squito
Contributor

squito commented May 26, 2015

Thanks Mark. If I understand correctly, the earlier PR did "fix" the NoSuchElementException as reported in the JIRA, so it's not like you can write a test case which hits an exception before your fix here. But the earlier PR didn't fix the real mistake in the code, so the remaining issue is things like wrong priority, jobGroup, etc.

@markhamstra
Contributor Author

@kayousterhout In looking through our other uses of stage.jobId in the DAGScheduler, I didn't see anything that jumped out as an obvious problem except for the already-mentioned submitTasks call. It's going too far to say that the jobId associated with a Stage object is not generally safe to use, but we do need to keep in mind that that jobId doesn't necessarily map to a job that is still active.

@kayousterhout
Contributor

@markhamstra I just submitted #6418 to try to improve the naming around this a bit to avoid these issues in the future.

@markhamstra
Contributor Author

Added the proposed change in submitTasks.

@SparkQA

SparkQA commented May 26, 2015

Test build #33538 has finished for PR 6291 at commit 5570e8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markhamstra force-pushed the SPARK-6880 branch 4 times, most recently from 26abe08 to c41f894 on May 27, 2015 16:42
@SparkQA

SparkQA commented May 27, 2015

Test build #33595 has finished for PR 6291 at commit 26abe08.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2015

Test build #33598 has finished for PR 6291 at commit c41f894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Should we aim to get this in for Spark 1.5? Looks like it has merge conflicts. @kayousterhout @squito, will you be able to sign off after this is brought up-to-date?

@markhamstra
Contributor Author

I just brought it up-to-date. Putting together a test or two is something I'm going to try to get to today or tomorrow.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39369 has finished for PR 6291 at commit af6b628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra
Contributor Author

ping("test added")

@squito
Contributor

squito commented Aug 4, 2015

thanks mark, I should be able to look at this later tonight

@markhamstra
Contributor Author

Yeah, don't bother quite yet -- the test isn't actually exercising the code path that it needs to. :(

@markhamstra
Contributor Author

Nothing actually new -- I still think this one is ready to go after a little review. I am noting, however, that the prior discussion missed one element that makes retaining the correct job properties a little more significant: the executionId in SQLExecution is also a local job property, so we really don't want to lose it unnecessarily. @marmbrus
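For context, a rough sketch of the mechanism involved, assuming the usual SQLExecution bookkeeping (the key "spark.sql.execution.id" is SQLExecution.EXECUTION_ID_KEY; the surrounding code here is illustrative, not the actual source):

// Assumes sc: SparkContext is in scope; mirrors what SQLExecution.withNewExecutionId does.
val executionId = 0L
sc.setLocalProperty("spark.sql.execution.id", executionId.toString)
try {
  sc.parallelize(1 to 10).count()   // this job carries the execution id in its properties
} finally {
  sc.setLocalProperty("spark.sql.execution.id", null)
}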


// remove job1 as an ActiveJob
cancel(jobId1)
sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)

Mostly for my own understanding -- I don't actually think this is necessary, right? At first I was thinking we need to put this in a bunch more places, since I know some other tests recently had flakiness fixed by adding it. But looking closer, this is only needed when we are checking something stored in our spark listener, not in the scheduler itself, right?

That said, if there is any doubt, better to leave it in place.

@squito
Contributor

squito commented Nov 6, 2015

sorry for the delayed review @markhamstra. just some minor comments, otherwise lgtm.

As for the SQL executionId, I assume this change is the right fix there as well -- it would be pretty weird if the executionId were supposed to retain the old value even after the original job had been cancelled? Or are you just saying this fix is more important than we thought? Of course, the executionId suffers from the same problem we've already noted with properties & priority: a taskSet which was already started is still "stuck" with the executionId it started with.

@marmbrus
Contributor

marmbrus commented Nov 6, 2015

@zsxwing can you look at the execution id stuff here?

@markhamstra
Contributor Author

@squito Yes, I'm pretty sure that this is the right fix for executionId as well. I'm just saying that previously we were talking about losing the job's properties as only really affecting scheduling priority and job description, so there wasn't much impact from using the prior safe-but-not-quite-correct fix. With Spark SQL also relying upon the job properties, it's a little more important to make sure they are correct.

@zsxwing
Member

zsxwing commented Nov 6, 2015

This fix looks good to me.

However, both SQL and Streaming use only SparkListenerJobStart.properties. So actually, they are not affected by this bug.
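A minimal sketch of that listener-side view (a hypothetical listener; assumes sc: SparkContext is in scope):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

class PropertiesAtJobStart extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // The properties here are captured once, at job start, from the submitting ActiveJob,
    // so a later re-run of a shared stage doesn't change what the listener already saw.
    val pool = Option(jobStart.properties).map(_.getProperty("spark.scheduler.pool"))
    println(s"job ${jobStart.jobId} started with pool=$pool")
  }
}

sc.addSparkListener(new PropertiesAtJobStart)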

@markhamstra
Contributor Author

Ok, how about the setting of JobGroup and the "spark.scheduler.pool" property in thriftserver/SparkExecuteStatementOperation.scala? Again, I don't see any reason why the fix would be any different; I'm just noting places where the bug can potentially have an effect.
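For reference, the general mechanism those settings travel through (a sketch using the public SparkContext API, not the thriftserver code itself; the group id, description, and pool name below are hypothetical):

// Both calls attach local job properties to whatever jobs this thread submits next.
sc.setJobGroup("hypothetical-statement-id", "hypothetical SQL statement", interruptOnCancel = true)
sc.setLocalProperty("spark.scheduler.pool", "hypothetical-fair-pool")
// If a shared stage were later re-run under a stale stage.jobId, these are exactly
// the values the DAGScheduler could fail to pick up.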

@markhamstra
Contributor Author

Once more, Imran.

@squito
Contributor

squito commented Nov 13, 2015

lgtm pending tests!

@SparkQA

SparkQA commented Nov 14, 2015

Test build #45892 has finished for PR 6291 at commit 2ff80ca.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 14, 2015

Test build #45934 has finished for PR 6291 at commit 2ff80ca.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45987 has finished for PR 6291 at commit 31ba5de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 0a5aef7 on Nov 25, 2015
asfgit pushed a commit that referenced this pull request Nov 25, 2015
[SPARK-10666][SPARK-6880][CORE] Use properties from ActiveJob associated with a Stage

Author: Mark Hamstra <[email protected]>
Author: Imran Rashid <[email protected]>

Closes #6291 from markhamstra/SPARK-6880.

(cherry picked from commit 0a5aef7)
Signed-off-by: Imran Rashid <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 25, 2015
@squito
Contributor

squito commented Nov 25, 2015

thanks, merged to master / 1.6 / 1.5
