
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator #8544

Closed
wants to merge 9 commits

Conversation

JoshRosen
Contributor

When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.

This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Another contributing factor is that we have many similarly-named identifiers with different semantics but the same data types (e.g. attemptNumber and taskAttemptId), and inconsistent variable naming makes them difficult to distinguish.

This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.
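
For readers less familiar with the coordinator, here is a minimal Scala sketch, not the actual Spark source; all names (CommitCoordinatorSketch, authorized, taskFailed) are illustrative. It shows how storing the authorization under one identifier but releasing it with another leaves the commit lock held forever:

import scala.collection.mutable

// Illustrative state: (stage, partition) -> attempt number that currently holds the commit lock.
object CommitCoordinatorSketch {
  private val authorized = mutable.Map[(Int, Int), Int]()

  // First attempt to ask wins the lock; later attempts are denied while it is held.
  def canCommit(stage: Int, partition: Int, attemptNumber: Int): Boolean = synchronized {
    authorized.getOrElseUpdate((stage, partition), attemptNumber) == attemptNumber
  }

  // Called when a task attempt fails; it should release the lock that attempt held.
  def taskFailed(stage: Int, partition: Int, attemptNumber: Int): Unit = synchronized {
    // Bug pattern: if the caller passes a globally-unique taskAttemptId here while canCommit
    // stored the per-task attemptNumber, the comparison below never matches, the stale entry
    // is never removed, and every retry of this partition is denied the commit.
    authorized.get((stage, partition)).foreach { holder =>
      if (holder == attemptNumber) authorized.remove((stage, partition))
    }
  }
}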

@@ -122,7 +121,8 @@ object SparkHadoopMapRedUtil extends Logging {

     if (shouldCoordinateWithDriver) {
       val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
-      val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, attemptId)
+      val taskAttemptNumber = TaskContext.get().attemptNumber()
+      val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber)
Contributor Author

This is the key change in this patch; these two lines are technically the minimum diff required to fix this bug. The rest of the changes are renaming / cleanup to make the units a bit clearer.
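
For context, here is a short illustration of the two units being conflated; both accessors exist on TaskContext, but this snippet is only valid inside a running task and is not part of the patch itself:

val ctx = org.apache.spark.TaskContext.get()
// Per-task retry counter: 0 for the first attempt, 1 for the first retry, and so on.
val attemptNumber: Int = ctx.attemptNumber()
// Globally-unique identifier of this task attempt within the SparkContext; a different unit entirely.
val taskAttemptId: Long = ctx.taskAttemptId()
// The coordinator stores attempt numbers, so it must also be queried with attempt numbers.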

@JoshRosen
Contributor Author

/cc @marmbrus, @yhuai, @pwendell

@SparkQA

SparkQA commented Sep 1, 2015

Test build #41850 has finished for PR 8544 at commit a09814b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@JoshRosen
Contributor Author

I ended up having to add some MiMa excludes. These only impact internal classes but they are RPCs that are sent over the wire. Would be good to get confirmation that this is okay from a compatibility POV.
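
For reference, exclusions like these normally go into project/MimaExcludes.scala. A hedged sketch of the general shape follows; the problem type and member name are assumptions for illustration, not the actual entries added in this patch:

import com.typesafe.tools.mima.core._

// Hypothetical exclusion: suppress a binary-compatibility error for an internal method
// whose signature changed (the real entries for this PR may use different problem types).
val hypotheticalExcludes = Seq(
  ProblemFilters.exclude[IncompatibleMethTypeProblem](
    "org.apache.spark.scheduler.OutputCommitCoordinator.canCommit")
)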

@SparkQA

SparkQA commented Sep 1, 2015

Test build #41873 has finished for PR 8544 at commit 0059c95.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@@ -29,7 +29,7 @@ import org.apache.spark.annotation.DeveloperApi
 class TaskInfo(
     val taskId: Long,
     val index: Int,
-    val attempt: Int,
+    val attempt: Int, // this is a task attempt number, not a globally-unique task attempt id
Contributor

should we just rename this and add a deprecated val for backward compatibility?

Contributor Author

Yeah, that looks safe to do. Will do this when I update.
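
A minimal sketch of the renaming strategy agreed on here; the deprecation version string and the trimmed-down constructor are placeholders, not the final code:

class TaskInfo(
    val taskId: Long,
    val index: Int,
    val attemptNumber: Int) {  // renamed from `attempt` to make the unit explicit

  // Deprecated alias so existing callers of `attempt` keep compiling.
  @deprecated("use attemptNumber", "1.6.0")
  def attempt: Int = attemptNumber
}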

@squito
Contributor

squito commented Sep 1, 2015

+1 on the renaming.

While you are touching this, should the jobId in the OutputCommitCoordinator & related code be renamed to stageId, since that is what it really is?

writer.setup(context.stageId, context.partitionId, taskAttemptId)

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 2, 2015

Test build #41900 has finished for PR 8544 at commit 0059c95.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(
    • class DCT(JavaTransformer, HasInputCol, HasOutputCol):
    • class SQLTransformer(JavaTransformer):

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42438 has finished for PR 8544 at commit 81b86a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@JoshRosen
Contributor Author

Still have to address one comment here; was just letting tests run to rule out merge conflicts.

@JoshRosen
Contributor Author

@squito, I took a look at the OutputCommitCoordinator itself and, as far as I can tell, it seems to be using stageIds properly. The one area where there might be some confusion is in TaskCommitDeniedException, but in that case I think our use of job ids in log messages might actually be accurate.

In either case, I think the basic fix in this patch is correct, and I would like to merge this version for inclusion in 1.5.1.

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42463 has finished for PR 8544 at commit 035d660.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42464 has finished for PR 8544 at commit 0d40f83.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42470 has finished for PR 8544 at commit 0d40f83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42496 has finished for PR 8544 at commit 4dcb78a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(

@vanzin
Contributor

vanzin commented Sep 15, 2015

retest this please

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42505 has finished for PR 8544 at commit 4dcb78a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

I've fixed the MiMa merge conflicts, so as soon as this latest test run passes MiMa I'm going to merge this.

asfgit closed this in 38700ea Sep 16, 2015
asfgit pushed a commit that referenced this pull request Sep 16, 2015
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator

Author: Josh Rosen <[email protected]>

Closes #8544 from JoshRosen/SPARK-10381.

(cherry picked from commit 38700ea)
Signed-off-by: Josh Rosen <[email protected]>
@SparkQA

SparkQA commented Sep 16, 2015

Test build #42515 has finished for PR 8544 at commit edbbf6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskCommitDenied(
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

JoshRosen deleted the SPARK-10381 branch September 16, 2015 23:29
JoshRosen added a commit to JoshRosen/spark that referenced this pull request Sep 17, 2015
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator

Author: Josh Rosen <[email protected]>

Closes apache#8544 from JoshRosen/SPARK-10381.

(cherry picked from commit 38700ea)
Signed-off-by: Josh Rosen <[email protected]>
JoshRosen added a commit to JoshRosen/spark that referenced this pull request Sep 17, 2015
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator

Author: Josh Rosen <[email protected]>

Closes apache#8544 from JoshRosen/SPARK-10381.

(cherry picked from commit 38700ea)
Signed-off-by: Josh Rosen <[email protected]>
@JoshRosen
Contributor Author

I've opened #8789 and #8790 to backport this fix to 1.4.x and 1.3.x.

asfgit pushed a commit that referenced this pull request Sep 21, 2015
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator (branch-1.4 backport)

This is a backport of #8544 to `branch-1.4` for inclusion in 1.4.2.

Author: Josh Rosen <[email protected]>

Closes #8789 from JoshRosen/SPARK-10381-1.4.
asfgit pushed a commit that referenced this pull request Sep 22, 2015
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator (branch-1.3 backport)

This is a backport of #8544 to `branch-1.3` for inclusion in 1.3.2.

Author: Josh Rosen <[email protected]>

Closes #8790 from JoshRosen/SPARK-10381-1.3.
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator

Author: Josh Rosen <[email protected]>

Closes apache#8544 from JoshRosen/SPARK-10381.

(cherry picked from commit 38700ea)
Signed-off-by: Josh Rosen <[email protected]>
(cherry picked from commit 2bbcbc6)