
[SPARK-23033][SS] Don't use task level retry for continuous processing #20225

Closed
wants to merge 15 commits into master from jose-torres/no-retry

Conversation

jose-torres
Contributor

What changes were proposed in this pull request?

Continuous processing tasks will fail on any attempt number greater than 0. ContinuousExecution will catch these failures and restart globally from the last recorded checkpoints.

How was this patch tested?

unit test
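
For context, here is a minimal, self-contained analogue of the behavior described above. It is a sketch, not the PR's code: it uses a plain RDD job and an `IllegalStateException` to illustrate how a task can detect a retried attempt via `TaskContext.attemptNumber()` and fail fast, which is the same kind of guard the PR adds to continuous processing tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object AttemptGuardExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("attempt-guard-sketch"))
    try {
      sc.parallelize(1 to 4, numSlices = 2).foreachPartition { _ =>
        val ctx = TaskContext.get()
        // Attempt 0 is the only attempt allowed to run. Any retried attempt fails fast,
        // mirroring the guard this PR adds; in continuous processing the whole query is
        // then restarted from the last recorded checkpoint instead of retrying the task.
        if (ctx.attemptNumber() != 0) {
          throw new IllegalStateException(
            s"Partition ${ctx.partitionId()} was retried (attempt ${ctx.attemptNumber()})")
        }
      }
    } finally {
      sc.stop()
    }
  }
}
```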

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85939 has finished for PR 20225 at commit 1bf613f.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

eventually(timeout(streamingTimeout)) { assert(taskId != -1) }
spark.sparkContext.killTaskAttempt(taskId)
},
Execute(waitForRateSourceTriggers(_, 4)),
Contributor

Can you explain the logic behind this test? What does this test do?

Contributor Author

It kills an arbitrary task, and checks that query execution continues onward unaffected.

I've added a check that the run ID has changed, confirming that the retry was indeed made at the ContinuousExecution level.
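
A sketch of how such a test can pick a task to kill (assuming the StreamTest harness and a `spark` session in scope; the listener wiring mirrors the fragments quoted above):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

var taskId: Long = -1
val listener = new SparkListener() {
  override def onTaskStart(start: SparkListenerTaskStart): Unit = {
    // Record the first task the continuous query starts so the test can kill it.
    taskId = start.taskInfo.taskId
  }
}
spark.sparkContext.addSparkListener(listener)

// ... start the continuous query, then:
eventually(timeout(streamingTimeout)) { assert(taskId != -1) }
spark.sparkContext.killTaskAttempt(taskId)
// The query should keep producing results; verifying that the run ID changed confirms
// the restart happened at the ContinuousExecution level rather than via task retry.
```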

@jose-torres
Contributor Author

The hang in test build #85939 was a test issue in ContinuousStressSuite, which I could reproduce locally by bumping up the rows per second.

When the rate of incoming data is too high, query execution still makes progress, but the answer-checking operation in the test is expensive, and the rate source kept running while it executed. Since the executors are local, overloading them means the answer check takes unreasonably long to finish.

I've fixed this by stopping the rate source before checking the answer in the stress tests.
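
A sketch of the resulting shape of the stress tests (action names come from the StreamTest DSL and this suite's helpers, as quoted in the diff below; the query `df`, the final check, and the row range are illustrative assumptions):

```scala
testStream(df, useV2Sink = true)(
  StartStream(Trigger.Continuous(100)),
  AwaitEpoch(0),
  Execute(waitForRateSourceTriggers(_, 201)),
  IncrementEpoch(),
  // Stop the rate source first, so the local executors aren't still being loaded
  // while the (expensive) answer check runs.
  StopStream,
  CheckAnswerRowsContains(scala.Range(0, 2500).map(Row(_))))
```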

@@ -219,6 +201,44 @@ class ContinuousSuite extends ContinuousSuiteBase {
StopStream)
}

test("kill task") {
Contributor

This test does not verify killing tasks :) It verifies "task failure stops the query" or "task failure restarts the query".

query.exception.get.getCause.getCause.getCause.isInstanceOf[ContinuousTaskRetryException])
})

spark.sparkContext.removeSparkListener(listener)
Contributor

Put this in a finally clause.
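
A sketch of the suggested structure (the surrounding test body is assumed), so the listener is removed even if an assertion throws:

```scala
spark.sparkContext.addSparkListener(listener)
try {
  // ... run the stream actions and assertions that may throw ...
} finally {
  spark.sparkContext.removeSparkListener(listener)
}
```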

vals.contains(i)
})
})
StopStream,
Contributor

What is this for?

Contributor Author

This is for the overloaded failure mode from my earlier comment, which this PR exposed: the stream is stopped here before the answer check so the test doesn't time out.

@@ -280,6 +294,7 @@ class ContinuousStressSuite extends ContinuousSuiteBase {
AwaitEpoch(0),
Execute(waitForRateSourceTriggers(_, 201)),
IncrementEpoch(),
StopStream,
Contributor

Why are these needed?

// Wait until a task is started, then kill its first attempt.
eventually(timeout(streamingTimeout)) { assert(taskId != -1) }
spark.sparkContext.killTaskAttempt(taskId)
eventually(timeout(streamingTimeout)) {
Contributor

Can this be checked with an `ExpectFailure` test? Better to test using the same harness that is used for microbatch, so that we are sure the failure behavior is the same.
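
A sketch of what such an `ExpectFailure`-based check could look like (assuming the shared StreamTest harness, with `df` and the trigger interval illustrative and `taskId` captured by a listener as in the fragments above):

```scala
testStream(df, useV2Sink = true)(
  StartStream(Trigger.Continuous(100)),
  Execute(waitForRateSourceTriggers(_, 2)),
  Execute { _ =>
    // Wait until a task is started, then kill its first attempt.
    eventually(timeout(streamingTimeout)) { assert(taskId != -1) }
    spark.sparkContext.killTaskAttempt(taskId)
  },
  ExpectFailure[SparkException] { e =>
    assert(e.getCause.isInstanceOf[ContinuousTaskRetryException])
  })
```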


import org.apache.spark.SparkException

class ContinuousTaskRetryException
Contributor

Add docs.
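
A possible documented form (the exact exception message is an assumption):

```scala
import org.apache.spark.SparkException

/**
 * An exception thrown when a continuous processing task runs with a nonzero attempt number.
 * Continuous processing does not support task-level retry; instead, the query is failed and
 * restarted as a whole from the last recorded offsets.
 */
class ContinuousTaskRetryException
  extends SparkException("Continuous execution does not support task retry", null)
```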

@@ -52,6 +52,10 @@ class ContinuousDataSourceRDD(
}

override def compute(split: Partition, context: TaskContext): Iterator[UnsafeRow] = {
if (context.attemptNumber() != 0) {
Contributor

Add comments on what this is.
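
Something along these lines (a sketch; the rest of the compute body is elided):

```scala
override def compute(split: Partition, context: TaskContext): Iterator[UnsafeRow] = {
  // If the attempt number isn't 0, this is a retried task attempt. Continuous processing
  // does not support task-level retry, so fail the task here; ContinuousExecution will
  // restart the query as a whole from the last recorded checkpoint instead.
  if (context.attemptNumber() != 0) {
    throw new ContinuousTaskRetryException()
  }
  // ... build and return the per-partition continuous reader iterator ...
}
```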

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85987 has finished for PR 20225 at commit 3b19fcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #85990 has finished for PR 20225 at commit ad2f206.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #85993 has finished for PR 20225 at commit 54d3a2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

The failures in the test runs above were also test issues. The new code changed the synchronization such that the row wasn't written when the test expected it; I verified manually that the failing test doesn't actually enter the attemptNumber != 0 branch.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86007 has started for PR 20225 at commit 49f1eb6.

@SparkQA

SparkQA commented Jan 12, 2018

Test build #85997 has finished for PR 20225 at commit cea2ddc.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

The most recent test build failure is from an earlier commit which I think is now obsolete. I think build #86007 is correct, but we should retest to confirm.

@jose-torres
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86020 has finished for PR 20225 at commit 49f1eb6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86046 has finished for PR 20225 at commit 49f1eb6.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Jan 17, 2018

Fix merge conflicts. And add [SS] to the title of this PR.

@jose-torres changed the title from "[SPARK-23033] Don't use task level retry for continuous processing" to "[SPARK-23033][SS] Don't use task level retry for continuous processing" on Jan 17, 2018
@SparkQA

SparkQA commented Jan 17, 2018

Test build #86283 has finished for PR 20225 at commit f97bc9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Jan 17, 2018

LGTM. Merging this to master and 2.3

asfgit pushed a commit that referenced this pull request Jan 17, 2018
## What changes were proposed in this pull request?

Continuous processing tasks will fail on any attempt number greater than 0. ContinuousExecution will catch these failures and restart globally from the last recorded checkpoints.
## How was this patch tested?
unit test

Author: Jose Torres <[email protected]>

Closes #20225 from jose-torres/no-retry.

(cherry picked from commit 86a8450)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in 86a8450 on Jan 17, 2018