[SPARK-20301][FLAKY-TEST] Fix Hadoop Shell.runCommand flakiness in Structured Streaming tests #17613

Closed
wants to merge 3 commits into from

Conversation

brkyvz
Contributor

@brkyvz brkyvz commented Apr 12, 2017

What changes were proposed in this pull request?

Some Structured Streaming tests show flakiness such as:

[info] - prune results by current_date, complete mode - 696 *** FAILED *** (10 seconds, 937 milliseconds)
[info]   Timed out while stopping and waiting for microbatchthread to terminate.: The code passed to failAfter did not complete within 10 seconds.

This happens when we wait for the stream to stop, but it never does. We interrupt the microBatchThread, but Hadoop's Shell.runCommand swallows the InterruptedException, so the exception never propagates upstream in the microBatchThread. The thread then keeps running and ends up blocking on the streamManualClock.
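
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of how a swallowed interrupt keeps a thread alive. It does not use Hadoop or Spark code: `swallowingCall`, `stopsWithin`, and the stop flag are hypothetical names made up for illustration, and the flag check is only one possible way to make the loop terminate, not necessarily what this patch does.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object SwallowedInterruptSketch {
  // Hypothetical stand-in for a library call that catches InterruptedException
  // and returns normally, so the caller never learns it was interrupted.
  def swallowingCall(): Unit =
    try Thread.sleep(20)
    catch { case _: InterruptedException => () } // interrupt is lost here

  // Starts a worker loop, interrupts it, and reports whether the worker
  // terminated within timeoutMs.
  def stopsWithin(timeoutMs: Long, checkStopFlag: Boolean): Boolean = {
    val stopRequested = new AtomicBoolean(false)
    val worker = new Thread(new Runnable {
      def run(): Unit = {
        // Without the flag check, only a propagated exception could end this loop.
        while (!(checkStopFlag && stopRequested.get())) {
          swallowingCall()
        }
      }
    })
    worker.setDaemon(true) // do not keep the JVM alive if the worker never stops
    worker.start()
    Thread.sleep(100)      // let the worker enter its loop
    stopRequested.set(true)
    worker.interrupt()
    worker.join(timeoutMs)
    !worker.isAlive
  }
}
```

Calling `SwallowedInterruptSketch.stopsWithin(500, checkStopFlag = false)` models the flaky case: the interrupt is delivered but discarded, so the loop keeps going. With `checkStopFlag = true` the loop consults state that the swallowed exception cannot erase, so it exits on its next iteration.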

How was this patch tested?

Ran the flaky tests a thousand times locally and on Jenkins.

true
}
// Report trigger as finished and construct progress object.
finishTrigger(dataAvailable)
Contributor

why did you move this out of the reportTimeTaken { ... }?

Contributor Author

I don't think I moved it out. Are the diff and whitespace confusing?

@@ -277,6 +277,11 @@ trait StreamTest extends QueryTest with SharedSQLContext with Timeouts {

  def threadState =
    if (currentStream != null && currentStream.microBatchThread.isAlive) "alive" else "dead"
  def threadStackTrace = if (currentStream != null && currentStream.microBatchThread.isAlive) {
Contributor

+1 on keeping this.

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75720 has finished for PR 17613 at commit c060e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75726 has started for PR 17613 at commit 4d6e3cb.

@brkyvz brkyvz changed the title from "[SPARK-20301][FLAKY-TEST][DO NOT MERGE] Fix Hadoop Shell.runCommand flakiness in Structured Streaming tests" to "[SPARK-20301][FLAKY-TEST] Fix Hadoop Shell.runCommand flakiness in Structured Streaming tests" on Apr 12, 2017
@brkyvz
Contributor Author

brkyvz commented Apr 12, 2017

retest this please

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75733 has finished for PR 17613 at commit 4d6e3cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Apr 12, 2017

LGTM. Merging to master

@asfgit asfgit closed this in 924c424 Apr 12, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…ructured Streaming tests

## What changes were proposed in this pull request?

Some Structured Streaming tests show flakiness such as:
```
[info] - prune results by current_date, complete mode - 696 *** FAILED *** (10 seconds, 937 milliseconds)
[info]   Timed out while stopping and waiting for microbatchthread to terminate.: The code passed to failAfter did not complete within 10 seconds.
```

This happens when we wait for the stream to stop, but it doesn't. The reason it doesn't stop is that we interrupt the microBatchThread, but Hadoop's `Shell.runCommand` swallows the interrupt exception, and the exception is not propagated upstream to the microBatchThread. Then this thread continues to run, only to start blocking on the `streamManualClock`.

## How was this patch tested?

Thousand retries locally and [Jenkins](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75720/testReport) of the flaky tests

Author: Burak Yavuz <[email protected]>

Closes apache#17613 from brkyvz/flaky-stream-agg.
@brkyvz brkyvz deleted the flaky-stream-agg branch February 3, 2019 20:54