Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL][WIP] Refined Thrift server test suite #2214

Closed
wants to merge 8 commits into from

Conversation

liancheng
Copy link
Contributor

NOTE

This PR is not ready to be merged yet due to some issue only reproducible with Jenkins build triggered by GitHub.

This PR fixes two issues of HiveThriftServer2Suite and brings 1 enhancement:

  1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
  2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
  3. When the forked server process ends prematurely (e.g., fails to start), the serverRunning promise is completed with a failure, preventing the test code to keep waiting until timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

@liancheng
Copy link
Contributor Author

test this please

@SparkQA
Copy link

SparkQA commented Aug 30, 2014

QA tests have started for PR 2214 at commit 983d030.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 30, 2014

QA tests have finished for PR 2214 at commit 983d030.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

marmbrus commented Sep 3, 2014

ok to test

@liancheng
Copy link
Contributor Author

test this please

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have started for PR 2214 at commit 94a83ba.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor Author

The test failures only occur when running on Jenkins, I couldn't reproduce it either locally or on Jenkins server node. May fire some debugging commits later to investigate this issue.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2214 at commit 94a83ba.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor Author

retest this please

@liancheng
Copy link
Contributor Author

ok to test

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have started for PR 2214 at commit 1931e96.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2214 at commit 1931e96.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId, maxMem: Long)
    • case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)
    • case class SparkListenerApplicationStart(appName: String, appId: Option[String], time: Long,

@liancheng
Copy link
Contributor Author

Hmm, a summary about the failure pattern of HiveThriftServer2Suite:

  • Running locally with
    • hive-thriftserver/test-only *.*Suite

      PASS

    • test

      PASS

  • SSH to Jenkins node, and then
    • run with sudo and hive-thriftserver/test-only *.*Suite

      PASS

    • run with sudo and test

      PASS

  • Jenkins build triggered by GitHub with
    • modified dev/run-tests script that only hive-thriftserver/test-only *.*Suite

      PASS

    • normal dev/run-tests script

      FAIL

      ... even if HiveThriftServer2 is the first test suite that got executed. sbin/start-thriftserver.sh script is called, since we can see output of compute-classpath.sh from the log, but it seems that the JVM process fails to start.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have started for PR 2214 at commit 1185d79.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 4, 2014

QA tests have finished for PR 2214 at commit 1185d79.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng changed the title [SQL] Refined Thrift server test suite [SQL][WIP] Refined Thrift server test suite Sep 4, 2014
@SparkQA
Copy link

SparkQA commented Sep 4, 2014

QA tests have started for PR 2214 at commit c537e37.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor Author

add to whitelist

@liancheng
Copy link
Contributor Author

ok to test

@SparkQA
Copy link

SparkQA commented Sep 4, 2014

Tests timed out after a configured wait of 120m.

@SparkQA
Copy link

SparkQA commented Sep 5, 2014

QA tests have started for PR 2214 at commit a1ad308.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 6, 2014

Tests timed out after a configured wait of 120m.

@marmbrus
Copy link
Contributor

marmbrus commented Sep 9, 2014

test this please

@marmbrus
Copy link
Contributor

marmbrus commented Sep 9, 2014

@liancheng are we still debugging issues here? or just waiting for it to pass?

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have started for PR 2214 at commit a1ad308.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor Author

@marmbrus not ready yet, was not able to debug it since Jenkins was quite crazy these days. I'll remove the WIP tag once it's ready.

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have finished for PR 2214 at commit a1ad308.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 2214 at commit a1ad308.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 2214 at commit 23d96f1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2214 at commit a1ad308.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2214 at commit 23d96f1.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have started for PR 2214 at commit 23d96f1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2214 at commit 23d96f1.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor Author

#2675 supersedes this one, closing.

@liancheng liancheng closed this Oct 8, 2014
@liancheng liancheng deleted the refine-thrift-test branch October 8, 2014 01:35
@liancheng liancheng deleted the refine-thrift-test branch October 8, 2014 01:35
asfgit pushed a commit that referenced this pull request Oct 13, 2014
As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR #2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues.

In this PR, instead of watching `start-thriftserver.sh` output, the test code start a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.

As for the root cause of the mysterious Jenkins build failure. Please refer to [this comment](#2675 (comment)) below for details.

----

(Copied from PR description of #2214)

This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:

1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code to keep waiting until timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

Author: Cheng Lian <[email protected]>
Author: wangfei <[email protected]>

Closes #2675 from liancheng/fix-thriftserver-tests and squashes the following commits:

1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
7805c33 [wangfei]  reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
d116405 [wangfei] make sure that log4j level is INFO
ee92a82 [Cheng Lian] Relaxes timeout
7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver
liancheng added a commit to liancheng/spark that referenced this pull request Nov 9, 2014
As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR apache#2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues.

In this PR, instead of watching `start-thriftserver.sh` output, the test code start a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.

As for the root cause of the mysterious Jenkins build failure. Please refer to [this comment](apache#2675 (comment)) below for details.

----

(Copied from PR description of apache#2214)

This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:

1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code to keep waiting until timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

Author: Cheng Lian <[email protected]>
Author: wangfei <[email protected]>

Closes apache#2675 from liancheng/fix-thriftserver-tests and squashes the following commits:

1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
7805c33 [wangfei]  reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
d116405 [wangfei] make sure that log4j level is INFO
ee92a82 [Cheng Lian] Relaxes timeout
7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver

Conflicts:
	sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants