Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gradle check retry #2638

Merged
merged 5 commits into from
Mar 30, 2022

Conversation

kotwanikunal
Copy link
Member

Description

  • Adds a retry mechanism for all test classes under the root project
  • This should help with gradle check failures

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kotwanikunal kotwanikunal requested a review from a team as a code owner March 29, 2022 06:43
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 811ee66
Log 3841

Reports 3841

build.gradle Outdated
apply plugin: "org.gradle.test-retry"
tasks.withType(Test).configureEach {
retry {
if (isCiServer) {
Copy link
Collaborator

@reta reta Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably just not configure the task if it is not CI:

if (isCiServer) {
  tasks.withType(Test).configureEach {
    retry {
     ....
    }
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would highly recommend to set maxFailures to some reasonable value, 5 for example:

     * The maximum number of test failures that are allowed before retrying is disabled.

Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start by retrying once (maxRetries = 1), which should weed out some of the worst offenders and point us to the failures we should focus on fixing. Otherwise we'll start relying on retries too much. WDYT?

I would make maxFailures a bit larger to really capture broken code vs. multiple flukes, like maybe 10, but don't feel strongly about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would really benefit from something like gradle/test-retry-gradle-plugin#29 for our use case to catch the worst offenders.
I agree with giving it a kick start with 1-2 retries and stopping on multiple test failures. From what I have noticed, it usually takes 2 retries recently to get it going. But let's start small enough to give a boost and not over rely.

I'll update the PR with { maxFailures = 10, maxRetries = 1 }.

@dblock
Copy link
Member

dblock commented Mar 29, 2022

The fact that gradle check succeeded here from the first try tells me that we're on the right track :)

Signed-off-by: Kunal Kotwani <[email protected]>
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 3dcf356
Log 3865

Reports 3865

@kotwanikunal
Copy link
Member Author

kotwanikunal commented Mar 29, 2022

> Task :client:rest-high-level:asyncIntegTest

REPRODUCE WITH: ./gradlew ':client:rest-high-level:asyncIntegTest' --tests "org.opensearch.client.TasksIT.testGetValidTask" -Dtests.seed=91F1364AFC43A70 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-ME -Dtests.timezone=Pacific/Enderbury -Druntime.java=17

org.opensearch.client.TasksIT > testGetValidTask FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([91F1364AFC43A70:9573CA83C6999BB]:0)
        at org.junit.Assert.fail(Assert.java:87)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertTrue(Assert.java:53)
        at org.opensearch.client.OpenSearchRestHighLevelClientTestCase.lambda$checkTaskCompletionStatus$2(OpenSearchRestHighLevelClientTestCase.java:361)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1060)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1033)
        at org.opensearch.client.TasksIT.testGetValidTask(TasksIT.java:123)

Another random failure. Runs fine on my devbox -

> Task :client:rest-high-level:asyncIntegTest
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/kkotwani/workspace/OpenSearch/test/framework/build/distributions/framework-2.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/kkotwani/.gradle/wrapper/dists/gradle-7.3.3-all/4295vidhdd9hd3gbjyw1xqxpo/gradle-7.3.3/lib/plugins/gradle-testing-base-7.3.3.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 34s
181 actionable tasks: 11 executed, 170 up-to-date

@kotwanikunal
Copy link
Member Author

start gradle check

@dblock
Copy link
Member

dblock commented Mar 29, 2022

@kotwanikunal Did it retry org.opensearch.client.TasksIT.testGetValidTask or asyncIntegTest? Where do you see it in the log?

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

@kotwanikunal @dblock I believe the asyncIntegTest may not fall into Test task type:

tasks.withType(Test).configureEach {
    ...
}

We do some tricks here as well https://github.com/opensearch-project/OpenSearch/blob/main/gradle/code-coverage.gradle#L47

@kotwanikunal
Copy link
Member Author

@kotwanikunal Did it retry org.opensearch.client.TasksIT.testGetValidTask or asyncIntegTest? Where do you see it in the log?

It only retries the failing tests. I verified that on my local machine with different scenarios including integTest, internalClusterTest

In the individual project logs, you usually see x tests completed, n tests failed with the html report where n counts all the retries.

I will try force failing an asyncIntegTest on my devbox just to be sure.

@kotwanikunal
Copy link
Member Author

@kotwanikunal @dblock I believe the asyncIntegTest may not fall into Test task type:

tasks.withType(Test).configureEach {
    ...
}

We do some tricks here as well https://github.com/opensearch-project/OpenSearch/blob/main/gradle/code-coverage.gradle#L47

It does infact cover the asyncIntegTest type.

I ran ./gradlew ':client:rest-high-level:asyncIntegTest' --tests "org.opensearch.client.TasksIT.testGetValidTask" -Dtests.seed=91F1364AFC43A70 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-ME -Dtests.timezone=Pacific/Enderbury -Druntime.java=17 on my devbox by adding in a failing assert.

It looks like it retries the failing asyncIntegTest type -

image

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! If maxRetries = 1 is too noisy, I am ok with 2 or 3.

@tlfeng
Copy link
Collaborator

tlfeng commented Mar 29, 2022

@tlfeng
Copy link
Collaborator

tlfeng commented Mar 29, 2022

What wired here is I didn't see the retry occurs in the above failed gradle check, in the "Reports 3865"
image

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 3dcf356
Log 3868

Reports 3868

@dblock
Copy link
Member

dblock commented Mar 29, 2022

What wired here is I didn't see the retry occurs in the above failed gradle check, in the "report 3865" image

Is boolean isCiServer = System.getenv().containsKey("CI") failing here? Maybe just put retry all the time? It's useful for dev too.

@kotwanikunal
Copy link
Member Author

What wired here is I didn't see the retry occurs in the above failed gradle check, in the "report 3865" image

Is boolean isCiServer = System.getenv().containsKey("CI") failing here? Maybe just put retry all the time? It's useful for dev too.

Looks like it. Since it's a single retry, let me get rid of the CI condition. I can iteratively improve on it if we decide to have different configs for dev/CI.

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

@kotwanikunal use BuildParams instead?

if (BuildParams.isCi() == true) {
}

@kotwanikunal
Copy link
Member Author

@kotwanikunal use BuildParams instead?

if (BuildParams.isCi() == true) {
}

@reta How about adding in a single retry for dev and CI for now?

@dblock
Copy link
Member

dblock commented Mar 29, 2022

@kotwanikunal use BuildParams instead?

if (BuildParams.isCi() == true) {
}

@reta How about adding in a single retry for dev and CI for now?

Let's vote and if we have a deadlock, we'll just try again? :) I'm good with anything.

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

Let's vote and if we have a deadlock, we'll just try again? :) I'm good with anything.

I am fine with that ... for now

@mch2
Copy link
Member

mch2 commented Mar 29, 2022

How do we ensure we aren't swallowing failures with reproducible seeds? Does it retry with the same seed?

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

How do we ensure we aren't swallowing failures with reproducible seeds? Does it retry with the same seed?

I am not 100% sure but I would doubt the seed is preserved

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 37e398d
Log 3874

Reports 3874

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

Aha, this one seems to retried

image

And same seed was used

testHAProxyModeConnectionWorks
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([108B309DF15FC962:17FDE556CAD9BF3F]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([108B309DF15FC962:17FDE556CAD9BF3F]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.opensearch.cluster.remote.test.RemoteClustersIT.testHAProxyModeConnectionWorks(RemoteClustersIT.java:125)
	at java.base/jdk.internal.

@mch2 ^^

@dblock seems like 1 retry is not enough ... or we need to retry the while suite?

@kotwanikunal
Copy link
Member Author

Open issue for this test: #1703

@kotwanikunal
Copy link
Member Author

Aha, this one seems to retried

image

And same seed was used

testHAProxyModeConnectionWorks
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([108B309DF15FC962:17FDE556CAD9BF3F]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([108B309DF15FC962:17FDE556CAD9BF3F]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.opensearch.cluster.remote.test.RemoteClustersIT.testHAProxyModeConnectionWorks(RemoteClustersIT.java:125)
	at java.base/jdk.internal.

@mch2 ^^

@dblock seems like 1 retry is not enough ... or we need to retry the while suite?

Retrying the whole suite will make it difficult to pinpoint flakiness to particular tests. This helps us get to the most frequent offenders IMHO.
But I am all ears about what y'all have to say 🙂

@reta
Copy link
Collaborator

reta commented Mar 29, 2022

Retrying the whole suite will make it difficult to pinpoint flakiness to particular tests. This helps us get to the most frequent offenders IMHO.

I agree, I don't like that at all, thinking loudly - if cluster ended up in "bad" ( == not expected) state at startup, the test may never succeed ... but we should consider that to be separate problem I guess

@dblock
Copy link
Member

dblock commented Mar 29, 2022

Do you want to increase the number of retries @kotwanikunal?

Signed-off-by: Kunal Kotwani <[email protected]>
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 7e141ba
Log 3880

Reports 3880

@kotwanikunal kotwanikunal merged commit 65cc56e into opensearch-project:main Mar 30, 2022
@kotwanikunal kotwanikunal deleted the gradle-check-retry branch March 30, 2022 04:03
@dblock dblock added backport 2.x Backport to 2.x branch backport 2.0 Backport to 2.0 branch backport 1.x labels Mar 30, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 30, 2022
* Add retry plugin support for Test implementations

Signed-off-by: Kunal Kotwani <[email protected]>

* Update test retry parameters

Signed-off-by: Kunal Kotwani <[email protected]>

* Remove CI environment check for test retries

Signed-off-by: Kunal Kotwani <[email protected]>

* Update retry count for tests

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 30, 2022
* Add retry plugin support for Test implementations

Signed-off-by: Kunal Kotwani <[email protected]>

* Update test retry parameters

Signed-off-by: Kunal Kotwani <[email protected]>

* Remove CI environment check for test retries

Signed-off-by: Kunal Kotwani <[email protected]>

* Update retry count for tests

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)
@opensearch-trigger-bot
Copy link
Contributor

The backport to 1.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.x 1.x
# Navigate to the new working tree
cd .worktrees/backport-1.x
# Create a new branch
git switch --create backport/backport-2638-to-1.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 65cc56e754e4a854963663ead5d06ea6e975d1eb
# Push it to GitHub
git push --set-upstream origin backport/backport-2638-to-1.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.x

Then, create a pull request where the base branch is 1.x and the compare/head branch is backport/backport-2638-to-1.x.

nknize pushed a commit that referenced this pull request Mar 30, 2022
Add retry plugin support for Test implementations

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)

Co-authored-by: Kunal Kotwani <[email protected]>
nknize pushed a commit that referenced this pull request Mar 30, 2022
Add retry plugin support for Test implementations

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)
@tlfeng
Copy link
Collaborator

tlfeng commented Mar 31, 2022

@kotwanikunal Would you backport the change into 1.x branch?

kotwanikunal added a commit to kotwanikunal/OpenSearch that referenced this pull request Apr 6, 2022
* Add retry plugin support for Test implementations

Signed-off-by: Kunal Kotwani <[email protected]>

* Update test retry parameters

Signed-off-by: Kunal Kotwani <[email protected]>

* Remove CI environment check for test retries

Signed-off-by: Kunal Kotwani <[email protected]>

* Update retry count for tests

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)
tlfeng pushed a commit that referenced this pull request Apr 6, 2022
* Adds a retry mechanism for all test classes under the root project
* This will help with gradle check failed by flaky tests

Signed-off-by: Kunal Kotwani <[email protected]>
(cherry picked from commit 65cc56e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.x backport 2.x Backport to 2.x branch backport 2.0 Backport to 2.0 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants