Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] RetryableActionTests testRetryableActionTimeout failing #76165

Closed
cbuescher opened this issue Aug 5, 2021 · 3 comments · Fixed by #95515
Closed

[CI] RetryableActionTests testRetryableActionTimeout failing #76165

cbuescher opened this issue Aug 5, 2021 · 3 comments · Fixed by #95515
Assignees
Labels
:Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@cbuescher
Copy link
Member

Build scan:
https://gradle-enterprise.elastic.co/s/bnq7vzq6jsmda/tests/:server:test/org.elasticsearch.action.support.RetryableActionTests/testRetryableActionTimeout

Reproduction line:
gradlew ':server:test' --tests "org.elasticsearch.action.support.RetryableActionTests.testRetryableActionTimeout" -Dtests.seed=30262AAAB4F0DFBB -Dtests.locale=he-IL -Dtests.timezone=Etc/UCT -Druntime.java=11

Applicable branches:
master

Reproduces locally?:
Yes

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.action.support.RetryableActionTests&tests.test=testRetryableActionTimeout

Failure excerpt:

java.lang.AssertionError: 
Expected: a value less than or equal to <3000L>
     but: <3761L> was greater than <3000L>

  at __randomizedtesting.SeedInfo.seed([30262AAAB4F0DFBB:6871BFE7697906F5]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.action.support.RetryableActionTests.testRetryableActionTimeout(RetryableActionTests.java:140)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

@cbuescher cbuescher added :Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI labels Aug 5, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Aug 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@pugnascotia
Copy link
Contributor

I've muted the test on master, 7.x and 7.14.

@stu-elastic
Copy link
Contributor

stu-elastic commented Sep 8, 2021

Stepping through this during core/infra fix it meeting, it seems related to #74940.

In RetryableActionTests.testRetryableActionTimeout, the random time is 1498.

It is retried twice, on RetryableAction.onFailure, the important values are:

// delayMillisBound = 1498
final long nextDelayMillisBound = calculateDelayBound(delayMillisBound);  // 2996
int range = Math.toIntExact((delayMillisBound + 1) / 2);   //  749
// Randomness.get().nextInt(range) = 121
//  121 + 1498 - 749 + 1 = 871
final long delayMillis = Randomness.get().nextInt(range) + delayMillisBound - range + 1L;  // 871

// delayMillisBound = 2996
final long nextDelayMillisBound = calculateDelayBound(delayMillisBound);  // 5992
int range = Math.toIntExact((delayMillisBound + 1) / 2);  // 1498
// Randomness.get().nextInt(range) = 1391
 // 1391 + 2996 - 1498 + 1 = 2890
final long delayMillis = Randomness.get().nextInt(range) + delayMillisBound - range + 1L;  // 2890

So the two delays calculated are: 2890 and 871 which adds up to the 3761 of the failure.

@stu-elastic stu-elastic removed their assignment Sep 8, 2021
Tim-Brooks added a commit that referenced this issue Apr 25, 2023
Currently the retry delays can grow significantly beyond the timeout in
the retryable action. This has surfaced itself in tests. This commit
improves the situation by only allowing a retry to be scheduled 20%
beyond the configured timeout. Additionally, it allows a max delay bound
parameter to be passed which will limit the max bound that can occur
during exponential backoffs.

Fixes #76165.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants