Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] TransportTasksActionTests testFailedTasksCount failing #94987

Closed
pxsalehi opened this issue Apr 3, 2023 · 7 comments · Fixed by #95081
Closed

[CI] TransportTasksActionTests testFailedTasksCount failing #94987

pxsalehi opened this issue Apr 3, 2023 · 7 comments · Fixed by #95081
Assignees
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@pxsalehi
Copy link
Member

pxsalehi commented Apr 3, 2023

Build scan:
https://gradle-enterprise.elastic.co/s/pqzbzh7gchn2o/tests/:server:test/org.elasticsearch.action.admin.cluster.node.tasks.TransportTasksActionTests/testFailedTasksCount

Reproduction line:

./gradlew ':server:test' --tests "org.elasticsearch.action.admin.cluster.node.tasks.TransportTasksActionTests.testFailedTasksCount" -Dtests.seed=B5302E0879C1D0D3 -Dtests.locale=es-CL -Dtests.timezone=America/La_Paz -Druntime.java=20

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.action.admin.cluster.node.tasks.TransportTasksActionTests&tests.test=testFailedTasksCount

Failure excerpt:

java.lang.AssertionError: eventually completed, but still unexpected:
still running tasks on node [node1]
2: [2][internal:transport/handshake] started at 1680524358899

Expected: <0>
     but: was <98>

  at __randomizedtesting.SeedInfo.seed([B5302E0879C1D0D3:D87A71A124048A38]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.action.admin.cluster.node.tasks.TransportTasksActionTests.testFailedTasksCount(TransportTasksActionTests.java:583)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:578)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1623)

@pxsalehi pxsalehi added :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >test-failure Triaged test failures from CI labels Apr 3, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 3, 2023
@pxsalehi pxsalehi self-assigned this Apr 5, 2023
@pxsalehi
Copy link
Member Author

pxsalehi commented Apr 6, 2023

I've spend some time looking into this trying to understand the issue considering it has a bit of a history. Fortunately, extra logs are enabled for this. The previous issue of having unexpected internal:transport/handshakes as reported here don't seem to be the cause anymore, and there are matching number of tasks being registered and unregistered. The only reason that the test fails is that right after wiring up the nodes together (since this test doesn't use an actual cluster), we expect task managers to be empty. While this does seem to be a reasonable check after connectNodes is finished, it seems rarely this doesn't hold. Unfortunately, i do not know why!

Is there a reason not to safely ignore that, considering this is a rather artificial setup of a cluster that this particular test has? Either by waiting for an empty task list (at least no handshake actions) before continuing with the test or using the task register/unregister listeners?

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Apr 6, 2023

Ohh I wonder if #94865 broke this. Does that match the timeline? With that change we will respond to the handshake and then unregister the task, so it's no longer safe to assume that the task is gone before as soon as we get a response.

@pxsalehi
Copy link
Member Author

pxsalehi commented Apr 6, 2023

ha! That looks like it. Timeline does match, and that explains why the connectNodes cannot be considered completely done after the call. I guess now we have to wait for empty task managers.

elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2023
Since #94865 task
unregistration  is not guaranteed to have happened upon receiving the
response, e.g. for a  `internal:transport/handshake` when connecting the
test nodes. Therefore,  we need to wait for ongoing tasks to finish.

closes #94987
@fcofdez
Copy link
Contributor

fcofdez commented Apr 17, 2023

Looks like this is still happening https://gradle-enterprise.elastic.co/s/s6vtua2zvk3cg

@fcofdez fcofdez reopened this Apr 17, 2023
@pxsalehi
Copy link
Member Author

That seems to be a failure of testRunningTasksCount. This fix is for testFailedTasksCount. @fcofdez maybe close this, and open a new issue?

@fcofdez
Copy link
Contributor

fcofdez commented Apr 18, 2023

I thought it was the same test 🤦. Sorry!

@fcofdez fcofdez closed this as completed Apr 18, 2023
arteam added a commit to arteam/elasticsearch that referenced this issue Apr 26, 2023
Due to elastic#94865, we now send an ack before the task gets removed, so we can see
a non-zero amount of tasks in the task manager for a short amount of time.

We can just busy wait until the task gets removed.

See elastic#95494, elastic#94987
arteam added a commit that referenced this issue May 9, 2023
* Fix TasksIT.testTasksUnblocking

Due to #94865, we now send an ack before the task gets removed, so we can see
a non-zero amount of tasks in the task manager for a short amount of time.

We can just busy wait until the task gets removed.

See #95494, #94987
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants