[CI] RemoteClusterServiceTests/TransportSearchActionTests#testCollectSearchShards times out waiting in latch #33852

Closed
cbuescher opened this issue Sep 19, 2018 · 9 comments · Fixed by #33853
Labels: :Search/Search (Search-related issues that do not fall into other categories), >test-failure (Triaged test failures from CI)

Comments

@cbuescher (Member) commented Sep 19, 2018

Just had this failure in a PR build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/311/console

But I found several instances over time in the CI mails, e.g.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+release-tests/922/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6789/console

and a few more from back in February.

It seems to be a rare timing issue, so it doesn't reproduce for me locally:

./gradlew :server:test -Dtests.seed=F3E20416D514B7F4 -Dtests.class=org.elasticsearch.transport.RemoteClusterServiceTests -Dtests.method="testCollectSearchShards" -Dtests.security.manager=true -Dtests.locale=es-MX -Dtests.timezone=HST -Dcompiler.java=10 -Druntime.java=8

The failure shows the test checking a latch after waiting for one second; maybe that is not long enough and the wait should be increased.

11:20:48    > Throwable #1: java.lang.AssertionError
11:20:48    > 	at __randomizedtesting.SeedInfo.seed([F3E20416D514B7F4:BDFDA55350DA8054]:0)
11:20:48    > 	at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:581)
11:20:48    > 	at java.lang.Thread.run(Thread.java:748)
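
For context, here is a minimal, self-contained sketch of the kind of latch-based wait that fails here, using only JDK classes. The class and variable names, the count of three clusters, and the stand-in thread are assumptions for illustration, not the actual test code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        // One count per simulated remote cluster that still has to answer.
        CountDownLatch latch = new CountDownLatch(3);

        // In the real test, mock transport services count the latch down from
        // their response listeners; here a background thread stands in for them.
        new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                latch.countDown();
            }
        }).start();

        // The test then asserts that all answers arrived within the timeout.
        // On a slow CI worker, one second can occasionally be too short, which
        // is consistent with the bare AssertionError in the trace above.
        if (!latch.await(1, TimeUnit.SECONDS)) {
            throw new AssertionError();
        }
    }
}
```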
@cbuescher added the :Search/Search, >test-failure, v7.0.0 and v6.5.0 labels on Sep 19, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes elastic#33852
@cbuescher self-assigned this on Sep 19, 2018
cbuescher pushed a commit that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
cbuescher pushed a commit that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
danielmitterdorfer pushed a commit that referenced this issue Sep 28, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
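
The fix in the commits above amounts to giving the latch more time before asserting. As a hedged illustration only, the helper name and the exact new timeout are assumptions (later comments mention the wait eventually went from 1 to 5 seconds), not the actual patch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitLatchSketch {

    // Hypothetical "await latch or fail" assertion: waits up to the given
    // timeout and fails loudly if the latch did not reach zero in time.
    static void assertLatchCountedDown(CountDownLatch latch, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!latch.await(timeout, unit)) {
            throw new AssertionError("expected latch to be counted down after "
                    + timeout + " " + unit + ", but it was not");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        new Thread(latch::countDown).start();

        // The change boils down to raising the timeout passed here: the test
        // previously waited one second; the commits allow a longer wait.
        assertLatchCountedDown(latch, 5, TimeUnit.SECONDS);
    }
}
```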
@andyb-elastic (Contributor) commented Dec 3, 2018

I believe this is the same issue on master today; it doesn't reproduce.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/96/console

build-unix-96.txt.zip

  1> [2018-12-03T02:32:03,086][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 4.90s J1 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([730C273FDEA107A9:3D13867A5B6F3009]:0)
   >    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:855)
   >    at java.lang.Thread.run(Thread.java:748)
REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=730C273FDEA107A9 -Dtests.class=org.elasticsearch.transport.RemoteClusterServiceTests -Dtests.method="testCollectSearchShards" -Dtests.security.manager=true -Dtests.locale=he-IL -Dtests.timezone=SystemV/YST9 -Dcompiler.java=11 -Druntime.java=8

@andyb-elastic reopened this on Dec 3, 2018
@spinscale (Contributor) commented Dec 18, 2018

Another failure waiting on the latch: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/838/console

  1> [2018-12-18T14:57:45,065][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 6.53s J2 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([B51888CD81E2A99A:FB072988042C9E3A]:0)
   >    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:857)
   >    at java.lang.Thread.run(Thread.java:748)

@alpar-t (Contributor) commented Jan 9, 2019

@alpar-t reopened this on Jan 9, 2019

@jkakavas (Member)

@javanna assigned javanna and unassigned cbuescher on Feb 1, 2019
javanna added a commit to javanna/elasticsearch that referenced this issue Feb 1, 2019
We have recently gone from 1 second to 5, and we still had some failures.
Increasing this timeout considerably one last time, which should help us understand whether the failures are caused by test infra slowness or by a bug.

Relates to elastic#33852
javanna added a commit to javanna/elasticsearch that referenced this issue Feb 1, 2019
We have recently gone from 1 second to 5, and we still had some failures. Note that testCollectSearchShards was recently moved and reworked from RemoteClusterServiceTests to TransportSearchActionTests.
Increasing this timeout considerably one last time, which should help us understand whether the failures are caused by test infra slowness or by a bug.

Relates to elastic#33852
@jasontedor added the v8.0.0 label and removed the v7.0.0 label on Feb 6, 2019
@javanna removed the v6.5.0 label on Feb 8, 2019
@javanna removed the v8.0.0 label on Feb 8, 2019
@javanna (Member) commented Feb 14, 2019

I have done quite some digging on this one. I was not able to reproduce the issue even when running the test in a loop for hours. The last failure is from January 23rd; at the time, this test was failing around once or twice per day. We had previously increased the wait for the responses a couple of times. The failure that I analyzed happens before we even simulate network failures: at the second collectSearchShards call, we send a preference of index_not_found, which causes an exception to be returned rather than a response, in order to simulate the default behaviour when an exception is received from all the remote clusters. The latch times out after 5 seconds, meaning that we don't get a response from all the clusters within 5 seconds, which seems unlikely given that we are using mock transport services. I could not find any place where we might fail to notify the listener. I did find something interesting in the logs though:

  1> [2019-01-21T12:11:25,820][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] before test
  1> [2019-01-21T12:11:26,324][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10600}, bound_addresses {[::1]:10600}, {127.0.0.1:10600}
  1> [2019-01-21T12:11:26,328][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10601}, bound_addresses {[::1]:10601}, {127.0.0.1:10601}
  1> [2019-01-21T12:11:26,336][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10602}, bound_addresses {[::1]:10602}, {127.0.0.1:10602}
  1> [2019-01-21T12:11:26,340][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10603}, bound_addresses {[::1]:10603}, {127.0.0.1:10603}
  1> [2019-01-21T12:11:26,355][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10604}, bound_addresses {[::1]:10604}, {127.0.0.1:10604}
  1> [2019-01-21T12:11:26,360][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10605}, bound_addresses {[::1]:10605}, {127.0.0.1:10605}
  1> [2019-01-21T12:11:26,364][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10606}, bound_addresses {[::1]:10606}, {127.0.0.1:10606}
  1> [2019-01-21T12:11:26,368][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10607}, bound_addresses {[::1]:10607}, {127.0.0.1:10607}
  1> [2019-01-21T12:11:26,373][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10608}, bound_addresses {[::1]:10608}, {127.0.0.1:10608}
  1> [2019-01-21T12:11:34,709][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [50]
  1> [2019-01-21T12:11:34,713][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [55]
  1> [2019-01-21T12:11:34,713][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [49]
  1> [2019-01-21T12:11:34,721][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 8.92s J3 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: expected latch to be counted down after 5s, but was not
   > Expected: is <true>
   >      but: was <false>
   > 	at __randomizedtesting.SeedInfo.seed([B954F1B646535EBB:F74B50F3C39D691B]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.awaitLatch(ElasticsearchAssertions.java:713)
   > 	at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:765)

The three `Transport response handler not found of id [***]` warnings appear about 8 seconds after the test started, which seems to indicate that the responses do eventually come back; increasing the wait to 10 seconds would likely have made the test succeed. I believe such warnings mean the responses came back late, yet quickly enough to reach the sender transport while it was being closed but before it was fully shut down. I wonder whether this has to do with infra issues combined with using 9 mock transport services in this particular run.
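
To make the pattern under discussion concrete, here is a rough, self-contained sketch of the collect-and-count-down logic: each remote cluster answers the shard-collection request with either a response or an exception, and either outcome must count the latch down, otherwise the five-second await times out. All names here are hypothetical; this is not the production TransportSearchAction or test code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CollectSearchShardsSketch {

    // Hypothetical per-cluster callback: both outcomes must release the latch.
    interface PerClusterListener {
        void onResponse(String clusterAlias, Object shardsResponse);
        void onFailure(String clusterAlias, Exception e);
    }

    public static void main(String[] args) throws InterruptedException {
        String[] remoteClusters = {"remote1", "remote2", "remote3"};
        Map<String, Object> responses = new ConcurrentHashMap<>();
        Map<String, Exception> failures = new ConcurrentHashMap<>();
        CountDownLatch latch = new CountDownLatch(remoteClusters.length);

        PerClusterListener listener = new PerClusterListener() {
            @Override
            public void onResponse(String clusterAlias, Object shardsResponse) {
                responses.put(clusterAlias, shardsResponse);
                latch.countDown();
            }

            @Override
            public void onFailure(String clusterAlias, Exception e) {
                failures.put(clusterAlias, e);
                latch.countDown();
            }
        };

        // Simulate the "index_not_found" preference: every remote cluster
        // answers with an exception instead of a shards response.
        for (final String cluster : remoteClusters) {
            new Thread(() -> listener.onFailure(cluster,
                    new RuntimeException("simulated index_not_found on " + cluster))).start();
        }

        // The test waits up to five seconds for every cluster to answer, one
        // way or the other; a single missed callback makes this await fail.
        if (!latch.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("expected latch to be counted down after 5s, but was not");
        }
        System.out.println("responses: " + responses.size() + ", failures: " + failures.size());
    }
}
```

This is only a sketch of the pattern; as noted below, the real code paths have since moved to TransportSearchAction and TransportSearchActionTests.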

I am not too happy about increasing the timeout once more, as I proposed in #38198 and #38199, at least not until we understand exactly what slows things down. We could try decreasing the number of remote clusters (hence nodes) that we connect to, but I am not sure that is the problem either. Also, this test and the corresponding tested code have been moved on master, 7.0 and 7.x to TransportSearchAction and TransportSearchActionTests, with few changes that I would not expect to have solved the issue. Furthermore, the test was left unchanged in 6.7 and has not failed since. That again points me towards infra issues that have since been solved and have not recurred in the last few weeks.

@tbrooks8 do you have any suggestions on changes to make here, or on what could have caused this situation?

@javanna changed the title from "[CI] RemoteClusterServiceTests.testCollectSearchShards times out waiting in latch" to "[CI] RemoteClusterServiceTests/TransportSearchActionTests#testCollectSearchShards times out waiting in latch" on Feb 19, 2019
@javanna (Member) commented Feb 19, 2019

Closing this one as it has not failed in a month or so. I think that infra changes have helped with this.

@javanna closed this as completed on Feb 19, 2019
@dnhatn (Member) commented Jun 9, 2019

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+intake/35/consoleText

org.elasticsearch.action.search.TransportSearchActionTests > testCollectSearchShards FAILED
    java.lang.IllegalStateException: failed to connect to remote clusters
        at __randomizedtesting.SeedInfo.seed([45078AC0A9FE43BB:B182B852C30741B]:0)
        at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:431)
        at org.elasticsearch.transport.TransportService.doStart(TransportService.java:241)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
        at org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards(TransportSearchActionTests.java:611)

        Caused by:
        java.util.concurrent.ExecutionException: ConnectTransportException[[][127.0.0.1:10311] connect_timeout[30s]]
            at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
            at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:239)
            at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:65)
            at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:425)
            ... 3 more

            Caused by:
            ConnectTransportException[[][127.0.0.1:10311] connect_timeout[30s]]
REPRODUCE WITH: ./gradlew :server:test --tests "org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards" -Dtests.seed=45078AC0A9FE43BB -Dtests.security.manager=true -Dtests.locale=sv-SE -Dtests.timezone=PRT -Dcompiler.java=12 -Druntime.java=8

@javanna (Member) commented Jun 11, 2019

Heya @dnhatn, at first glance this seems like a different failure? This is a connect timeout, while historically this test has failed with latch awaits timing out.
