[CI] RemoteClusterServiceTests/TransportSearchActionTests#testCollectSearchShards times out waiting in latch #33852

Closed
cbuescher opened this issue Sep 19, 2018 · 9 comments · Fixed by #33853
Labels: :Search/Search (Search-related issues that do not fall into other categories), >test-failure (Triaged test failures from CI)

Comments

@cbuescher (Member) commented Sep 19, 2018

Just had this failure in a PR build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/311/console

But I found several instances over time in the CI mails, e.g.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+release-tests/922/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6789/console

and a few more from back in February.

It seems to be a rare timing issue, so it doesn't reproduce for me locally:

./gradlew :server:test -Dtests.seed=F3E20416D514B7F4 -Dtests.class=org.elasticsearch.transport.RemoteClusterServiceTests -Dtests.method="testCollectSearchShards" -Dtests.security.manager=true -Dtests.locale=es-MX -Dtests.timezone=HST -Dcompiler.java=10 -Druntime.java=8

The failure shows the test checking a latch after waiting for one second; maybe that is not long enough and the wait should be increased.

11:20:48    > Throwable #1: java.lang.AssertionError
11:20:48    > 	at __randomizedtesting.SeedInfo.seed([F3E20416D514B7F4:BDFDA55350DA8054]:0)
11:20:48    > 	at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:581)
11:20:48    > 	at java.lang.Thread.run(Thread.java:748)
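
For context, here is a minimal, self-contained sketch of the kind of latch-based wait that fails here, using only JDK classes. The class and variable names, the count of three clusters, and the stand-in thread are assumptions for illustration, not the actual test code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        // One count per simulated remote cluster that still has to answer.
        CountDownLatch latch = new CountDownLatch(3);

        // In the real test, mock transport services count the latch down from
        // their response listeners; here a background thread stands in for them.
        new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                latch.countDown();
            }
        }).start();

        // The test then asserts that all answers arrived within the timeout.
        // On a slow CI worker, one second can occasionally be too short, which
        // is consistent with the bare AssertionError in the trace above.
        if (!latch.await(1, TimeUnit.SECONDS)) {
            throw new AssertionError();
        }
    }
}
```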
@cbuescher added the :Search/Search, >test-failure, v7.0.0 and v6.5.0 labels on Sep 19, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes elastic#33852
@cbuescher self-assigned this on Sep 19, 2018
cbuescher pushed a commit that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
cbuescher pushed a commit that referenced this issue Sep 19, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
danielmitterdorfer pushed a commit that referenced this issue Sep 28, 2018
This test occasionally fails in `testCollectSearchShards`, waiting for what seems
to be a search request to a remote cluster for one second. Given how rarely the
test fails here, I suspect one second is occasionally not quite enough, so we can
fix it by increasing the maximum wait time slightly.

Closes #33852
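
The fix in the commits above amounts to giving the latch more time before asserting. As a hedged illustration only, the helper name and the exact new timeout are assumptions (later comments mention the wait eventually went from 1 to 5 seconds), not the actual patch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitLatchSketch {

    // Hypothetical "await latch or fail" assertion: waits up to the given
    // timeout and fails loudly if the latch did not reach zero in time.
    static void assertLatchCountedDown(CountDownLatch latch, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!latch.await(timeout, unit)) {
            throw new AssertionError("expected latch to be counted down after "
                    + timeout + " " + unit + ", but it was not");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        new Thread(latch::countDown).start();

        // The change boils down to raising the timeout passed here: the test
        // previously waited one second; the commits allow a longer wait.
        assertLatchCountedDown(latch, 5, TimeUnit.SECONDS);
    }
}
```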
@andyb-elastic (Contributor) commented Dec 3, 2018

I believe this is the same issue on master today; it doesn't reproduce.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/96/console

build-unix-96.txt.zip

  1> [2018-12-03T02:32:03,086][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 4.90s J1 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([730C273FDEA107A9:3D13867A5B6F3009]:0)
   >    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:855)
   >    at java.lang.Thread.run(Thread.java:748)
REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=730C273FDEA107A9 -Dtests.class=org.elasticsearch.transport.RemoteClusterServiceTests -Dtests.method="testCollectSearchShards" -Dtests.security.manager=true -Dtests.locale=he-IL -Dtests.timezone=SystemV/YST9 -Dcompiler.java=11 -Druntime.java=8

@andyb-elastic reopened this on Dec 3, 2018
@spinscale (Contributor) commented Dec 18, 2018

Another failure waiting on the latch: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/838/console

  1> [2018-12-18T14:57:45,065][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 6.53s J2 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([B51888CD81E2A99A:FB072988042C9E3A]:0)
   >    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:857)
   >    at java.lang.Thread.run(Thread.java:748)

@alpar-t (Contributor) commented Jan 9, 2019

@alpar-t reopened this on Jan 9, 2019

@jkakavas (Member)

@javanna assigned javanna and unassigned cbuescher on Feb 1, 2019
javanna added a commit to javanna/elasticsearch that referenced this issue Feb 1, 2019
We have recently gone from 1 second to 5, and we still had some failures.
Increasing this timeout considerably one last time, which should help us understand whether the failures are caused by test infra slowness or by a bug.

Relates to elastic#33852
javanna added a commit to javanna/elasticsearch that referenced this issue Feb 1, 2019
We have recently gone from 1 second to 5, and we still had some failures. Note that testCollectSearchShards was recently moved and reworked from RemoteClusterServiceTests to TransportSearchActionTests.
Increasing this timeout considerably one last time, which should help us understand whether the failures are caused by test infra slowness or by a bug.

Relates to elastic#33852
@jasontedor added the v8.0.0 label and removed the v7.0.0 label on Feb 6, 2019
@javanna removed the v6.5.0 label on Feb 8, 2019
@javanna removed the v8.0.0 label on Feb 8, 2019
@javanna (Member) commented Feb 14, 2019

I have done quite some digging on this one. I was not able to reproduce the issue even when running the test in a loop for hours. The last failure is from January 23rd; at the time, this test was failing around once or twice per day. We had previously increased the wait for the responses a couple of times. The failure that I analyzed happens before we even simulate network failures: at the second collectSearchShards call, we send a preference of index_not_found, which causes an exception to be returned rather than a response, in order to simulate the default behaviour when an exception is received from all the remote clusters. The latch times out after 5 seconds, meaning that we don't get a response from all the clusters within 5 seconds, which seems unlikely given that we are using mock transport services. I could not find any place where we might fail to notify the listener. I did find something interesting in the logs though:

  1> [2019-01-21T12:11:25,820][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] before test
  1> [2019-01-21T12:11:26,324][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10600}, bound_addresses {[::1]:10600}, {127.0.0.1:10600}
  1> [2019-01-21T12:11:26,328][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10601}, bound_addresses {[::1]:10601}, {127.0.0.1:10601}
  1> [2019-01-21T12:11:26,336][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10602}, bound_addresses {[::1]:10602}, {127.0.0.1:10602}
  1> [2019-01-21T12:11:26,340][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10603}, bound_addresses {[::1]:10603}, {127.0.0.1:10603}
  1> [2019-01-21T12:11:26,355][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10604}, bound_addresses {[::1]:10604}, {127.0.0.1:10604}
  1> [2019-01-21T12:11:26,360][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10605}, bound_addresses {[::1]:10605}, {127.0.0.1:10605}
  1> [2019-01-21T12:11:26,364][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10606}, bound_addresses {[::1]:10606}, {127.0.0.1:10606}
  1> [2019-01-21T12:11:26,368][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10607}, bound_addresses {[::1]:10607}, {127.0.0.1:10607}
  1> [2019-01-21T12:11:26,373][INFO ][o.e.t.TransportService   ] [testCollectSearchShards] publish_address {127.0.0.1:10608}, bound_addresses {[::1]:10608}, {127.0.0.1:10608}
  1> [2019-01-21T12:11:34,709][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [50]
  1> [2019-01-21T12:11:34,713][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [55]
  1> [2019-01-21T12:11:34,713][WARN ][o.e.t.TransportService   ] [[elasticsearch[transport_worker][T#2]]] Transport response handler not found of id [49]
  1> [2019-01-21T12:11:34,721][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectSearchShards] after test
FAILURE 8.92s J3 | RemoteClusterServiceTests.testCollectSearchShards <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: expected latch to be counted down after 5s, but was not
   > Expected: is <true>
   >      but: was <false>
   > 	at __randomizedtesting.SeedInfo.seed([B954F1B646535EBB:F74B50F3C39D691B]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.awaitLatch(ElasticsearchAssertions.java:713)
   > 	at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectSearchShards(RemoteClusterServiceTests.java:765)

The three `Transport response handler not found of id [***]` warnings appear about 8 seconds after the test started, which seems to indicate that the responses do eventually come back; increasing the wait to 10 seconds would likely have made the test succeed. I believe such warnings mean the responses came back late, yet quickly enough to reach the sender transport while it was being closed but before it was fully shut down. I wonder whether this has to do with infra issues combined with using 9 mock transport services in this particular run.
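
To make the pattern under discussion concrete, here is a rough, self-contained sketch of the collect-and-count-down logic: each remote cluster answers the shard-collection request with either a response or an exception, and either outcome must count the latch down, otherwise the five-second await times out. All names here are hypothetical; this is not the production TransportSearchAction or test code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CollectSearchShardsSketch {

    // Hypothetical per-cluster callback: both outcomes must release the latch.
    interface PerClusterListener {
        void onResponse(String clusterAlias, Object shardsResponse);
        void onFailure(String clusterAlias, Exception e);
    }

    public static void main(String[] args) throws InterruptedException {
        String[] remoteClusters = {"remote1", "remote2", "remote3"};
        Map<String, Object> responses = new ConcurrentHashMap<>();
        Map<String, Exception> failures = new ConcurrentHashMap<>();
        CountDownLatch latch = new CountDownLatch(remoteClusters.length);

        PerClusterListener listener = new PerClusterListener() {
            @Override
            public void onResponse(String clusterAlias, Object shardsResponse) {
                responses.put(clusterAlias, shardsResponse);
                latch.countDown();
            }

            @Override
            public void onFailure(String clusterAlias, Exception e) {
                failures.put(clusterAlias, e);
                latch.countDown();
            }
        };

        // Simulate the "index_not_found" preference: every remote cluster
        // answers with an exception instead of a shards response.
        for (final String cluster : remoteClusters) {
            new Thread(() -> listener.onFailure(cluster,
                    new RuntimeException("simulated index_not_found on " + cluster))).start();
        }

        // The test waits up to five seconds for every cluster to answer, one
        // way or the other; a single missed callback makes this await fail.
        if (!latch.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("expected latch to be counted down after 5s, but was not");
        }
        System.out.println("responses: " + responses.size() + ", failures: " + failures.size());
    }
}
```

This is only a sketch of the pattern; as noted below, the real code paths have since moved to TransportSearchAction and TransportSearchActionTests.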

I am not too happy about increasing the timeout once more, as I proposed in #38198 and #38199, at least not until we understand exactly what slows things down. We could try decreasing the number of remote clusters (hence nodes) that we connect to, but I am not sure that is the problem either. Also, this test and the corresponding tested code have been moved on master, 7.0 and 7.x to TransportSearchAction and TransportSearchActionTests, with few changes that I would not expect to have solved the issue. Furthermore, the test was left unchanged in 6.7 and has not failed since. That again points me towards infra issues that have since been solved and have not recurred in the last few weeks.

@tbrooks8 do you have any suggestions on changes to make here, or on what could have caused this situation?

@javanna changed the title from "[CI] RemoteClusterServiceTests.testCollectSearchShards times out waiting in latch" to "[CI] RemoteClusterServiceTests/TransportSearchActionTests#testCollectSearchShards times out waiting in latch" on Feb 19, 2019
@javanna (Member) commented Feb 19, 2019

Closing this one as it has not failed in a month or so. I think that infra changes have helped with this.

@javanna closed this as completed on Feb 19, 2019
@dnhatn (Member) commented Jun 9, 2019

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+intake/35/consoleText

org.elasticsearch.action.search.TransportSearchActionTests > testCollectSearchShards FAILED
    java.lang.IllegalStateException: failed to connect to remote clusters
        at __randomizedtesting.SeedInfo.seed([45078AC0A9FE43BB:B182B852C30741B]:0)
        at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:431)
        at org.elasticsearch.transport.TransportService.doStart(TransportService.java:241)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
        at org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards(TransportSearchActionTests.java:611)

        Caused by:
        java.util.concurrent.ExecutionException: ConnectTransportException[[][127.0.0.1:10311] connect_timeout[30s]]
            at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
            at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:239)
            at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:65)
            at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:425)
            ... 3 more

            Caused by:
            ConnectTransportException[[][127.0.0.1:10311] connect_timeout[30s]]
REPRODUCE WITH: ./gradlew :server:test --tests "org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards" -Dtests.seed=45078AC0A9FE43BB -Dtests.security.manager=true -Dtests.locale=sv-SE -Dtests.timezone=PRT -Dcompiler.java=12 -Druntime.java=8

@javanna (Member) commented Jun 11, 2019

Heya @dnhatn, at first glance this seems like a different failure? This is a connect timeout, while historically this test has failed with latch awaits timing out.
