
[CI] TransportSearchActionTests.testCollectSearchShards fails due to timeout #44563

Closed · bizybot opened this issue Jul 18, 2019 · 6 comments
Labels: :Distributed Coordination/Network, >test-failure


bizybot commented Jul 18, 2019

Intake build failed for branch 7.2:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+intake/262/console
Locally, I could not reproduce this with the following:

./gradlew :server:test --tests "org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards" -Dtests.seed=BCDB8CE0033546CE -Dtests.security.manager=true -Dtests.locale=es-PA -Dtests.timezone=America/Winnipeg -Dcompiler.java=12 -Druntime.java=8
21:28:53 org.elasticsearch.action.search.TransportSearchActionTests > testCollectSearchShards FAILED
21:28:53     java.lang.IllegalStateException: failed to connect to remote clusters
21:28:53         at __randomizedtesting.SeedInfo.seed([BCDB8CE0033546CE:F2C42DA586FB716E]:0)
21:28:53         at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:432)
21:28:53         at org.elasticsearch.transport.TransportService.doStart(TransportService.java:241)
21:28:53         at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
21:28:53         at org.elasticsearch.action.search.TransportSearchActionTests.testCollectSearchShards(TransportSearchActionTests.java:611)
21:28:53 
21:28:53         Caused by:
21:28:53         java.util.concurrent.ExecutionException: ConnectTransportException[[][127.0.0.1:10308] connect_timeout[30s]]
21:28:53             at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
21:28:53             at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:239)
21:28:53             at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:65)
21:28:53             at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:426)
21:28:53             ... 3 more
21:28:53 
21:28:53             Caused by:
21:28:53             ConnectTransportException[[][127.0.0.1:10308] connect_timeout[30s]]

In the past, this has failed but was reported in another issue: #33852 (comment)

bizybot added the :Search/Search, >test-failure, and v7.2.0 labels on Jul 18, 2019
@elasticmachine

Pinging @elastic/es-search

javanna self-assigned this on Jul 22, 2019

javanna commented Aug 15, 2019

Looking at build-stats, this test fails about 5 times a month, with either "failed to connect to remote clusters" or "expected latch to be counted down after 5s, but was not". The root cause could well be the same for both errors. See this long comment for previous analysis: #33852 (comment).
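
For context, the second failure message comes from a bounded latch wait in the test. The snippet below is only a hedged illustration of that pattern, not the actual test code; the method name, latch variable, and timeout handling are assumptions.

// Hedged illustration only: the kind of bounded CountDownLatch wait that
// produces an "expected latch to be counted down after 5s, but was not"
// failure. Names and structure are assumptions, not the actual test code.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import static org.junit.Assert.assertTrue;

class LatchAwaitExample {
    static void awaitResponses(CountDownLatch latch) throws InterruptedException {
        // If the async connect/search callbacks never fire, await() returns
        // false after 5 seconds and the assertion fails with the message
        // quoted above.
        assertTrue("expected latch to be counted down after 5s, but was not",
                latch.await(5, TimeUnit.SECONDS));
    }
}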

javanna added the :Distributed Coordination/Network label and removed the :Search/Search label on Aug 15, 2019
@elasticmachine

Pinging @elastic/es-distributed

javanna removed their assignment on Aug 15, 2019
javanna removed the v7.2.0 label on Aug 15, 2019

javanna commented Aug 15, 2019

I don't see anything CCS- or search-specific in this failure, hence I relabeled it to :Distributed/Network. Let me know what I can do to help debug it.

henningandersen self-assigned this on Aug 21, 2019

henningandersen commented Aug 21, 2019

The first failure in recent history occurred on May 14th (the previous one was on January 23rd, which may speculatively have been caused by infra issues). This correlates somewhat with #40978 (April 8), which caused issues until #43983 (July 5). Notice that this test uses MockTransportService.createNewService, which calculates a port range to use, and that calculation was wrong until July 5th. Unfortunately, #43983 never went into 7.2, so this build on July 18th was hit as well.
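
As a rough illustration of why a per-worker port-range calculation matters here: if two concurrent test JVMs end up computing overlapping ranges, they can collide on bind/connect and surface as connect_timeout failures like the one above. The sketch below is hypothetical and not the MockTransportService.createNewService logic; the system property name, constants, and arithmetic are assumptions for illustration only.

// Hypothetical sketch of a per-worker port-range calculation; not the actual
// Elasticsearch code. Property name and arithmetic are assumptions.
final class PortRangeSketch {
    static final int BASE_PORT = 10300;
    static final int PORTS_PER_WORKER = 100;

    static int basePortForWorker(int workerId) {
        // Each test worker is supposed to get its own slice of 100 ports.
        // If workerId is not actually unique per JVM (or the arithmetic is
        // off), two workers share a range and connections time out.
        return BASE_PORT + (workerId % 223) * PORTS_PER_WORKER;
    }

    public static void main(String[] args) {
        int workerId = Integer.parseInt(System.getProperty("workerId", "0"));
        System.out.println("port range starts at " + basePortForWorker(workerId));
    }
}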

There is one newer 7.x build failure: https://scans.gradle.com/s/zpxr2bfpfnw2q, but it contains messages like this one:

[2019-08-06T07:09:46,170][WARN ][o.e.t.n.MockNioTransport ] [org.elasticsearch.action.search.TransportSearchActionTests] Potentially blocked execution on network thread [elasticsearch[node_remote2][transport_worker][T#2]] [RUNNABLE] [30021 milliseconds]: 
sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)

and it ran on Debian, which indicates it was hit by #43387.

This is certainly not conclusive, but it seems like a plausible explanation.

I will leave this issue open for a while to see if we hit this error again.

@henningandersen

The test has not failed again, so I will close this, assuming the above explanation holds true.
