
SettingsBasedHostProviderIT#testClusterFormsByScanningPorts couldn't find a free port #34781

Closed
jtibshirani opened this issue Oct 23, 2018 · 5 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI


@jtibshirani
Contributor

It looks like the cluster from SettingsBasedHostProviderIT#testClusterFormsWithSingleSeedHostInSettings isn't always properly shut down (we can see SEVERE: There are still zombie threads that couldn't be terminated: ... in the logs). This causes the suite to fail as a whole because threads are leaked, and also the next case SettingsBasedHostProviderIT#testClusterFormsByScanningPorts fails with a port conflict. I haven't been able to reproduce this locally.


Link to the build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/14/console

Example reproduction line:

./gradlew :server:integTest \
  -Dtests.seed=D9449C904EA3FE3D \
  -Dtests.class=org.elasticsearch.discovery.zen.SettingsBasedHostProviderIT \
  -Dtests.security.manager=true \
  -Dtests.locale=lv-LV \
  -Dtests.timezone=Africa/Khartoum \
  -Dcompiler.java=11 \
  -Druntime.java=8

Relevant excerpts from the logs:

SEVERE: There are still zombie threads that couldn't be terminated:
  2>    1) Thread[id=9776, name=elasticsearch[node_t1][generic][T#1], state=WAITING, group=TGRP-SettingsBasedHostProviderIT]
  2>         at sun.misc.Unsafe.park(Native Method)
  2>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  2>         at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
  2>         at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
  2>         at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
  2>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
  2>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
  2>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  2>         at java.lang.Thread.run(Thread.java:748)
  2>    2) Thread[id=9772, name=elasticsearch[node_t1][scheduler][T#1], state=TIMED_WAITING, group=TGRP-SettingsBasedHostProviderIT]
  2>         at sun.misc.Unsafe.park(Native Method)
  2>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
  2>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
  2>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
  2>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
  2>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
  2>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
  2>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  2>         at java.lang.Thread.run(Thread.java:748)
  1>    {node_t2}{eTfDiXEjTeW36FIwAQPIMw}{_-79Lb0zQzS2QNF3xm2TUg}{127.0.0.1}{127.0.0.1:44959}
  1>    {node_t1}{C8ielKNXQHy_CWeeULZJHg}{shpu0GWiSoOaJY1WTe6anQ}{127.0.0.1}{127.0.0.1:33331}, local
  1> [2018-10-23T13:45:52,004][WARN ][o.e.t.n.MockNioTransport ] [node_t3] send message failed [channel: NioSocketChannel{localAddress=0.0.0.0/0.0.0.0:40045, remoteAddress=/127.0.0.1:45116}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at org.elasticsearch.nio.SocketChannelContext.sendMessage(SocketChannelContext.java:131) [elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$MockSocketChannel.sendMessage(MockNioTransport.java:288) [framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:921) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1010) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:978) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:66) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:60) [main/:?]
  1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [main/:?]
  1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:287) [main/:?]
  1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:282) [main/:?]
  1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1448) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
...
ERROR   0.12s J1 | SettingsBasedHostProviderIT.testClusterFormsByScanningPorts <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: failed to start nodes
   > 	at __randomizedtesting.SeedInfo.seed([D9449C904EA3FE3D:47EF44A5DD493630]:0)
   > 	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1574)
   > 	at org.elasticsearch.test.InternalTestCluster.startNodes(InternalTestCluster.java:1891)
   > 	at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:1859)
   > 	at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:1852)
   > 	at org.elasticsearch.discovery.zen.SettingsBasedHostProviderIT.testClusterFormsByScanningPorts(SettingsBasedHostProviderIT.java:78)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [44859-44860]]; nested: BindException[Address already in use];
   > 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   > 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
   > 	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1569)
   > 	... 41 more
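The underlying `BindException[Address already in use]` is a plain port-binding conflict. As an illustration only (JDK sockets, not Elasticsearch code), a minimal sketch of how that error arises when a second server socket tries to bind a port that is still held:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BindConflictSketch {
    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // First "node": let the OS assign any free port.
        try (ServerSocket first = new ServerSocket(0, 1, loopback)) {
            int port = first.getLocalPort();
            // Second "node": try to bind the same port while it is still held.
            try (ServerSocket second = new ServerSocket(port, 1, loopback)) {
                System.out.println("unexpectedly bound twice");
            } catch (IOException e) {
                // Typically java.net.BindException: Address already in use
                System.out.println("bind failed: " + e.getClass().getSimpleName());
            }
        }
    }
}
```

In the failing test the conflicting listener belongs to a leftover node from the previous test case rather than to the same process, but the resulting exception is the same.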

Full build log: build_log.txt

@jtibshirani jtibshirani added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jtibshirani
Contributor Author

Muted the test in c5a0739.

@DaveCTurner
Contributor

I think this failure is actually because the nodes failed to start up, and the message about leaking threads is because the test then bailed out without doing any cleanup:

   > Throwable #1: java.lang.RuntimeException: failed to start nodes
...
   > Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [44859-44860]]; nested: BindException[Address already in use];

Looking at testClusterFormsByScanningPorts, I see that this is not wholly unexpected:

// This test will fail if all 4 ports just less than the one used by the first node are already bound by something else. It's hard
// to know how often this might happen in reality, so let's try it and see.

We didn't, however, discuss what to do if this happened, so I'm marking this for team discussion.

@DaveCTurner DaveCTurner changed the title SettingsBasedHostProviderIT can fail because of leaked threads. SettingsBasedHostProviderIT#testClusterFormsByScanningPorts couldn't find a free port Oct 24, 2018
@DaveCTurner
Contributor

DaveCTurner commented Oct 24, 2018

We discussed this in today's team sync. This test exercises a property that is important for the out-of-the-box experience of starting multiple Elasticsearch nodes on the same host, even though it is not that useful in real production deployments, so it is something we want to keep testing. However, it is always going to be tricky for an integration test to reliably find two free ports that are ≤5 apart, and there is no option to increase the "5" in this statement. We therefore decided to try testing this in a packaging test instead, since those run in a much cleaner environment.
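To make the flakiness concrete, here is a rough sketch (plain JDK sockets, not the test's actual logic; the scan width of 4 mirrors the comment quoted above) of what the test implicitly relies on: after the first node binds a port, at least one of the few ports just below it must also be free, which a busy CI host cannot guarantee.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class PortScanSketch {
    // Illustrative helper: try to bind the given port, reporting whether it was free.
    static boolean isFree(int port) {
        try (ServerSocket ss = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
            return true;
        } catch (IOException e) {
            return false; // e.g. BindException: Address already in use
        }
    }

    public static void main(String[] args) throws IOException {
        // Let the OS pick a free port for the "first node".
        try (ServerSocket first = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            int firstPort = first.getLocalPort();
            // The second node can only join by scanning if one of the few
            // ports just below the first node's port is also free.
            boolean anyFree = false;
            for (int p = firstPort - 4; p < firstPort; p++) {
                if (p > 0 && isFree(p)) {
                    anyFree = true;
                    break;
                }
            }
            System.out.println("first node port: " + firstPort);
            System.out.println("neighbouring port free: " + anyFree);
        }
    }
}
```

Whether `anyFree` is true depends entirely on what else is running on the host, which is why a packaging test in a clean environment is a better fit.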

I think this specific test can be deleted and I will open an issue about adding this kind of packaging test after speaking to some people about it.

@DaveCTurner
Contributor

Closed by #34813
