
SettingsBasedHostProviderIT#testClusterFormsByScanningPorts couldn't find a free port #34781

Closed
jtibshirani opened this issue Oct 23, 2018 · 5 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI


@jtibshirani
Contributor

It looks like the cluster from SettingsBasedHostProviderIT#testClusterFormsWithSingleSeedHostInSettings isn't always properly shut down (we can see SEVERE: There are still zombie threads that couldn't be terminated: ... in the logs). This causes the suite to fail as a whole because threads are leaked, and also the next case SettingsBasedHostProviderIT#testClusterFormsByScanningPorts fails with a port conflict. I haven't been able to reproduce this locally.


Link to the build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/14/console

Example reproduction line:

./gradlew :server:integTest \
  -Dtests.seed=D9449C904EA3FE3D \
  -Dtests.class=org.elasticsearch.discovery.zen.SettingsBasedHostProviderIT \
  -Dtests.security.manager=true \
  -Dtests.locale=lv-LV \
  -Dtests.timezone=Africa/Khartoum \
  -Dcompiler.java=11 \
  -Druntime.java=8

Relevant excerpts from the logs:

SEVERE: There are still zombie threads that couldn't be terminated:
  2>    1) Thread[id=9776, name=elasticsearch[node_t1][generic][T#1], state=WAITING, group=TGRP-SettingsBasedHostProviderIT]
  2>         at sun.misc.Unsafe.park(Native Method)
  2>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  2>         at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
  2>         at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
  2>         at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
  2>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
  2>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
  2>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  2>         at java.lang.Thread.run(Thread.java:748)
  2>    2) Thread[id=9772, name=elasticsearch[node_t1][scheduler][T#1], state=TIMED_WAITING, group=TGRP-SettingsBasedHostProviderIT]
  2>         at sun.misc.Unsafe.park(Native Method)
  2>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
  2>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
  2>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
  2>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
  2>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
  2>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
  2>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  2>         at java.lang.Thread.run(Thread.java:748)
  1>    {node_t2}{eTfDiXEjTeW36FIwAQPIMw}{_-79Lb0zQzS2QNF3xm2TUg}{127.0.0.1}{127.0.0.1:44959}
  1>    {node_t1}{C8ielKNXQHy_CWeeULZJHg}{shpu0GWiSoOaJY1WTe6anQ}{127.0.0.1}{127.0.0.1:33331}, local
  1> [2018-10-23T13:45:52,004][WARN ][o.e.t.n.MockNioTransport ] [node_t3] send message failed [channel: NioSocketChannel{localAddress=0.0.0.0/0.0.0.0:40045, remoteAddress=/127.0.0.1:45116}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at org.elasticsearch.nio.SocketChannelContext.sendMessage(SocketChannelContext.java:131) [elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$MockSocketChannel.sendMessage(MockNioTransport.java:288) [framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:921) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1010) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:978) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:66) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:60) [main/:?]
  1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [main/:?]
  1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:287) [main/:?]
  1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:282) [main/:?]
  1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1448) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
...
ERROR   0.12s J1 | SettingsBasedHostProviderIT.testClusterFormsByScanningPorts <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: failed to start nodes
   > 	at __randomizedtesting.SeedInfo.seed([D9449C904EA3FE3D:47EF44A5DD493630]:0)
   > 	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1574)
   > 	at org.elasticsearch.test.InternalTestCluster.startNodes(InternalTestCluster.java:1891)
   > 	at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:1859)
   > 	at org.elasticsearch.test.InternalTestCluster.startNode(InternalTestCluster.java:1852)
   > 	at org.elasticsearch.discovery.zen.SettingsBasedHostProviderIT.testClusterFormsByScanningPorts(SettingsBasedHostProviderIT.java:78)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [44859-44860]]; nested: BindException[Address already in use];
   > 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   > 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
   > 	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1569)
   > 	... 41 more
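The underlying `BindException[Address already in use]` is a plain port-binding conflict. As an illustration only (JDK sockets, not Elasticsearch code), a minimal sketch of how that error arises when a second server socket tries to bind a port that is still held:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BindConflictSketch {
    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // First "node": let the OS assign any free port.
        try (ServerSocket first = new ServerSocket(0, 1, loopback)) {
            int port = first.getLocalPort();
            // Second "node": try to bind the same port while it is still held.
            try (ServerSocket second = new ServerSocket(port, 1, loopback)) {
                System.out.println("unexpectedly bound twice");
            } catch (IOException e) {
                // Typically java.net.BindException: Address already in use
                System.out.println("bind failed: " + e.getClass().getSimpleName());
            }
        }
    }
}
```

In the failing test the conflicting listener belongs to a leftover node from the previous test case rather than to the same process, but the resulting exception is the same.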

Full build log: build_log.txt

@jtibshirani jtibshirani added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jtibshirani
Contributor Author

Muted the test in c5a0739.

@DaveCTurner
Contributor

I think this failure is actually because the nodes failed to start up, and the message about leaking threads is because the test then bailed out without doing any cleanup:

   > Throwable #1: java.lang.RuntimeException: failed to start nodes
...
   > Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [44859-44860]]; nested: BindException[Address already in use];

Looking at testClusterFormsByScanningPorts, I see that this is not wholly unexpected:

// This test will fail if all 4 ports just less than the one used by the first node are already bound by something else. It's hard
// to know how often this might happen in reality, so let's try it and see.

We didn't, however, discuss what to do if this happened, so I'm marking this for team discussion.

@DaveCTurner DaveCTurner changed the title SettingsBasedHostProviderIT can fail because of leaked threads. SettingsBasedHostProviderIT#testClusterFormsByScanningPorts couldn't find a free port Oct 24, 2018
@DaveCTurner
Contributor

DaveCTurner commented Oct 24, 2018

We discussed this in today's team sync. This test exercises a property that is important for the out-of-the-box experience of starting multiple Elasticsearch nodes on the same host, even though it is not that useful in real production deployments, so it is something we want to keep testing. However, it is always going to be tricky for an integration test to reliably find two free ports that are ≤5 apart, and there is no option to increase the "5" in this statement. We therefore decided to try testing this in a packaging test instead, since those run in a much cleaner environment.
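To make the flakiness concrete, here is a rough sketch (plain JDK sockets, not the test's actual logic; the scan width of 4 mirrors the comment quoted above) of what the test implicitly relies on: after the first node binds a port, at least one of the few ports just below it must also be free, which a busy CI host cannot guarantee.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class PortScanSketch {
    // Illustrative helper: try to bind the given port, reporting whether it was free.
    static boolean isFree(int port) {
        try (ServerSocket ss = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
            return true;
        } catch (IOException e) {
            return false; // e.g. BindException: Address already in use
        }
    }

    public static void main(String[] args) throws IOException {
        // Let the OS pick a free port for the "first node".
        try (ServerSocket first = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            int firstPort = first.getLocalPort();
            // The second node can only join by scanning if one of the few
            // ports just below the first node's port is also free.
            boolean anyFree = false;
            for (int p = firstPort - 4; p < firstPort; p++) {
                if (p > 0 && isFree(p)) {
                    anyFree = true;
                    break;
                }
            }
            System.out.println("first node port: " + firstPort);
            System.out.println("neighbouring port free: " + anyFree);
        }
    }
}
```

Whether `anyFree` is true depends entirely on what else is running on the host, which is why a packaging test in a clean environment is a better fit.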

I think this specific test can be deleted and I will open an issue about adding this kind of packaging test after speaking to some people about it.

@DaveCTurner
Contributor

Closed by #34813
