
[CI] NestedObjectMapperTests fail with leaked threads #37543

Closed
jkakavas opened this issue Jan 16, 2019 · 10 comments
@jkakavas
Member

This doesn't reproduce locally, but it has failed 5 times in the last week. It started failing on January 9th, yet I don't see any relevant recent changes in the code.

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2FM,mode:quick,to:now%2FM))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:NestedObjectMapperTests),sort:!(time,desc))

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1035 is an example

ERROR   0.00s J2 | NestedObjectMapperTests (suite) <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE scope at org.elasticsearch.index.mapper.NestedObjectMapperTests: 
   >    1) Thread[id=927, name=elasticsearch[node_s_0][__mock_network_thread][T#1], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   >    2) Thread[id=928, name=elasticsearch[node_s_0][__mock_network_thread][T#2], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   > 	at __randomizedtesting.SeedInfo.seed([A00E57EA1BC77F08]:0)Throwable #2: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   >    1) Thread[id=927, name=elasticsearch[node_s_0][__mock_network_thread][T#1], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   >    2) Thread[id=928, name=elasticsearch[node_s_0][__mock_network_thread][T#2], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   > 	at __randomizedtesting.SeedInfo.seed([A00E57EA1BC77F08]:0)
@jkakavas jkakavas added :Search Foundations/Mapping Index mappings, including merging and defining field types >test-failure Triaged test failures from CI labels Jan 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@matriv
Contributor

matriv commented Feb 4, 2019

I cannot reproduce this locally after many thousands of iterations.
Here is a reproduce command found in the Kibana logs:

./gradlew :server:unitTest -Dtests.seed=C593D66EF8528D36 -Dtests.class=org.elasticsearch.index.mapper.NestedObjectMapperTests -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=Etc/UTC -Dcompiler.java=11 -Druntime.java=8

@matriv matriv removed their assignment Feb 4, 2019
@not-napoleon not-napoleon self-assigned this Feb 5, 2019
@not-napoleon
Member

I've done some digging into this one, and I'm pretty sure it's the same issue as in #37469 and #32369. I haven't yet traced where the specific leaked threads are created, or why they only sometimes leak.

@not-napoleon
Member

I'm recategorizing this as a distributed issue, and reaching out to the distributed team for help with it. Here's what I know so far:

This issue and the two linked issues result from threads created in the MockTcpTransport test shim class that do not exit properly during node shutdown. That class was removed from master in #36628 but still exists in the 6.x line. The threads are cleaned up as part of node shutdown in ESSingleNodeTestCase, so I don't think there's anything to be gained by muting specific instances of this issue. By the same logic, I don't think the issue is specific to NestedObjectMapperTests.

Incidents of this issue spiked during some of the infra difficulties in January, and have since tapered off. That and the fact that the relevant class is no longer in master might be enough to say we should just close this issue, but I'd like a second opinion on that. You can see the trend in the relevant failures here: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-60d,mode:quick,to:now))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'com.carrotsearch.randomizedtesting.ThreadLeakError%20AND%20__mock_network_thread*'),sort:!(process.time-start,desc))

I'm happy to help out whoever wants to pick this up, but at this point I think it needs someone more familiar with the relevant sections of code than I am.

@not-napoleon not-napoleon added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels Feb 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@not-napoleon not-napoleon removed their assignment Feb 14, 2019
@henningandersen
Contributor

henningandersen commented Feb 15, 2019

The log file from:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+multijob-unix-compatibility/os=debian-9/110/console

indicates that Node.stop took 20 seconds:

  1> [2019-02-11T23:57:06,163][INFO ][o.e.n.Node               ] [suite] stopping ...
  1> [2019-02-11T23:57:26,169][INFO ][o.e.n.Node               ] [suite] stopped

I think this comes from MockTcpTransport.stopInternal:

    @Override
    protected void stopInternal() {
        ThreadPool.terminate(executor, 10, TimeUnit.SECONDS);
        synchronized (openChannels) {
            assert openChannels.isEmpty() : "there are still open channels: " + openChannels;
        }
    }

ThreadPool.terminate will wait for the timeout twice: it first tries to shut down gracefully, and then falls back to shutdownNow.
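
As a rough illustration, here is a minimal sketch of the two-phase pattern described above (not the actual ThreadPool.terminate code): shut down gracefully and wait up to the timeout, then call shutdownNow and wait up to the timeout again. With the 10-second timeout used in stopInternal, accept threads that never exit would account for roughly the 20 seconds observed between "stopping ..." and "stopped".

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    final class TwoPhaseTerminate {

        /**
         * Sketch of a two-phase terminate: a graceful shutdown() followed by a
         * forced shutdownNow(), each waiting up to the given timeout. A thread
         * stuck in a blocking accept() that ignores interruption can therefore
         * hold the pool for roughly twice the timeout before the caller gives up.
         */
        static boolean terminate(ExecutorService executor, long timeout, TimeUnit unit) {
            executor.shutdown();                                   // stop accepting new tasks
            try {
                if (executor.awaitTermination(timeout, unit)) {
                    return true;                                   // everything finished in time
                }
                executor.shutdownNow();                            // interrupt whatever is still running
                return executor.awaitTermination(timeout, unit);   // second wait for the same timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                executor.shutdownNow();
                return false;
            }
        }
    }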

@henningandersen
Contributor

henningandersen commented Feb 18, 2019

There are three main ways this could potentially happen (TcpTransport/MockTcpTransport):

  1. The synchronization around serverChannels is incomplete, so close might not see all channels.
  2. The generic() thread pool is shut down before we get to doStop.
  3. close() on the socket does not break accept() (a JVM bug).

Option 3 sounds unlikely; I tried to provoke it but was not successful.
For option 2, I could not find any indication of this. Also, we would then have seen a 30-second wait on the latch in doStop, not a 20-second wait.
Option 1 also seems unable to happen, since everything around starting and stopping runs sequentially, with happens-before relationships established by Future.get, executor.execute, etc.
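
To make that last point concrete, here is a purely illustrative sketch (hypothetical names, not the TcpTransport code) of why a write made on the binding thread is visible to the stopping thread without extra locking: submitting a task to an executor and later calling Future.get() on it establishes the necessary happens-before edges.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public final class HappensBeforeSketch {

        // Plain, unsynchronized map standing in for something like serverChannels.
        private final Map<String, Object> channels = new HashMap<>();

        void bindThenStop(ExecutorService executor) throws Exception {
            // The "bind" work runs on another thread and mutates the map ...
            Future<?> bound = executor.submit(() -> {
                channels.put("default", new Object());
            });

            // ... but Future.get() establishes a happens-before edge, so the thread
            // doing the stop is guaranteed to see the entry added above.
            bound.get();

            // Safe to read here without extra synchronization: the write is visible.
            channels.forEach((profile, channel) -> System.out.println("closing " + profile));
        }

        public static void main(String[] args) throws Exception {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            new HappensBeforeSketch().bindThenStop(executor);
            executor.shutdown();
        }
    }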

Note that we do get as far as stopping the transport, and the 20-second wait noted above is a strong indication that the problem lies in the current test's transport. Shutdown also does not fail exceptionally: we get the "stopped" message, and there are no "uncaught exception" log entries from QueueUncaughtExceptionsHandler.

As Mark also noted, master no longer uses MockTcpTransport, so it seems likely this will not appear there.

I will open a PR that does the following:

  1. Assert that the generic() thread pool is not shut down when we enter doStop().
  2. Add synchronization around reads of the serverChannels map (see the sketch below).

This will be backported to both 6.x and 6.6, since those are the only branches where we still see these errors occasionally.

Note that the specific tests affected vary greatly; failures have been found in both ESSingleNodeTestCase- and ESIntegTestCase-style tests.
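
The following is a hedged sketch of the kind of change item 2 describes, using hypothetical names rather than the actual PR: reads and writes of the server-channel map go through the same monitor, and once the transport has stopped, no new channels can be registered.

    import java.io.Closeable;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.HashMap;
    import java.util.Map;

    public final class ServerChannelRegistry {

        // Guarded by "this": writers (bind) and readers (stop) use the same monitor.
        private final Map<String, Closeable> serverChannels = new HashMap<>();
        private boolean stopped;

        /** Register a newly bound channel; refuse it if the transport has already stopped. */
        public synchronized void register(String profile, Closeable channel) throws IOException {
            if (stopped) {
                channel.close();
                throw new IllegalStateException("transport stopped, cannot bind profile [" + profile + "]");
            }
            serverChannels.put(profile, channel);
        }

        /** Stop the transport: under the same monitor, close every channel and block new registrations. */
        public synchronized void stop() {
            stopped = true;
            IOException firstFailure = null;
            for (Map.Entry<String, Closeable> entry : serverChannels.entrySet()) {
                try {
                    entry.getValue().close();
                } catch (IOException e) {
                    if (firstFailure == null) {
                        firstFailure = e;   // keep closing the rest, remember the first failure
                    }
                }
            }
            serverChannels.clear();
            if (firstFailure != null) {
                throw new UncheckedIOException("failed to close at least one server channel", firstFailure);
            }
        }
    }

Using one monitor for both sides means a stop that races with a late bind either sees the new channel and closes it, or makes the bind fail fast, so no server channel (and no accept thread) is left behind.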

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long-shot but if the thread pool was shutdown before entering this
code, we would silently forget about closing server channels so added
assert.

Relates to CI failure issue: elastic#37543
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Further improve the locking around bind and stop to ensure that once
transport stops and closes serverChannels, no new channels can be added.

Relates to CI failure issue: elastic#37543
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Simplify locking around serverChannels a bit.

Relates to CI failure issue: elastic#37543
henningandersen added a commit that referenced this issue Feb 18, 2019
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long-shot but if the thread pool was shutdown before entering this
code, we would silently forget about closing server channels so added
assert.

Strengthened the locking to ensure that once we stop the transport, no
new server channels can be made.

Relates to CI failure issue: #37543
@ywelsch
Contributor

ywelsch commented Mar 8, 2019

@henningandersen can we close this now as #39031 is merged?

@henningandersen
Contributor

@ywelsch, unfortunately not, since we had three test failures with leaked threads last week. I will continue investigating based on the logs.

@ywelsch
Contributor

ywelsch commented May 7, 2019

Given that this is 6.x only, and relates to a transport implementation that is no longer used in newer branches, I'm inclined to close this here. If it recurs more often, we can revisit.
