
[CI] NestedObjectMapperTests fail with leaked threads #37543

Closed
jkakavas opened this issue Jan 16, 2019 · 10 comments
@jkakavas
Member

This doesn't reproduce locally, but it has failed 5 times in the last week. It started failing on January 9th, yet I don't see any relevant recent changes in the code.

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2FM,mode:quick,to:now%2FM))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:NestedObjectMapperTests),sort:!(time,desc))

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1035 is an example

ERROR   0.00s J2 | NestedObjectMapperTests (suite) <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE scope at org.elasticsearch.index.mapper.NestedObjectMapperTests: 
   >    1) Thread[id=927, name=elasticsearch[node_s_0][__mock_network_thread][T#1], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   >    2) Thread[id=928, name=elasticsearch[node_s_0][__mock_network_thread][T#2], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   > 	at __randomizedtesting.SeedInfo.seed([A00E57EA1BC77F08]:0)Throwable #2: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   >    1) Thread[id=927, name=elasticsearch[node_s_0][__mock_network_thread][T#1], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   >    2) Thread[id=928, name=elasticsearch[node_s_0][__mock_network_thread][T#2], state=RUNNABLE, group=TGRP-NestedObjectMapperTests]
   >         at java.net.PlainSocketImpl.socketAccept(Native Method)
   >         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
   >         at java.net.ServerSocket.implAccept(ServerSocket.java:545)
   >         at java.net.ServerSocket.accept(ServerSocket.java:513)
   >         at org.elasticsearch.mocksocket.MockServerSocket.access$101(MockServerSocket.java:32)
   >         at org.elasticsearch.mocksocket.MockServerSocket.lambda$accept$0(MockServerSocket.java:53)
   >         at org.elasticsearch.mocksocket.MockServerSocket$$Lambda$595/2121068139.run(Unknown Source)
   >         at java.security.AccessController.doPrivileged(Native Method)
   >         at org.elasticsearch.mocksocket.MockServerSocket.accept(MockServerSocket.java:53)
   >         at org.elasticsearch.transport.MockTcpTransport$MockChannel.accept(MockTcpTransport.java:290)
   >         at org.elasticsearch.transport.MockTcpTransport$1.doRun(MockTcpTransport.java:135)
   >         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >         at java.lang.Thread.run(Thread.java:748)
   > 	at __randomizedtesting.SeedInfo.seed([A00E57EA1BC77F08]:0)
@jkakavas jkakavas added :Search Foundations/Mapping Index mappings, including merging and defining field types >test-failure Triaged test failures from CI labels Jan 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@matriv
Contributor

matriv commented Feb 4, 2019

I cannot reproduce this locally after many thousands of iterations.
Here is a reproduce command found in the Kibana logs:

./gradlew :server:unitTest -Dtests.seed=C593D66EF8528D36 -Dtests.class=org.elasticsearch.index.mapper.NestedObjectMapperTests -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=Etc/UTC -Dcompiler.java=11 -Druntime.java=8

@matriv matriv removed their assignment Feb 4, 2019
@not-napoleon not-napoleon self-assigned this Feb 5, 2019
@not-napoleon
Member

I've done some digging into this one, and I'm pretty sure it's the same issue as in #37469 and #32369. I haven't yet traced where the specific leaked threads are created, or why they only sometimes leak.

@not-napoleon
Member

I'm recategorizing this as a distributed issue, and reaching out to the distributed team for help with it. Here's what I know so far:

This issue and the two linked issues result from threads created in the MockTcpTransport test shim class that do not exit properly during node shutdown. That class was removed from master in #36628 but still exists in the 6.x line. The threads are cleaned up as part of node shutdown in ESSingleNodeTestCase, so I don't think there's anything to be gained by muting specific instances of this issue. By the same logic, I don't think the issue is specific to NestedObjectMapperTests.

Incidents of this issue spiked during some of the infra difficulties in January, and have since tapered off. That and the fact that the relevant class is no longer in master might be enough to say we should just close this issue, but I'd like a second opinion on that. You can see the trend in the relevant failures here: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-60d,mode:quick,to:now))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'com.carrotsearch.randomizedtesting.ThreadLeakError%20AND%20__mock_network_thread*'),sort:!(process.time-start,desc))

I'm happy to help out whoever wants to pick this up, but at this point I think it needs someone more familiar with the relevant sections of code than I am.

@not-napoleon not-napoleon added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels Feb 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@not-napoleon not-napoleon removed their assignment Feb 14, 2019
@henningandersen
Contributor

henningandersen commented Feb 15, 2019

The log file from:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+multijob-unix-compatibility/os=debian-9/110/console

indicates that Node.stop took 20 seconds:

  1> [2019-02-11T23:57:06,163][INFO ][o.e.n.Node               ] [suite] stopping ...
  1> [2019-02-11T23:57:26,169][INFO ][o.e.n.Node               ] [suite] stopped

I think this comes from MockTcpTransport.stopInternal:

    @Override
    protected void stopInternal() {
        ThreadPool.terminate(executor, 10, TimeUnit.SECONDS);
        synchronized (openChannels) {
            assert openChannels.isEmpty() : "there are still open channels: " + openChannels;
        }
    }

ThreadPool.terminate will wait for the timeout twice: it first tries to shut down gracefully, and then falls back to shutdownNow.
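
As a rough illustration, here is a minimal sketch of the two-phase pattern described above (not the actual ThreadPool.terminate code): shut down gracefully and wait up to the timeout, then call shutdownNow and wait up to the timeout again. With the 10-second timeout used in stopInternal, accept threads that never exit would account for roughly the 20 seconds observed between "stopping ..." and "stopped".

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    final class TwoPhaseTerminate {

        /**
         * Sketch of a two-phase terminate: a graceful shutdown() followed by a
         * forced shutdownNow(), each waiting up to the given timeout. A thread
         * stuck in a blocking accept() that ignores interruption can therefore
         * hold the pool for roughly twice the timeout before the caller gives up.
         */
        static boolean terminate(ExecutorService executor, long timeout, TimeUnit unit) {
            executor.shutdown();                                   // stop accepting new tasks
            try {
                if (executor.awaitTermination(timeout, unit)) {
                    return true;                                   // everything finished in time
                }
                executor.shutdownNow();                            // interrupt whatever is still running
                return executor.awaitTermination(timeout, unit);   // second wait for the same timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                executor.shutdownNow();
                return false;
            }
        }
    }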

@henningandersen
Contributor

henningandersen commented Feb 18, 2019

There are three main ways this could potentially happen (TcpTransport/MockTcpTransport):

  1. The synchronization around serverChannels is incomplete, so close might not see all channels.
  2. The generic() thread pool is shut down before we get to doStop.
  3. close() on the socket does not break accept() (a JVM bug).

Option 3 sounds unlikely; I tried to provoke it but was not successful.
For option 2, I could not find any indication of this. Also, we would then have seen a 30-second wait on the latch in doStop, not a 20-second wait.
Option 1 also seems unable to happen, since everything around starting and stopping runs sequentially, with happens-before relationships established by Future.get, executor.execute, etc.
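
To make that last point concrete, here is a purely illustrative sketch (hypothetical names, not the TcpTransport code) of why a write made on the binding thread is visible to the stopping thread without extra locking: submitting a task to an executor and later calling Future.get() on it establishes the necessary happens-before edges.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public final class HappensBeforeSketch {

        // Plain, unsynchronized map standing in for something like serverChannels.
        private final Map<String, Object> channels = new HashMap<>();

        void bindThenStop(ExecutorService executor) throws Exception {
            // The "bind" work runs on another thread and mutates the map ...
            Future<?> bound = executor.submit(() -> {
                channels.put("default", new Object());
            });

            // ... but Future.get() establishes a happens-before edge, so the thread
            // doing the stop is guaranteed to see the entry added above.
            bound.get();

            // Safe to read here without extra synchronization: the write is visible.
            channels.forEach((profile, channel) -> System.out.println("closing " + profile));
        }

        public static void main(String[] args) throws Exception {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            new HappensBeforeSketch().bindThenStop(executor);
            executor.shutdown();
        }
    }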

Note that we do get as far as stopping the transport, and the 20-second wait noted above is a strong indication that the problem lies in the current test's transport. Shutdown also does not fail exceptionally: we get the "stopped" message, and there are no "uncaught exception" log entries from QueueUncaughtExceptionsHandler.

As Mark also noted, master no longer uses MockTcpTransport, so it seems likely this will not appear there.

I will open a PR that does the following:

  1. Assert that the generic() thread pool is not shut down when we enter doStop().
  2. Add synchronization around reads of the serverChannels map (see the sketch below).

This will be backported to both 6.x and 6.6, since those are the only branches where we still see these errors occasionally.

Note that the specific tests affected vary greatly; failures have been found in both ESSingleNodeTestCase- and ESIntegTestCase-style tests.
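
The following is a hedged sketch of the kind of change item 2 describes, using hypothetical names rather than the actual PR: reads and writes of the server-channel map go through the same monitor, and once the transport has stopped, no new channels can be registered.

    import java.io.Closeable;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.HashMap;
    import java.util.Map;

    public final class ServerChannelRegistry {

        // Guarded by "this": writers (bind) and readers (stop) use the same monitor.
        private final Map<String, Closeable> serverChannels = new HashMap<>();
        private boolean stopped;

        /** Register a newly bound channel; refuse it if the transport has already stopped. */
        public synchronized void register(String profile, Closeable channel) throws IOException {
            if (stopped) {
                channel.close();
                throw new IllegalStateException("transport stopped, cannot bind profile [" + profile + "]");
            }
            serverChannels.put(profile, channel);
        }

        /** Stop the transport: under the same monitor, close every channel and block new registrations. */
        public synchronized void stop() {
            stopped = true;
            IOException firstFailure = null;
            for (Map.Entry<String, Closeable> entry : serverChannels.entrySet()) {
                try {
                    entry.getValue().close();
                } catch (IOException e) {
                    if (firstFailure == null) {
                        firstFailure = e;   // keep closing the rest, remember the first failure
                    }
                }
            }
            serverChannels.clear();
            if (firstFailure != null) {
                throw new UncheckedIOException("failed to close at least one server channel", firstFailure);
            }
        }
    }

Using one monitor for both sides means a stop that races with a late bind either sees the new channel and closes it, or makes the bind fail fast, so no server channel (and no accept thread) is left behind.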

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long-shot but if the thread pool was shutdown before entering this
code, we would silently forget about closing server channels so added
assert.

Relates to CI failure issue: elastic#37543
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Further improve the locking around bind and stop to ensure that once
transport stops and closes serverChannels, no new channels can be added.

Relates to CI failure issue: elastic#37543
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Feb 18, 2019
Simplify locking around serverChannels a bit.

Relates to CI failure issue: elastic#37543
henningandersen added a commit that referenced this issue Feb 18, 2019
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long-shot but if the thread pool was shutdown before entering this
code, we would silently forget about closing server channels so added
assert.

Strengthened the locking to ensure that once we stop the transport, no
new server channels can be made.

Relates to CI failure issue: #37543
@ywelsch
Contributor

ywelsch commented Mar 8, 2019

@henningandersen can we close this now as #39031 is merged?

@henningandersen
Contributor

@ywelsch, unfortunately not, since we had three test failures with leaked threads last week. I will continue investigating based on the logs.

@ywelsch
Contributor

ywelsch commented May 7, 2019

Given that this is 6.x only, and relates to a transport implementation that is no longer used in newer branches, I'm inclined to close this here. If it recurs more often, we can revisit.
