[CI] NestedObjectMapperTests fail with leaked threads #37543
Comments
Pinging @elastic/es-search
Cannot reproduce it locally after many thousands of iterations.
I'm recategorizing this as a distributed issue and reaching out to the distributed team for help with it. Here's what I know so far: this issue and the two linked issues result from leaked mock-transport network threads (the __mock_network_thread* thread-leak errors tracked in the query below). Incidents of this issue spiked during some of the infra difficulties in January and have since tapered off. That, and the fact that the relevant class is no longer in master, might be enough to say we should just close this issue, but I'd like a second opinion on that. You can see the trend in the relevant failures here: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-60d,mode:quick,to:now))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'com.carrotsearch.randomizedtesting.ThreadLeakError%20AND%20__mock_network_thread*'),sort:!(process.time-start,desc))
I'm happy to help out whoever wants to pick this up, but at this point I think it needs someone more familiar with the relevant sections of code than I am.
Pinging @elastic/es-distributed
The log file from the failed build indicates that Node.stop took 20 seconds.
I think this comes from MockTransport.stopInternal: ThreadPool.terminate will wait twice for the timeout, trying to shut down nicely first and then using shutdownNow.
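For context on the double wait described above, here is a minimal sketch of the standard two-phase ExecutorService shutdown idiom that ThreadPool.terminate is described as following. The class, method, and timeout values below are illustrative, not the actual Elasticsearch code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TwoPhaseShutdown {

    // Graceful first, forceful second: each phase waits up to the full
    // timeout, so a stuck worker can delay shutdown by roughly 2x the
    // timeout (e.g. ~20 seconds with an illustrative 10-second timeout).
    static boolean terminate(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // stop accepting new tasks, let running ones finish
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt whatever is still running
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("doing work"));
        System.out.println("terminated cleanly: " + terminate(pool, 10, TimeUnit.SECONDS));
    }
}
```

A thread that never reacts to interruption will still be left behind after the second phase, which is consistent with the leaked-thread failures reported here.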
There are three main ways this could potentially happen (TcpTransport/MockTcpTransport).
Option 3 sounds unlikely; I tried provoking it but was not successful. We should notice that we do get to stopping the transport, and there is a good indication that the problem is the current test's transport, given the 20-second wait time noted above. Also, it does not fail exceptionally, since we get the Stopped message and we do not get any "uncaught exception" logging entries. As also noted by Mark, master no longer uses MockTcpTransport, so it seems likely this will not appear there. I will open a PR that does the following:
This will be backported to both 6.x and 6.6, since this is the only place we still see these errors occasionally. Notice that the specific tests this affects vary greatly, and failures appear in both 6.x and 6.6.
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor that we do not use when reading from it. This is potentially unsafe if there is no other happens-before relationship ensuring the safety of this.
2. Long shot, but if the thread pool was shut down before entering this code, we would silently forget about closing server channels, so an assert was added.
Relates to CI failure issue: elastic#37543
Further improve the locking around bind and stop to ensure that once the transport stops and closes serverChannels, no new channels can be added. Relates to CI failure issue: elastic#37543
Simplify locking around serverChannels a bit. Relates to CI failure issue: elastic#37543
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor that we do not use when reading from it. This is potentially unsafe if there is no other happens-before relationship ensuring the safety of this.
2. Long shot, but if the thread pool was shut down before entering this code, we would silently forget about closing server channels, so an assert was added.
Strengthened the locking to ensure that once we stop the transport, no new server channels can be made.
Relates to CI failure issue: #37543
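To make the two fixes above concrete, here is a hedged sketch of the pattern the commit messages describe: additions to serverChannels and the closing loop share a single monitor, and a stopped flag rejects channels bound after stop. The class, field, and method names are illustrative stand-ins, not the actual MockTcpTransport code:

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;

public class ServerChannelRegistry implements Closeable {

    private final Object mutex = new Object();
    private final List<ServerSocket> serverChannels = new ArrayList<>(); // guarded by mutex
    private boolean stopped; // guarded by mutex

    // Registers a newly bound server channel. Once the registry is stopped,
    // the channel is closed immediately instead of being added, so no
    // accept thread can outlive the transport.
    public boolean register(ServerSocket channel) throws IOException {
        synchronized (mutex) {
            if (stopped) {
                channel.close();
                return false;
            }
            serverChannels.add(channel);
            return true;
        }
    }

    // Closes every registered channel under the same monitor used for
    // adding, so this thread is guaranteed to see all prior additions
    // (the monitor provides the happens-before edge).
    @Override
    public void close() throws IOException {
        synchronized (mutex) {
            stopped = true;
            for (ServerSocket channel : serverChannels) {
                channel.close();
            }
            serverChannels.clear();
        }
    }
}
```

With both paths synchronized on the same monitor, a channel bound concurrently with stop is either closed by close() or closed and rejected by register(), which is the invariant the "no new channels can be added once the transport stops" commit aims for.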
@henningandersen can we close this now that #39031 is merged?
@ywelsch, unfortunately not, since we had three test failures with leaked threads last week. Will continue investigating based on the logs.
Given that this is 6.x only, and an issue related to a transport implementation that is no longer used in newer branches, I'm inclined to close this here. If it recurs more often, we can revisit.
This doesn't reproduce locally but has failed 5 times in the last week. It started failing on January 9th, but I don't see any relevant recent changes in the code.
https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2FM,mode:quick,to:now%2FM))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:NestedObjectMapperTests),sort:!(time,desc))
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1035 is an example