Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SearchGroupsResolverInMemoryTests leaking threads #80305

Closed
gwbrown opened this issue Nov 3, 2021 · 16 comments · Fixed by #81197 or #81568
Closed

SearchGroupsResolverInMemoryTests leaking threads #80305

gwbrown opened this issue Nov 3, 2021 · 16 comments · Fixed by #81197 or #81568
Assignees
Labels
:Security/Security Security issues without another label Team:Security Meta label for security team >test-failure Triaged test failures from CI

Comments

@gwbrown
Copy link
Contributor

gwbrown commented Nov 3, 2021

Build scan: sample scan

Repro line: None given

Reproduces locally?: n/a, see above

Applicable branches: 8.0, master

Failure history: build-stats - looks like failures with this error go back to Oct. 18 at least.

Failure excerpt:

	com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.elasticsearch.xpack.security.authc.ldap.SearchGroupsResolverInMemoryTests: 	
	   1) Thread[id=124, name=Timer thread for LDAPConnection(name='test-connection-testSearchTimeoutIsFailure', not connected), state=WAITING, group=TGRP-SearchGroupsResolverInMemoryTests]	
	        at [email protected]/java.lang.Object.wait(Native Method)	
	        at [email protected]/java.lang.Object.wait(Object.java:338)	
	        at [email protected]/java.util.TimerThread.mainLoop(Timer.java:537)	
	        at [email protected]/java.util.TimerThread.run(Timer.java:516)	
		at __randomizedtesting.SeedInfo.seed([42DA9C1586F63FD5]:0)	

@gwbrown gwbrown added >test-failure Triaged test failures from CI :Security/Security Security issues without another label labels Nov 3, 2021
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Nov 3, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@tvernum tvernum self-assigned this Nov 5, 2021
@tvernum
Copy link
Contributor

tvernum commented Nov 5, 2021

This one has me stumped so far.
I cannot get it to fail locally despite thousands of executions. I don't know what to make of that.

I suspect it's a real issue (though maybe only in test code), but I cannot make sense of it.

The Timer thread is blocked on Object.wait
That object should get a notify when the timer is cancelled, which should happen when the connection is closed, which should happen @After the test completes.

I cannot work out which of those steps is failing, and since I can't reproduce it, I'm having trouble getting any diagnostic on it.
I suspect the next step will be to try and reproduce on a CI worker. If I can get it failing there I can possibly put in a patched version of the LDAP SDK to try and diagnose where the cancel is breaking down.

@tvernum
Copy link
Contributor

tvernum commented Nov 8, 2021

Update: I successfully reproduced the error after 100 attempts on a Debian 10 CI Worker.
I'm now retrying with a patched version of the test and the LDAP SDK that will give more detail when it fails. However, after more than 600 attempts I have not had a failure with the patched version 😿

@tvernum
Copy link
Contributor

tvernum commented Nov 9, 2021

Update: After 3000 executions it failed.

It appears to be some form of locking contention (possibly a deadlock, but I haven't gotten deep enough to know).
When we try to close the connection, it attempts to obtain a lock on itself before it clears the Timer. It seems to be blocking on synchronized(this), so it can never clear the timer. (it wasn't)

I've added more logging and will keep trying.

@henningandersen
Copy link
Contributor

Had a PR build fail with this leak error today.

@pugnascotia
Copy link
Contributor

@joegallo
Copy link
Contributor

@tvernum
Copy link
Contributor

tvernum commented Dec 1, 2021

I think I've tracked it down and raised an upstream issue pingidentity/ldapsdk#120 (it looks like a real, though very infrequent thread leak in LDAP SDK).

I'll see if there's something we can do locally to avoid this until we can get a fix from upstream.

elasticsearchmachine pushed a commit that referenced this issue Dec 2, 2021
LDAP SDK has a race condition where closing a connection while an async
search is still executing could lead to a Timer thread being orphaned.
See: pingidentity/ldapsdk#120 This commit
changes SearchGroupsResolverInMemoryTests so that it waits for the
pending async search to complete (or timeout) before returning. This
ensures that when the close the connection the Timer thread is cancelled
(and stays cancelled). Resolves: #80305
tvernum added a commit to tvernum/elasticsearch that referenced this issue Dec 2, 2021
LDAP SDK has a race condition where closing a connection while an async
search is still executing could lead to a Timer thread being orphaned.
See: pingidentity/ldapsdk#120 This commit
changes SearchGroupsResolverInMemoryTests so that it waits for the
pending async search to complete (or timeout) before returning. This
ensures that when the close the connection the Timer thread is cancelled
(and stays cancelled). Resolves: elastic#80305
elasticsearchmachine pushed a commit that referenced this issue Dec 2, 2021
LDAP SDK has a race condition where closing a connection while an async
search is still executing could lead to a Timer thread being orphaned.
See: pingidentity/ldapsdk#120 This commit
changes SearchGroupsResolverInMemoryTests so that it waits for the
pending async search to complete (or timeout) before returning. This
ensures that when the close the connection the Timer thread is cancelled
(and stays cancelled). Resolves: #80305
@iverase
Copy link
Contributor

iverase commented Dec 2, 2021

There was another failure for this test:

https://gradle-enterprise.elastic.co/s/p2nvftexltoik

@iverase iverase reopened this Dec 2, 2021
@przemekwitek
Copy link
Contributor

Another failure from today:
https://gradle-enterprise.elastic.co/s/fghfaftsjuneq

@tvernum
Copy link
Contributor

tvernum commented Dec 7, 2021

I don't think there's anyway to reliably work around the upstream issue. I'm going to have to mute the test until there's a new release of the SDK.

tvernum added a commit to tvernum/elasticsearch that referenced this issue Dec 9, 2021
The new release contains fixes for leaking threads (see elastic#80305) and
bias in round robin server sets, both of which are relevant to
Elasticsearch security.

Resolves: elastic#80305
tvernum added a commit that referenced this issue Dec 9, 2021
The new release contains fixes for leaking threads (see #80305) and
bias in round robin server sets, both of which are relevant to
Elasticsearch security.

Resolves: #80305
tvernum added a commit to tvernum/elasticsearch that referenced this issue Dec 9, 2021
The new release contains fixes for leaking threads (see elastic#80305) and
bias in round robin server sets, both of which are relevant to
Elasticsearch security.

Resolves: elastic#80305
@andreidan
Copy link
Contributor

I believe this is still failing in 8.0
https://gradle-enterprise.elastic.co/s/mvs3hpjghupym

@andreidan andreidan reopened this Dec 21, 2021
@jkakavas
Copy link
Member

we just need to merge #81581, I kicked the CI tires again

@tvernum
Copy link
Contributor

tvernum commented Dec 21, 2021

Sorry about the delayed backport. Log4j, and all that jazz.

elasticsearchmachine pushed a commit that referenced this issue Dec 23, 2021
The new release contains fixes for leaking threads (see #80305) and
bias in round robin server sets, both of which are relevant to
Elasticsearch security.

Resolves: #80305

Co-authored-by: Elastic Machine <[email protected]>
@tvernum
Copy link
Contributor

tvernum commented Dec 23, 2021

Resolved by #81581

@tvernum tvernum closed this as completed Dec 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Security/Security Security issues without another label Team:Security Meta label for security team >test-failure Triaged test failures from CI
Projects
None yet