Improper thread detaching causes deadlock in LDR on Windows 10+ #1479
Comments
OK, I can't say I understand all this completely but here it is: Looking at the ReactOS sources, I think we should try using fibers support |
IMHO you should only satisfy the expectations of the Invocation API and make sure that the thread is already detached from the VM on all control paths (including exception handling) before the CRT calls FreeLibraryAndExitThread(). |
@wilx Panama also uses thread-local destructors to manage thread detachment, e.g. see here. This is done at the language level, which I think could be considered now? Just be careful that, however it works, it doesn't end up doing exactly what is happening now, e.g. this. @apavlyutkin using TLS destructors for thread detachment should be in line with the requirements AFAIK. Having the lock in DllMain is perhaps less than ideal! Unlike Panama, JNA does at least allow this behaviour to be customized and supports explicit detachment (something I suggested Panama should do too). The detachment logic hit here was meant as a fallback. This problem is possibly best addressed in your customer's application or its libraries. I'd be interested to see more of that stack trace: what is the thread that is deadlocked in the VM with the detaching thread actually trying to do here? |
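To make the "explicit detachment first, TLS destructor only as a fallback" idea concrete, here is a minimal sketch. It is not JNA's code: it uses POSIX thread-specific data so it runs without a JVM, every function name is hypothetical, and the two counters merely stand in for a real `DetachCurrentThread` call. On Windows the fallback discussed in this issue runs from `DllMain` instead.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names throughout; the counters stand in for a real
   DetachCurrentThread call made explicitly or from the fallback. */
static pthread_key_t attach_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static int explicit_detaches, fallback_detaches;

/* TLS destructor: runs at thread exit only if the slot is still non-NULL,
   i.e. only if the thread never detached explicitly. */
static void fallback_detach(void *state) {
    free(state);
    fallback_detaches++;
}

static void make_key(void) {
    pthread_key_create(&attach_key, fallback_detach);
}

static void attach_current_thread(void) {
    pthread_once(&key_once, make_key);
    if (pthread_getspecific(attach_key) == NULL)
        pthread_setspecific(attach_key, malloc(1));
}

/* Explicit detach: clears the slot so the destructor will not fire later. */
static void detach_current_thread(void) {
    void *state = pthread_getspecific(attach_key);
    if (state != NULL) {
        pthread_setspecific(attach_key, NULL);
        free(state);
        explicit_detaches++;
    }
}

static void *native_thread(void *arg) {
    bool detach_explicitly = *(bool *)arg;
    attach_current_thread();
    /* ... callbacks into Java would run here ... */
    if (detach_explicitly)
        detach_current_thread();
    return NULL;
}

/* Runs one native thread to completion, including its TLS destructors. */
void run_native_thread(bool detach_explicitly) {
    pthread_t t;
    pthread_create(&t, NULL, native_thread, &detach_explicitly);
    pthread_join(t, NULL);
}
```

A thread that calls `detach_current_thread()` itself never reaches the fallback, which is the property an application would want here: the detach then happens on an ordinary control path rather than during thread teardown.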
Here is a sample of the deadlocked VM thread
The frames at the bottom are Java frames. So, should I tell the customer that detaching the thread is their responsibility? Thank you |
Are you sure that's the thread blocked trying to reach a safepoint?
I cannot speak for the JNA project. I am giving my opinion as the developer of the library your customer is using on top of JNA (he posted on the JNA mailing list), and as the person who originally requested this feature in JNA (for that library amongst others). |
I still don't see, in the provided call stacks, the lock order reversal that is necessary for a deadlock to happen. Does anyone else? Maybe your JRE vendor could provide the call stacks for all the threads in the process? |
I am in fact the JRE vendor's representative, but I'm not sure whether I can publish the full stack trace. Let me check with the customer. The deadlock is completed here. This is the VM thread running safepoint synchronization. The loop waits until all Java threads reach the safepoint, but that does not happen. There are still 2 threads that prevent the synchronization from completing. The assumption (@neilcsmith-net you're right, this is just an assumption) is that these threads are stalled in LDR, which in turn stays locked by the detaching thread until the safepoint operation ends. OK, I'm going to create a diagnostic JRE build that logs the "still running" threads and preserves them in the crash dump, and ask the customer to capture another dump. Thank you |
@apavlyutkin my assumption is based on the fact that a thread in native code through a JNI call should normally be at a safepoint already?! |
Right, the native thread is already at the safepoint, that's OK. It stays waiting for the sync to complete before it dies. The problem is that the thread still holds the LDR lock, so OTHER threads trying to load/release DLLs at that moment are blocked and cannot be synced, so the safepoint sync cannot complete. Look at the stack: is it typical to see so many threads frozen in |
@apavlyutkin I was talking about the JVM thread in the There's also a number of threads in |
I finally got the results from a diagnostic build that logs the threads the safepoint is waiting for. All is as expected: the safepoint cannot sync 25710
because it hangs in LDR
and the second sample
I still think that the issue must be fixed in JNA. Detaching a thread from the callback may be in line with the requirements, but requirements tend to become obsolete like everything else. Beginning with Windows 10 it may cause LDR deadlocks and therefore makes no sense |
Well, I'm sure the current JNA team would appreciate your contributed fix, along with all required tests! Of course, the options for alternative thread-local storage behaviour are minimal, but A few points (I don't intend to follow up further on this here)
|
JNA: any
JVM: any OpenJDK based
OS: Windows 10+
Recently I analyzed a hang in a customer application raised against Zulu11.54+26-SA (the same customer had previously complained about a jdk-8 based JVM build). The dump for the issue had the following stack trace
and I found the code causing this stack trace at callback.c:587
The Invocation API states explicitly
The problem here is that a thread being detached cannot be considered completely done. If the JVM is running a safepoint synchronization at that moment, the thread gets blocked and waits until all other threads reach the safepoint (see the call stack). That may still have worked before Windows 10 introduced a parallel algorithm for DLL loading/releasing, but now... I do not have access to the Windows source code, but the ReactOS project shows that LDR calls
DllMain(..., DLL_THREAD_DETACH, ...)
while holding a critical section that locks the LDRP workers, so other threads loading/releasing DLLs at that moment get blocked just like
and cannot get to the safepoint rendezvous. We get the deadlock.
We provided a workaround to the customer, but the changes are very tricky and risky, and they touch JDK logic that has been unchanged for years, so there is no chance of pushing the changes to OpenJDK. Please fix the issue on your side. Thank you