Improper thread detaching causes deadlock in LDR on Windows 10+ #1479

Open
apavlyutkin opened this issue Nov 15, 2022 · 12 comments

Comments

@apavlyutkin

JNA: any
JVM: any OpenJDK based
OS: Windows 10+

Recently I analyzed a hang in a customer application raised against Zulu11.54+26-SA (earlier the same customer had complained about a JDK 8 based JVM build). The dump for the issue had the following stack trace:

.306  Id: 27f4.3f70 Suspend: 0 Teb: 00000000`54696000 Unfrozen "foo"
      Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffb`6a092650)
      Priority: 0  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000001`7f60e698 00007ffb`6a458933     ntdll!NtWaitForSingleObject+0x14
01 00000001`7f60e6a0 00007ffb`3bc1226f     KERNELBASE!WaitForSingleObjectEx+0x93
02 00000001`7f60e740 00007ffb`3bbe61dc     jvm!os::PlatformEvent::park+0x8f [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\os\windows\os_windows.cpp @ 5370]
03 (Inline Function) --------`--------     jvm!ParkCommon+0x8 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\mutex.cpp @ 404]
04 00000001`7f60e770 00007ffb`3bbe6a81     jvm!Monitor::ILock+0x5c [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\mutex.cpp @ 465]
05 (Inline Function) --------`--------     jvm!Monitor::lock_without_safepoint_check+0x8 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\mutex.cpp @ 938]
06 00000001`7f60e7a0 00007ffb`3bc7ad23     jvm!Monitor::lock_without_safepoint_check+0x31 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\mutex.cpp @ 944]
07 00000001`7f60e7d0 00007ffb`3bc7d328     jvm!SafepointSynchronize::block+0x173 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\safepoint.cpp @ 939]
08 00000001`7f60e810 00007ffb`3bd5daa1     jvm!SafepointMechanism::block_if_requested_slow+0x18 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\safepointmechanism.cpp @ 91]
09 (Inline Function) --------`--------     jvm!SafepointMechanism::block_if_requested+0x1f [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\safepointmechanism.inline.hpp @ 62]
0a 00000001`7f60e840 00007ffb`3b794f08     jvm!JavaThread::check_safepoint_and_suspend_for_native_trans+0xe1 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\thread.cpp @ 2524]
0b 00000001`7f60fb20 00007ffb`3ba266a5     jvm!ThreadStateTransition::transition_from_native+0x88 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\runtime\interfacesupport.inline.hpp @ 175]
0c 00000001`7f60fb50 00007ffb`446e1115     jvm!jni_DetachCurrentThread+0x65 [c:\jenkins\workspace\zulu11-build-win64\zulu-src.git\src\hotspot\share\prims\jni.cpp @ 4359]
0d 00000001`7f60fb80 00007ffb`446e11ef     jna6833526131579377318!Java_com_sun_jna_Native_getDirectByteBuffer__Lcom_sun_jna_Pointer_2JJJ+0x2665
0e 00000001`7f60fbe0 00007ffb`446e956f     jna6833526131579377318!Java_com_sun_jna_Native_getDirectByteBuffer__Lcom_sun_jna_Pointer_2JJJ+0x273f
0f 00000001`7f60fc20 00007ffb`6de95d37     jna6833526131579377318!Java_com_sun_jna_Native_getDirectByteBuffer__Lcom_sun_jna_Pointer_2JJJ+0xaabf
10 00000001`7f60fc80 00007ffb`6de933f4     ntdll!LdrpCallInitRoutine+0x6f
11 00000001`7f60fcf0 00007ffb`6deca33e     ntdll!LdrShutdownThread+0x154
12 00000001`7f60fdf0 00007ffb`6a49992a     ntdll!RtlExitUserThread+0x3e
13 00000001`7f60fe30 00007ffb`6a092794     KERNELBASE!FreeLibraryAndExitThread+0x4a
14 00000001`7f60fe60 00007ffb`6a0e0809     ucrtbase!common_end_thread+0xa4
15 00000001`7f60fe90 00007ffb`4593f72c     ucrtbase!endthreadex+0x9
16 00000001`7f60fec0 00007ffb`6a09268a     glib_2_0_0!g_rw_lock_writer_unlock+0x33c
17 00000001`7f60ff00 00007ffb`6b0b7974     ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x3a
18 00000001`7f60ff30 00007ffb`6deca2d1     kernel32!BaseThreadInitThunk+0x14
19 00000001`7f60ff60 00000000`00000000     ntdll!RtlUserThreadStart+0x21

and I found the code causing this stack trace at callback.c:587:

static void dispose_thread_data(void* data) {
  thread_storage* tls = (thread_storage*)data;
  JavaVM* jvm = tls->jvm;
  JNIEnv* env;
  int is_attached = (*jvm)->GetEnv(jvm, (void*)&env, JNI_VERSION_1_4) == JNI_OK;
  jboolean detached = JNI_TRUE;
  if (is_attached) {
    /* DetachCurrentThread() may block in a safepoint rendezvous; here it
       runs from DllMain(DLL_THREAD_DETACH) with the loader lock held. */
    if ((*jvm)->DetachCurrentThread(jvm) != 0) {
      fprintf(stderr, "JNA: could not detach native thread (automatic)\n");
      detached = JNI_FALSE;
    }
  }
  if (tls->termination_flag && detached) {
    *(tls->termination_flag) = JNI_TRUE;
  }
  free(data);
}

BOOL WINAPI DllMain(HINSTANCE hDLL, DWORD fdwReason, LPVOID lpvReserved) {
  switch (fdwReason) {
  ...
  case DLL_THREAD_DETACH: {
    /* Called by the loader (LdrShutdownThread) while it holds the loader lock. */
    thread_storage* tls = (thread_storage*)TlsGetValue(tls_thread_data_key);
    if (tls) {
      dispose_thread_data(tls);
      TlsSetValue(tls_thread_data_key, 0);
    }
    break;
  }
  ...
  return TRUE;
}

The Invocation API states explicitly:

A native thread attached to the VM must call DetachCurrentThread() to detach itself BEFORE exiting.

The problem here is that detaching a thread cannot be treated as instantaneous. If the JVM is running a safepoint synchronization at that moment, the detaching thread takes a lock and waits until all other threads reach the safepoint (see the call stack above). That might still have worked before Windows 10 introduced its parallel DLL loading/unloading algorithm, but not any more. I do not have access to the Windows source code, but the ReactOS project shows us that the loader calls DllMain(..., DLL_THREAD_DETACH, ...) while holding the loader critical section:

NTSTATUS
NTAPI
LdrShutdownThread(VOID)
{
    ...
    /* Get the Ldr Lock */
    RtlEnterCriticalSection(&LdrpLoaderLock);
    ...
    /* Start at the end */
    ListHead = &Peb->Ldr->InInitializationOrderModuleList;
    NextEntry = ListHead->Blink;
    while (NextEntry != ListHead)
    {
        ...
                            LdrpCallInitRoutine(EntryPoint,
                                                 LdrEntry->DllBase,
                                                 DLL_THREAD_DETACH,
                                                 NULL);
        ...
    }
    ...
    /* Free TLS */
    LdrpFreeTls();
    RtlLeaveCriticalSection(&LdrpLoaderLock);
    ...
}

This locks out the LDRP workers, so other threads that are loading or unloading DLLs at that moment get blocked, just like these:

. 76  Id: 27f4.2de0 Suspend: 0 Teb: 00000000`0023e000 Unfrozen
      Start: mswsock!SockAsyncThread (00007ffb`696d0460)
      Priority: 1  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000000`415efd18 00007ffb`6de983f5     ntdll!NtWaitForSingleObject+0x14
01 00000000`415efd20 00007ffb`6de93338     ntdll!LdrpDrainWorkQueue+0x15d
02 00000000`415efd60 00007ffb`6deca33e     ntdll!LdrShutdownThread+0x98
03 00000000`415efe60 00007ffb`6a49992a     ntdll!RtlExitUserThread+0x3e
04 00000000`415efea0 00007ffb`696d0569     KERNELBASE!FreeLibraryAndExitThread+0x4a
05 00000000`415efed0 00007ffb`6b0b7974     mswsock!SockAsyncThread+0x109
06 00000000`415eff30 00007ffb`6deca2d1     kernel32!BaseThreadInitThunk+0x14
07 00000000`415eff60 00000000`00000000     ntdll!RtlUserThreadStart+0x21

.319  Id: 27f4.3e58 Suspend: 0 Teb: 00000000`54700000 Unfrozen
      Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffb`6a092650)
      Priority: 0  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000001`5204f898 00007ffb`6de983f5     ntdll!NtWaitForSingleObject+0x14
01 00000001`5204f8a0 00007ffb`6de935f7     ntdll!LdrpDrainWorkQueue+0x15d
02 00000001`5204f8e0 00007ffb`6dee8625     ntdll!LdrpInitializeThread+0x8b
03 00000001`5204f9c0 00007ffb`6dee8203     ntdll!LdrpInitialize+0x409
04 00000001`5204fa60 00007ffb`6dee81ae     ntdll!LdrpInitialize+0x3b
05 00000001`5204fa90 00000000`00000000     ntdll!LdrInitializeThunk+0xe

and cannot get to the safepoint rendezvous: the detaching thread holds the loader lock while waiting for the safepoint operation to finish, the safepoint operation waits for these threads, and these threads wait for the loader lock. We get a deadlock.

We provided a workaround to the customer, but the changes are tricky and risky, and they touch JDK logic that has stayed immutable for years, so there is no chance of pushing them to OpenJDK. Please fix the issue on your side. Thank you.

@wilx
Contributor

wilx commented Nov 15, 2022

OK, I can't say I understand all of this completely, but here it is: looking at the ReactOS sources, I think we should try using the fiber-local storage FlsAlloc() family of functions instead of the TlsAlloc() family. FlsAlloc() allows registering a cleanup callback, unlike TlsAlloc(). See https://github.com/reactos/reactos/blob/4dd734e92c1db15083e03b9d88e0cb8243386b7d/dll/ntdll/ldr/ldrinit.c#L1098-L1200 - this is where the loader lock is held. But the lock is not held for the FLS callbacks invoked later in the same function at https://github.com/reactos/reactos/blob/4dd734e92c1db15083e03b9d88e0cb8243386b7d/dll/ntdll/ldr/ldrinit.c#L1209-L1235
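
A minimal sketch of what that could look like, assuming JNA keeps its existing dispose_thread_data() routine; fls_thread_data_key, init_fls and store_thread_data are hypothetical names, and whether the FLS callback really runs outside the loader lock on every Windows version is exactly the point that would need verifying:

#include <windows.h>

static DWORD fls_thread_data_key = FLS_OUT_OF_INDEXES;

/* FLS cleanup callback: the OS invokes it at thread exit with the value
   stored via FlsSetValue(), without going through JNA's DllMain. */
static VOID WINAPI fls_dispose_callback(PVOID data) {
  if (data) {
    dispose_thread_data(data); /* existing JNA cleanup routine */
  }
}

/* One-time setup, e.g. on DLL_PROCESS_ATTACH. */
static BOOL init_fls(void) {
  fls_thread_data_key = FlsAlloc(fls_dispose_callback);
  return fls_thread_data_key != FLS_OUT_OF_INDEXES;
}

/* Replacement for the current TlsSetValue() call site. */
static BOOL store_thread_data(thread_storage* tls) {
  return FlsSetValue(fls_thread_data_key, tls);
}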

@apavlyutkin
Author

IMHO you should simply satisfy the expectation of the Invocation API and make sure that the thread is already detached from the VM on all control paths (including exception handling) before the CRT calls FreeLibraryAndExitThread().
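
For illustration, a minimal sketch of that pattern for a native thread attached via the Invocation API (worker and the JavaVM* plumbing are placeholders, not JNA code):

#include <jni.h>

/* Hypothetical worker started with _beginthreadex(): attach, do the
   work, and guarantee detachment on every control path before the
   routine returns. */
static unsigned __stdcall worker(void* arg) {
  JavaVM* jvm = (JavaVM*)arg;
  JNIEnv* env = NULL;
  if ((*jvm)->AttachCurrentThread(jvm, (void**)&env, NULL) != JNI_OK) {
    return 1;
  }
  /* ... call into Java, handle errors/exceptions ... */
  /* Detach while this is still ordinary user code, long before the
     CRT/loader thread-exit machinery (LdrShutdownThread) runs. */
  (*jvm)->DetachCurrentThread(jvm);
  return 0;
}

This way the thread never reaches LdrShutdownThread while still attached, and the DLL_THREAD_DETACH fallback has nothing left to do.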

@neilcsmith-net
Contributor

@wilx FlsAlloc() was my original suggestion for this (here and here). That was 10 years ago, and there were reasons for not using it then, I think partly to do with it not being supported everywhere it needed to be at the time? It's what I meant on the list when I said revisiting some of that discussion might be useful.

Panama also uses thread-local destructors to manage thread detachment - e.g. see here. This is done at the language level, which I think could be considered now? Except being careful that however it works doesn't end up doing exactly what is happening now - e.g. this.

@apavlyutkin using TLS destructors for thread detachment should be in line with the requirements AFAIK. Holding the loader lock while it happens in DllMain is perhaps less ideal! Unlike Panama, JNA does at least allow this behaviour to be customized, and it supports explicit detachment (something I suggested Panama should do too). The detachment logic hit here was meant as a fallback. This problem is possibly best addressed in your customer's application or its libraries.

I'd be interested to see more of that stack trace - what is the thread that is deadlocked in the VM with the detaching thread actually trying to do here?

@apavlyutkin
Author

apavlyutkin commented Nov 17, 2022

I'd be interested to see more of that stack trace - what is the thread that is deadlocked in the VM with the detaching thread actually trying to do here?

Here is a sample of a deadlocked VM thread:

. 66  Id: 27f4.1da4 Suspend: 0 Teb: 00000000`0021c000 Unfrozen
      Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)> (00007ffb`6a092650)
      Priority: 0  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000000`3b5bea68 00007ffb`6de983f5     ntdll!NtWaitForSingleObject+0x14
01 00000000`3b5bea70 00007ffb`6de9684a     ntdll!LdrpDrainWorkQueue+0x15d
02 00000000`3b5beab0 00007ffb`6de93238     ntdll!LdrpLoadDllInternal+0xc2
03 00000000`3b5beb30 00007ffb`6de92794     ntdll!LdrpLoadDll+0xa8
04 00000000`3b5bece0 00007ffb`6a43f3a1     ntdll!LdrLoadDll+0xe4
05 00000000`3b5bedd0 00007ffb`6c2670b6     KERNELBASE!LoadLibraryExW+0x161
06 00000000`3b5bee40 00007ffb`6df13514     user32!_ClientLoadLibrary+0xa6
07 00000000`3b5befe0 00007ffb`6a1c1464     ntdll!KiUserCallbackDispatcherContinue
08 00000000`3b5bf0e8 00007ffb`6c278a23     win32u!NtUserCallNextHookEx+0x14
09 00000000`3b5bf0f0 00007ffb`6c284656     user32!CallNextHookEx+0x163
0a 00000000`3b5bf140 00007ffb`6c282ae9     user32!DispatchHookW+0xf6
0b 00000000`3b5bf1d0 00007ffb`6c282a75     user32!CallHookWithSEH+0x29
0c 00000000`3b5bf220 00007ffb`6df13514     user32!__fnHkINLPMSG+0x55
0d 00000000`3b5bf280 00007ffb`6a1c1164     ntdll!KiUserCallbackDispatcherContinue
0e 00000000`3b5bf338 00007ffb`6c28409d     win32u!NtUserGetMessage+0x14
0f 00000000`3b5bf340 00007ffb`5651a733     user32!GetMessageW+0x2d
10 00000000`3b5bf3a0 00000000`0948f777     glass!Java_com_sun_glass_ui_win_WinApplication__1runLoop+0x63
11 00000000`3b5bf400 00000000`3b056e60     0x948f777
12 00000000`3b5bf408 00000000`3b5bfa50     0x3b056e60
13 00000000`3b5bf410 00000000`3b5bfa50     0x3b5bfa50
14 00000000`3b5bf418 00000000`00000000     0x3b5bfa50

The frames at the bottom are Java frames. So, should I respond to the customer that detaching the thread is their responsibility?

Thank you

@neilcsmith-net
Contributor

Are you sure that's the thread blocked trying to reach a safepoint?

So, should I respond to the customer that detaching the thread is their responsibility?

I cannot speak for the JNA project. My opinion is as developer of the library your customer is using on top of JNA (he posted on the JNA mailing list), and as the person who requested this feature in JNA originally (for that library amongst others).

@wilx
Contributor

wilx commented Nov 17, 2022

I still don't see the lock-order reversal, which is necessary for the deadlock to happen, in the provided call stacks. Does anyone else?

Maybe your JRE vendor could provide the call stacks for all the threads in the process?

@apavlyutkin
Author

I am actually the JRE vendor's representative, but I'm not sure I can publish the full stack trace. Let me check with the customer.

The deadlock completes here:

https://github.com/openjdk/jdk11u-dev/blob/1f173716812a570fe6b6eaf685ec90f66a840792/src/hotspot/share/runtime/safepoint.cpp#L289

This is the VM thread running the safepoint synchronization. The loop waits until all Java threads get to the safepoint, but that never happens:

[WinDbg screenshot: D__ZUC-932_initial_OwlView64.DMP - WinDbg 10.0.22621.1 AMD64, 11/18/2022 1:08:03 PM]

There are still 2 threads that prevent the synchronization from completing. The assumption (@neilcsmith-net you're right, this is just an assumption) is that these threads are stalled in LDR, which in turn stays locked by the detaching thread until the safepoint operation comes to an end.

https://github.com/openjdk/jdk11u-dev/blob/1f173716812a570fe6b6eaf685ec90f66a840792/src/hotspot/share/runtime/safepoint.cpp#L911

OK, I'm going to create a diagnostic JRE build that logs the "still running" threads and preserves them in the crash dump, and ask the customer to capture another dump. Thank you.

@neilcsmith-net
Contributor

neilcsmith-net commented Nov 18, 2022

@apavlyutkin my assumption is based on the fact that a thread in native code through a JNI call should normally be at a safepoint already?!

@apavlyutkin
Author

apavlyutkin commented Nov 30, 2022

@apavlyutkin my assumption is based on the fact that a thread in native code through a JNI call should normally be at a safepoint already?!

Right, the native thread is already at the safepoint, that part is OK. It waits for the sync to complete so it can die. The problem is that this thread still holds the lock over LDR, so OTHER threads trying to load/unload DLLs at that moment are blocked and cannot be synced, and therefore the safepoint sync can never complete.

Look at the stack

OwlView64.stack.txt

Is it typical to see so many threads frozen in ntdll!LdrpDrainWorkQueue+0x15d?

@neilcsmith-net
Contributor

@apavlyutkin I was talking about the JVM thread in the Java_com_sun_glass_ui_win_WinApplication__1runLoop call. That should be at a safepoint already, as it's in a JNI call. Surely you need another thread that is frozen in LdrpDrainWorkQueue but not already at a safepoint for this to be a problem? In which case, do you know which thread in here that is, and what it's trying to do?

There are also a number of threads in glib holding locks, some of which are in the VM, and at least one of which is also waiting on LdrpDrainWorkQueue. If your client wants help with the GStreamer/GLib interaction, he knows where to find us. There's a limit to how much wading through stack traces I'm doing in my free time! 😄

@apavlyutkin
Author

apavlyutkin commented Dec 6, 2022

Finally, I got the results from a diagnostic build that logs the threads the safepoint is waiting for. All is as expected: the safepoint cannot sync thread 25710

[12167.781s][info ][safepoint] Still running: thread =  000000019B135800; thread_id = 25710; thread_name = Keep-Alive-Timer
[12167.781s][trace][safepoint] Thread: 0x000000019b135800  [0x25710] State: _running _has_called_back 0 _at_poll_safepoint 0
[12167.783s][info ][safepoint] Still running: thread =  000000019B135800; thread_id = 25710; thread_name = Keep-Alive-Timer
[12167.783s][trace][safepoint] Thread: 0x000000019b135800  [0x25710] State: _running _has_called_back 0 _at_poll_safepoint 0
[12167.785s][info ][safepoint] Still running: thread =  000000019B135800; thread_id = 25710; thread_name = Keep-Alive-Timer
[12167.785s][trace][safepoint] Thread: 0x000000019b135800  [0x25710] State: _running _has_called_back 0 _at_poll_safepoint 0

because it hangs in LDR

.284  Id: 34530.25710 Suspend: 0 Teb: 00000000`d3d16000 Unfrozen
      Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void *),1> (00007ffa`bae61b70)
      Priority: 1  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000000`3b58f898 00007ffa`bd04ea52     ntdll!NtWaitForSingleObject+0x14
01 00000000`3b58f8a0 00007ffa`bd0077c3     ntdll!LdrpDrainWorkQueue+0x15e
02 00000000`3b58f8e0 00007ffa`bd064f84     ntdll!LdrpInitializeThread+0x8b
03 00000000`3b58f9c0 00007ffa`bd064b63     ntdll!LdrpInitialize+0x408
04 00000000`3b58fa60 00007ffa`bd064b0e     ntdll!LdrpInitialize+0x3b
05 00000000`3b58fa90 00000000`00000000     ntdll!LdrInitializeThunk+0xe

and the second sample:

[12182.719s][trace][safepoint] Thread: 0x00000000e5aeb800  [0x416ac] State: _running _has_called_back 0 _at_poll_safepoint 0
[12182.721s][info ][safepoint] Still running: thread =  00000000E5AEB800; thread_id = 416ac; thread_name = tyrus-70-thread-1
[12182.721s][trace][safepoint] Thread: 0x00000000e5aeb800  [0x416ac] State: _running _has_called_back 0 _at_poll_safepoint 0
[12182.723s][info ][safepoint] Still running: thread =  00000000E5AEB800; thread_id = 416ac; thread_name = tyrus-70-thread-1
[12182.723s][trace][safepoint] Thread: 0x00000000e5aeb800  [0x416ac] State: _running _has_called_back 0 _at_poll_safepoint 0
[12182.725s][info ][safepoint] Still running: thread =  00000000E5AEB800; thread_id = 416ac; thread_name = tyrus-70-thread-1
.253  Id: 33c10.416ac Suspend: 0 Teb: 00000000`c5e38000 Unfrozen
      Start: ucrtbase!thread_start<unsigned int (__cdecl*)(void *),1> (00007ffa`bae61b70)
      Priority: 0  Priority class: 32  Affinity: fff
 # Child-SP          RetAddr               Call Site
00 00000000`4dbef898 00007ffa`bd04ea52     ntdll!NtWaitForSingleObject+0x14
01 00000000`4dbef8a0 00007ffa`bd0077c3     ntdll!LdrpDrainWorkQueue+0x15e
02 00000000`4dbef8e0 00007ffa`bd064f84     ntdll!LdrpInitializeThread+0x8b
03 00000000`4dbef9c0 00007ffa`bd064b63     ntdll!LdrpInitialize+0x408
04 00000000`4dbefa60 00007ffa`bd064b0e     ntdll!LdrpInitialize+0x3b
05 00000000`4dbefa90 00000000`00000000     ntdll!LdrInitializeThunk+0xe 

I still think that the issue must be fixed in JNA. Maybe detaching a thread from the callback is in line with the requirements, but requirements tend to become obsolete like everything else. Beginning with Windows 10 it may cause LDR deadlocks, and therefore it makes no sense.

@neilcsmith-net
Contributor

I still think that the issue must be fixed in JNA.

Well, I'm sure the current JNA team would appreciate your contributed fix! Along with all required tests. Of course, the options for alternative thread-local storage behaviour are minimal, but FlsAlloc might now be an option.

A few points (I don't intend to follow up further on this here):

  • this is fallback behaviour in JNA, and can only happen at thread exit. It is opt-in and controllable. Its use in both the GStreamer bindings and your client's code is configurable.
  • even if you control its use in JNA, you also need to be aware that the GLib code underlying GStreamer is probably using the exact same DllMain mechanism for its thread-local storage, which might also call into the bindings and then into the JVM with the same lock.
  • JNA is definitely not the only library that calls into the JVM inside DllMain for similar reasons. I've seen that pattern in a number of other things recently. Which might lead to a suggestion the fix belongs in the JVM!
  • Is it always the Keep-Alive-Timer thread that's blocked from reaching a safepoint? You still haven't answered the question I posed earlier - what is it doing that it is blocked on the LDR lock without being at a safepoint already? Anything on the native side of the JNI barrier is meant to be at a safepoint.
