Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DoStackSnapshot (async) deadlock #32286

Closed
iskiselev opened this issue Feb 14, 2020 · 3 comments
Closed

DoStackSnapshot (async) deadlock #32286

iskiselev opened this issue Feb 14, 2020 · 3 comments
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation.
Milestone

Comments

@iskiselev
Copy link

We were affected by deadlock introduced with profiler async call DoStackSnapshot. We have seen it on .Net Framework, but it should be applicable to any version of .Net Core on Windows too (only Windows, as other OS do not support async DoStackSnapshot).

Here is condition. Profiler thread suspended application thread before calling DoStackSnapshot.
Application thread has next call stack:

00 ntdll!NtProtectVirtualMemory+0xa
01 ntdll!LdrpChangeMrdataProtection+0x4e
02 ntdll!LdrProtectMrdata+0x46
03 ntdll!RtlInsertInvertedFunctionTable+0xdf
04 ntdll!LdrpProcessMappedModule+0x1fd
05 ntdll!LdrpMapAndSnapModules+0xa0
06 ntdll!LdrpPrepareModuleForExecution+0xcc
07 ntdll!LdrpLoadDll+0x36b
08 ntdll!LdrLoadDll+0x99
09 KERNELBASE!LoadLibraryExW+0xca
0a crypt32!LoadDll+0x8c
0b crypt32!CryptGetOIDFunctionAddress+0x2d8
0c crypt32!_CallIsMyFileType2+0x57
0d crypt32!_QueryRegisteredIsMyFileType+0xd4
0e crypt32!CryptSIPRetrieveSubjectGuid+0x384
0f crypt32!GetEmbeddedPKCS7+0x71
10 crypt32!I_CryptQueryObject+0x176
11 crypt32!CryptQueryObject+0x245
12 clr!COMX509Certificate::QueryCertFileType+0x1b4
13 [HelperMethodFrame_1OBJ: 00000051aed7e398] System.Security.Cryptography.X509Certificates.X509Utils._QueryCertFileType(System.String)
14 System.Security.Cryptography.X509Certificates.X509Certificate.LoadCertificateFromFile(System.String, System.Object, System.Security.Cryptography.X509Certificates.X509KeyStorageFlags)
15 System.Security.Cryptography.X509Certificates.X509Certificate.CreateFromCertFile(System.String)

Profiler thread has next call stack:

00 ntdll!NtWaitForAlertByThreadId+0xa
01 ntdll!RtlAcquireSRWLockShared+0x128
02 ntdll!RtlpxLookupFunctionTable+0x40
03 ntdll!RtlLookupFunctionEntry+0x153
04 clr!LazyMachState::unwindLazyState+0x119
05 clr!HelperMethodFrame::GetFunction+0xf6
06 clr!StackFrameIterator::NextRaw+0xa5f
07 clr!StackFrameIterator::Filter+0x27e
08 clr!StackFrameIterator::Init+0x185
09 clr!Thread::StackWalkFramesEx+0x86
0a clr!Thread::StackWalkFrames+0xbe
0b clr!ProfToEEInterfaceImpl::DoStackSnapshotHelper+0x76
0c clr!ProfToEEInterfaceImpl::DoStackSnapshot+0x31c

Application thread holds CritSec ntdll!LdrpLoaderLock lock and looks like it also holds SRW lock, that probably was acquired in ntdll!RtlpxLookupFunctionTable. Profiler thread tries to acquire the same SRW lock in ntdll!RtlpxLookupFunctionTable. In our case application thread acquired some more locks, resulting in full dead-lock of all application thread.

Looks like the application calls CreateFromCertFile often, which calls KERNELBASE!LoadLibraryExW internally on each attempt. So, this application reveal issue easily - while for most other application dead lock though loading library will be low-probability, a libraries usually are loaded finite amount of times.

We are discussing possible workarounds for this problem. Is it possible to predict if it is safe to call DoStackSnapshot at some particular moment? Will it be safe to move calling DoStackSnapshot into separate thread and kill that thread if no progress have been after some time? If it is not safe - may it introduce data corruption/additional locks or only some small memory leaks?

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 14, 2020
@AaronRobinsonMSFT
Copy link
Member

/cc @noahfalk @davmason

@davmason
Copy link
Member

Hi @iskiselev, I am not super surprised you are seeing this sort of issue, the asynchronous mode of pausing a thread at a random state is pretty dangerous. Here are the ways I know to try and alleviate this:

  • As you suggest, you can monitor your snapshot thread and kill it if it pauses for too long. As long as you make sure to resume the thread you were trying to walk there shouldn't be any long term issues
  • One method we've used to solve similar problems in the past is to have a "canary" thread that can receive messages telling it to try and take locks. Then once you suspend a thread for DoStackSnapshot you can tell the canary thread the try and take the locks you're worried about (in this case loader, but there are other OS locks you can run in to the same issue with). If the thread reports that it can take the lock, you likely will be able to as well. If it pauses for a long amount of time, it's probably dangerous. This approach has the benefit that you don't need to kill any threads. Once you determine it's dangerous and resume the suspended thread, it should release any locks and then the canary thread should resume normal operation.
  • Use native stackwalking APIs to apply heuristics to see if the thread is stopped in a dangerous place. We don't use this approach internally so I can only brainstorm what heuristics would be used. A broad one would be don't try to walk a stack if threads are stopped in ntdll, although it may be too broad and lose too many samples.

In coreclr 3.0 we added the ICorProfilerInfo10::SuspendRuntime API to make profilers able to sample on Linux, but it also solves this problem. The runtime will use its internal suspension mechanisms that it uses for GCs to stop the runtime, and then all managed threads will be in a safe state to sample. Since we control the suspension, you never have to worry about threads holding locks outside of the CLR's control. And since we use stackwalking during suspensions for the GC you can also be guaranteed you'll never have to worry about any internal deadlocks either.

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Mar 27, 2020
@tommcdon tommcdon added this to the 5.0 milestone Mar 27, 2020
@tommcdon tommcdon added the question Answer questions and provide assistance, not an issue with source code or documentation. label Mar 27, 2020
@tommcdon tommcdon modified the milestones: 5.0.0, 6.0.0 Jul 27, 2020
@tommcdon
Copy link
Member

Closing issue as it looks like the question has been answered. Please let us know if this is not correct and we will re-activate.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation.
Projects
None yet
Development

No branches or pull requests

5 participants