Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock with RtlLookupFunctionEntry-based stack unwinding on 64-bit Windows 10 #12

Closed
michaelweiser opened this issue Jun 2, 2021 · 18 comments

Comments

@michaelweiser
Copy link

Starting a 64-bit Winword on current Windows 10 under observation by capemon gets stuck very early on. Attaching to the stuck winword.exe shows the following call stack:

ldrinvertedfunctiontable

Apparently, ntdll!RtlInsertInvertedFunctonTable is called and (according to disassembly of the function) exclusively acquires Slim Reader/Writer lock (SRW) ntdll!LdrpInvertedFunctionTableSRWLock. This would make sense as the function is likely to modify import tables.
While holding the lock, it calls ntdll!LdrProtectMrData which, according to my debugging, eventually calls ntdll!NtProtectVirtualMemory, likely to protect access to those tables (again). Since ntdll!NtProtectVirtualMemory is hooked by capemon, this triggers capemon_x64!enter_hook in order to decide whether to enter the capemon hook for that function or not. This decision is, amongst other things. based on whether the hook is itself called from a hook. To determine this, the stack is unwound by calling capemon_x64!our_stackwalk which on x64 is implemented using ntdll!RtlLookupFunctionEntry.
Unfortunately, this function does not appear to be safe for this kind of thing because it also acquires the already exclusively held ntdll!LdrpInvertedFunctionTableSRWLock. This leads to the observed deadlock.

Disabling hooking of ntdll!NtProtectVirtualMemory without involvement of stack unwinding mitigates the issue:

diff --git a/hooking.c b/hooking.c
index 46e6560..d0d566f 100644
--- a/hooking.c
+++ b/hooking.c
@@ -259,6 +260,9 @@ int WINAPI enter_hook(hook_t *h, ULONG_PTR sp, ULONG_PTR ebp_or_rip)
 {
        hook_info_t *hookinfo;

+       if (h->new_func == &New_NtProtectVirtualMemory)
+               return 0;
+
        if (h->fully_emulate)
                return 1;

This, however, leaves a massive blindspot regarding all calls to that function. This also only works because ntdll!NtProtectVirtualMemory appears to be the only hooked function called while ntdll!LdrpInvertedFunctionTableSRWLock is held.

Trying to hook ntdll!RtlInsertInvertedFunctonTable to temporarily disable hooking for all other APIs called from it (again without involving stack-based decisions) have not been successful because the hook does not seem to be called:

diff --git a/hook_special.c b/hook_special.c
index e0ad8ea..a81cbdc 100644
--- a/hook_special.c
+++ b/hook_special.c
@@ -39,6 +39,15 @@ static int bits_sent = 0;
 static int tasksched_sent = 0;
 static int interop_sent = 0;
 
+extern int g_hooking_disabled;
+
+HOOKDEF(VOID, WINAPI, RtlInsertInvertedFunctionTable, __in PVOID ImageBase, __in ULONG SizeOfImage) {
+	g_hooking_disabled = 1;
+	DebugOutput("RtlInsertInvertedFunctionTable called");
+	Old_RtlInsertInvertedFunctionTable(ImageBase, SizeOfImage);
+	g_hooking_disabled = 0;
+}
+
 HOOKDEF_NOTAIL(WINAPI, LdrLoadDll,
 	__in_opt	PWCHAR PathToFile,
 	__in_opt	PULONG Flags,
diff --git a/hooking.c b/hooking.c
index 46e6560..361b250 100644
--- a/hooking.c
+++ b/hooking.c
@@ -42,6 +42,7 @@ along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #define HOOK_RATE_LIMIT 0x100
 #define HOOK_LIMIT 0x10000
 
+int g_hooking_disabled = 0;
 static lookup_t g_hook_info;
 lookup_t g_caller_regions;
 
@@ -259,6 +260,9 @@ int WINAPI enter_hook(hook_t *h, ULONG_PTR sp, ULONG_PTR ebp_or_rip)
 {
 	hook_info_t *hookinfo;
 
+	if (g_hooking_disabled)
+		return 0;
+
 	if (h->fully_emulate)
 		return 1;
 
diff --git a/hooks.c b/hooks.c
index 6613af9..277af9d 100644
--- a/hooks.c
+++ b/hooks.c
@@ -184,6 +186,8 @@ hook_t full_hooks[] = {
 	HOOK_NOTAIL_ALT(kernelbase, MoveFileWithProgressTransactedW, 6),
 	HOOK_NOTAIL_ALT(kernel32, MoveFileWithProgressTransactedW, 6),
 
+	HOOK(ntdll, RtlInsertInvertedFunctionTable),
+
 	// File Hooks
 	HOOK(ntdll, NtQueryAttributesFile),
 	HOOK(ntdll, NtQueryFullAttributesFile),
diff --git a/hooks.h b/hooks.h
index 38203c6..d2d9a74 100644
--- a/hooks.h
+++ b/hooks.h
@@ -22,6 +22,10 @@ along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #include "ntapi.h"
 #include <tlhelp32.h>
 
+HOOKDEF(VOID, WINAPI, RtlInsertInvertedFunctionTable,
+	__in PVOID ImageBase, __in ULONG SizeOfImage
+);
+
 //
 // File Hooks
 //

diff --git a/misc.h b/misc.h
index 1d9ac32..2e222d1 100644
--- a/misc.h
+++ b/misc.h
@@ -101,6 +101,10 @@ typedef HRESULT (WINAPI *_ProgIDFromCLSID)(
 	_Out_ LPOLESTR *lplpszProgID
 );
 
+typedef VOID(WINAPI* _RtlInsertInvertedFunctionTable)(
+	_In_ PVOID						  ImageBase,
+	_In_ ULONG						  SizeOfImage);
+
 void resolve_runtime_apis(void);
 
 DWORD parent_process_id(); // By Napalm @ NetCore2K (rohitab.com)

Another idea I had (and found elsewhere: https://microsoft.public.win32.programmer.kernel.narkive.com/qxCAoEXI/using-rtllookupfunctionentry-for-profiling) was to try to acquire the lock from our_stackwalk to see if it was held or free and RtlLookupFunctionEntry would block or not. Unfortunately, the symbol is not exported from ntdll, so I cannot get at its address.

Other projects have run into this problem as well and proposed a number of solutions, e.g.: dotnet/runtime#32286, DynamoRIO/drmemory#1222

The issue seems to be somewhat Windows-10-specific, because the same capemon_x64.dll is able to start up and monitor the same version of 64-bit Winword on a 64-bit Windows 7 without above workarounds. My guess is that import table mechanics, at least regarding memory protections on them, have changed between Windows 7 and Windows 10. I have not analysed the differences in detail though.

Is my understanding of the mechanics at play correct?
Could my attempts at hooking RtlInsertInvertedFunctionTable or inspecting the state of LdrpInvertedFunctionTableSRWLock from our_stackwalk be made to work somehow?
Any ideas what else could be done about this issue?

Thanks!

@kevoreilly
Copy link
Owner

kevoreilly commented Jun 9, 2021

Michael again a massive thank you for this work. This deadlock is a killer and it would be great to swat it and get win10 compatibility closer to where it should be.

Your ideas are all great - just like the way I check for loader lock but there it's so much easier as it's in the peb!

But I think there is a solution to be had here. I found this from 2012 but the idea is spot on: http://workblog.pilin.name/2012/10/how-to-get-x64-dynamic-function-table.html

So I've just tried adding a simple hook for AcquireSRWLockExclusive as for Win10 we needn't worry about the critical section mentioned in that post. However I haven't yet seen it fire. Please do give it a try - once we have the hook working we can set a flag to enable pointer capture, then call the bogus RtlAddFunctionTable to get the pointer, then disable capture, delete the entry then add a check to enter_hook to immediately return 0 if SRWlock held. That's the plan anyway! Let me know if you want a compiled monitor to test with.

HOOKDEF(void, WINAPI, AcquireSRWLockExclusive,
	__in PSRWLOCK SRWLock
)
{
	int ret = 0;
	Old_AcquireSRWLockExclusive(SRWLock);
	LOQ_void("misc", "p", "SRWLock", SRWLock);
	DebugOutput("AcquireSRWLockExclusive: SRWLock 0x%p", SRWLock);
	return;
}

@michaelweiser
Copy link
Author

Hi Kevin, I've made some progress although no breakthrough yet.

I've added a call to RtlAddFunctionTable() to capemon's DllMain() and I can see the hook on AcquireSRWLockExclusive() learning the address of an SRW from that call. So the mechanics seem to be working in principle.

But unfortunately it's the wrong SRW, i.e. LdrpMrDataLock, and RtlAddFunctionTable() gets stuck trying to acquire it. My guess is that it's held by a routine that calls capemon's DllMain(). A DLL that's being initialised going around modifying the function table is likely not something Microsoft wants people to do. :/

Here's a screenshot of WinDbg of that state with output from the analyser:
mrdata-deadlock

The code is at michaelweiser@36fb25b.

So we'd likely need to defer the SRW discovery to a routine that's not part of DLL initialisation itself. Any idea of a good candidate?

Other stuff I've learned:

  • we've got to be incredibly careful where to put DebugOutput()'s because that's taking out locks on SRWs as well, worst case creating infinite recursion if there's a DebutOutput() in AcquireSRWLockShared(). DebugOutput() being called while our discovery flag is switched on will make us learn the address of the wrong SRW.
  • after seeing some duplicate debug output from what seemed to be threads racing each other I sprinkled in some serialisation using a critical section - it's unclear to me at this point if that is really necessary
  • I've added some short-circuit dispatch for the SRW locking hooks into enter_hook as before so we don't trigger calls to RtlLookupFunctionEntry too early - not all of that will eventually be necessary, I assume
  • finally I've added a hard guard against any unwinding in our_stackwalk() until we've learned the SRW - it never really came into play up to now

I've also tried instead calling RtlLookupFunctionEntry() from DllMain() but apparently didn't provide it with the right parameters because it didn't even try to take out a lock on any SRW. If you have an idea how we could construct a stack frame for it to unwind which would reliably trigger use of the SRW, this might be a nice approach because it'd work without modifying any system structures like RtlAddFunctionTable() does.

Trying a totally different approach, I've also attempted to learn the address of LdrpInvertedFunctionTableSRWLock from calls to AcquireSRWLockShared() by RtlLookupFunctionEntry() from our_stackwalk() in hopes that there would be some non-deadlocking ones first which take out the SRW in question before running into the case of that lock already being held exclusively. Unfortunately we don't seem to be that lucky and it seems to deadlock on the first call already.

I have a lingering suspicion that some of the above stuff didn't work for me because I ran afoul of the complexity of the hook call chains involved and was just staring at bogus output or not reading it right. So any critical feedback on and maybe testing of the code is highly welcome.

I'll be away starting tomorrow all next week so don't be alarmed if I fall silent for some time. @Jack28 is in the loop what we're trying to achieve and can maybe do some additional tests.

@michaelweiser
Copy link
Author

Looking at this again today, I read the disassembly of RtlAddFunctionTable() not to use LdrpInvertedFunctionTableSRWLock at all. So this approach could really only be used to get at LdrpMrdataLock and RtlpDynamicFunctionTableLock.

RtlLookupFunctionEntry() on the other hand seems to have the equivalent of AcquireSRWLockShared/Exclusive() inlined when accessing LdrpInvertedFunctionTableSRWLock which would explain why it does not appear in the backtrace and directly ends up at NtWaitForAlertByThreadId(). So hooking that function might get us the SRW but would require to get it exclusively locked first.

Searching the disassembly of ntdll in WinDbg reveals that RtlRemoveInvertedFunctionTable() accesses LdrpInvertedFunctionTableSRWLock unconditionally and uses RtlAcquireSRWLockExclusive() to do so. This is likely going to be my next line of attack on the problem then. First step would be to figure out its prototype and call semantics, e.g. if it will gracefully fail if called with invalid data which would be ideal for our purpose.

@kevoreilly
Copy link
Owner

Hi Michael, I too have been exploring a couple of possibilities; the most promising is using yara to locate RtlInsertInvertedFunctionTable and hence LdrpInvertedFunctionTableSRWLock, since we already have it compiled into the monitor. This would be nice and unintrusive, find via yara then add check to stackwalk function... I'll let you know how I get on...

@michaelweiser
Copy link
Author

After some more digging, calling RtlInsertInvertedFunctionTable() and RtlRemoveInvertedFunctionTable() directly is off the table because they're not exported from ntdll. All other functions calling them are either not exported themselves or way too high-level already.

Before spending more time on this I'd love to hear if you found a way to discover those functions or the SRW at runtime. Because they don't seem to be identified by name or any other means within the DLL (because they're not exported), I could only think of using the debug symbols (which differ and would need to be downloaded for every build of ntdll) or matching some distinctive code sequence, e.g. cmpxchg being used to access the SRW and its surroundings (but which might also change depending on compiler flags used by upstream when compiling it). There's very likely a relocation for each private function and the SRW in the DLL image but those can't be distinguished from all the others either.

I'm now wondering if it might actually be simpler to try and use the canary thread approach suggested by the dotnet guys, i.e. have another thread call RtlLookupFunctionEntry() just to see if that blocks and if it does not report back within a couple of microseconds refrain from calling the function from our_stackwalk(). Once the hooked function in question finishes and releases the SRW, the canary thread should then unblock automatically as well.

@kevoreilly
Copy link
Owner

Hi Michael, I've managed to avoid the deadlock by using yara to locate RtlInsertInvertedFunctionTable and hence LdrpInvertedFunctionTableSRWLock. Here's a test build of capemon for you to try - let me know if it works!
capemon_x64.zip

@michaelweiser
Copy link
Author

Hi Kevin, I've tried it. Unfortunately, it still hangs but at a different point. See below screenshot for the call stack.

Unfortunately, your build doesn't include symbols. So I can not do much more digging into it to tell the reason for the hang. What I can tell is that RtlLookupFunctionTable is called and tries to acquire ntdll!LdrpInvertedFunctionTableSRWLock and gets stuck once more.

Can you provide a debug build or, even better, the source code of your modifications?

BTW: I'll be unavailable next week but once again @Jack28 is in the loop and happy to try out what you throw at him.
call-stack

@kevoreilly
Copy link
Owner

Aha I've just realised yara is disabled with office settings - can you try launching without?

@kevoreilly
Copy link
Owner

Happy to share source of course - bit of a mess currently as it's been hacked together but I'll try and tidy it up so I can share

@kevoreilly
Copy link
Owner

LdrpInvertedFunctionTableSRWLock.zip
Here is the code

@michaelweiser
Copy link
Author

@Jack28 and now I have been looking into this again. In addition to setting the office option to 0 in the doc and doc2016 packages we needed to remove data\yara\capemon.yac to get the new rule added to that cache(?). With that we see:

2021-07-20 12:36:44,256 [root] DEBUG: YaraInit: Compiled rules loaded from existing file C:\word-2016-dbg\data\yara\capemon.yac
2021-07-20 12:36:44,256 [root] DEBUG: YaraScan: Scanning 0x00007FFBE0E10000, size 0x1f5000
2021-07-20 12:36:44,271 [root] DEBUG: YaraScan: AccessibleSize 0x1f5000
2021-07-20 12:36:44,287 [root] DEBUG: YaraScan rule did not match.
[...]
2021-07-20 12:36:44,302 [root] DEBUG: YaraScan rule did not match.
2021-07-20 12:36:44,302 [root] DEBUG: YaraScan hit: RtlInsertInvertedFunctionTable
2021-07-20 12:36:44,318 [root] DEBUG: YaraScan string match: $function (0x1090e)
2021-07-20 12:36:44,318 [root] DEBUG: RtlInsertInvertedFunctionTable 0x00007FFBE0E2090E, LdrpInvertedFunctionTableSRWLock 0x00007FFBE0F7B4F0
2021-07-20 12:36:44,318 [root] DEBUG: YaraScan: successfully scanned 0x00007FFBE0E10000

The address correctly comes back as ntdll!LdrpInvertedFunctionTableSRWLock+0 in WinDbg.
critsec
Great stuff! I'd love to understand how exactly you extract the offset (relative to what?) from the lea instruction using

LdrpInvertedFunctionTableSRWLock = (PVOID)((PBYTE)RtlInsertInvertedFunctionTable + *(DWORD*)((PBYTE)RtlInsertInvertedFunctionTable + 3) + 7);

but I can accept with my limitations since it works. ;)

Winword still doesn't fully start though. Looking at the source code it seems the logic is not fully active yet, ending up in disabling hook cycle detection. Should we try and dig into that or are you on it already?

Or is the code

        if (*(PVOID*)LdrpInvertedFunctionTableSRWLock)
                return TRUE;

meant to be a lightweight version of

        //if (TryAcquireSRWLockExclusive((PSRWLOCK)&LdrpInvertedFunctionTableSRWLock)) {
        //      ReleaseSRWLockExclusive((PSRWLOCK)&LdrpInvertedFunctionTableSRWLock);
        //      return FALSE;
        //}

?

For the latter I wonder whether its potential failure to function may be due to PVOID LdrpInvertedFunctionTableSRWLock being a (typeless) pointer to the SRWLOCK structure (aka PSRWLOCK) already and taking the address of that pointer variable again ((PSRWLOCK)&LdrpInvertedFunctionTableSRWLock) makes the function work with the wrong address. I may just have knots in my brain though. :)

@kevoreilly
Copy link
Owner

kevoreilly commented Nov 12, 2021

I found that this issue was affecting a range of 64-bit programs, particularly gui apps. As I still haven't a 64-bit Office to test with, I decided to look at another easier app that was similarly affected to test on: x64dbg. With this as a test case, the code committed in aa1fd55 fixes the issue. The app no longer hangs on this deadlock and loads on 64-bit Win10:

image

So I am led to think that this specific issue is solved, and that therefore a separate distinct issue is now preventing Word from opening. Certainly the logic committed is tested and working, the commented-out test code with the TryAcquireSRWLockExclusive api was just that; I tested this and it didn't work. When I tried testing the 'raw' lock value that worked, so I went with that.

As for the code to locate the lock, if you look at the disassembly of RtlInsertInvertedFunctionTable, the logic relies entirely on this line:

image

The last 4 bytes of this instruction represent little-endian offset so 0x13954F. This is relative to the end of the instruction which is the VA 0x7FFC71CE2F8A plus the size 7, so 0x7FFC71CE2F91. Adding the two gives 0x7FFC71E1C4E0 which is the address of the lock:

image

Please let me know if you agree that this issue is fixed, then we can perhaps create a new issue for that which now prevents Word from launching.

@kevoreilly
Copy link
Owner

Any feedback on this? I am still of the mind that this specific issue is fixed...

@michaelweiser
Copy link
Author

Thanks for your update and explanations! Doing some tests with Winword 2013 and 2016, the new logic reliably prevents the deadlock on startup. Both crash soon thereafter but those seem to be unrelated problems. I'll try and dig into those and report any separate issues I may be able to identify. I think this issue here can be closed. Thanks again!

@kevoreilly
Copy link
Owner

Well worth trying options like minhook=1 or other hook exclusions to try to rule out a hooking issue.

@michaelweiser
Copy link
Author

No change in behaviour with minhook=1 in a quick try. Will dig some more and report in a separate issue if I come up with anything.

@RaduEmanuel92
Copy link

RaduEmanuel92 commented Nov 9, 2022

Hello @kevoreilly !
I believe that I've identified another case of this issue still reproducing in another x64-compiled binary.
Here is the C code necessary for repro:

#define _WIN32_WINNT 0x0400
#define _WIN32_DCOM

#include <stdio.h>
#include <tchar.h>
#include <windows.h>
#include <wbemidl.h>
#include <combaseapi.h>

int main(void)
{

    HRESULT hr = 0;

    // BSTR strings we'll use (http://msdn.microsoft.com/en-us/library/ms221069.aspx)
    BSTR resource = SysAllocString(L"ROOT\\CIMV2");
    BSTR language = SysAllocString(L"WQL");

    BSTR cpu_query = SysAllocString(L"SELECT * FROM Win32_Processor");

    // initialize COM
    wprintf(L"%s\r\n", L"[*] Initialize COM ...");
    hr = CoInitializeEx(0, COINIT_MULTITHREADED);

    wprintf(L"%s\r\n", L"[*] Initialize COM Security ...");
    hr = CoInitializeSecurity(NULL, -1, NULL, NULL, RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE, NULL, EOAC_NONE, NULL);
    
    // COM interface pointers
    IWbemLocator* locator = NULL;
    IWbemServices* services = NULL;
    IEnumWbemClassObject* results = NULL;

    // connect to WMI
    wprintf(L"%s\r\n", L"[*] Connect to WMI ...");
    hr = CoCreateInstance(&CLSID_WbemLocator, 0, CLSCTX_INPROC_SERVER, &IID_IWbemLocator, (LPVOID*)&locator);
    hr = locator->lpVtbl->ConnectServer(locator, resource, NULL, NULL, NULL, 0, NULL, NULL, &services);

    // issue a WMI query
    wprintf(L"%s\r\n", L"[*] Execute query ...");
    hr = services->lpVtbl->ExecQuery(services, language, cpu_query, WBEM_FLAG_BIDIRECTIONAL, NULL, &results);

    // list the query results
    if (results != NULL) {
        IWbemClassObject* result = NULL;
        ULONG returnedCount = 0;
        wprintf(L"%s\r\n", L"CPU data:");
        // enumerate the retrieved objects
        while ((hr = results->lpVtbl->Next(results, WBEM_INFINITE, 1, &result, &returnedCount)) == S_OK) {
            VARIANT name;
            VARIANT speed;

            // obtain the desired properties of the next result and print them out
            hr = result->lpVtbl->Get(result, L"Name", 0, &name, 0, 0);
            hr = result->lpVtbl->Get(result, L"MaxClockSpeed", 0, &speed, 0, 0);
            
            wprintf(L"%s, %dMHz\r\n", name.bstrVal, speed.intVal);

            // release the current result object
            result->lpVtbl->Release(result);
        }
    }
    // release WMI COM interfaces
    results->lpVtbl->Release(results);
    services->lpVtbl->Release(services);
    locator->lpVtbl->Release(locator);

    // unwind everything else we've allocated
    CoUninitialize();

    SysFreeString(cpu_query);
    SysFreeString(language);
    SysFreeString(resource);

    system("pause");
    return 0;
}

wbemuuid.lib` needs to be added in Linker/Input in VS project properties.

Compiling and running the binary under loader leads to a hang:
image

Again, as above, deactivating the NtProtectVirtualMemory hook leads to expected behavior:
image

image

I've tried various things to solve the issue.

  1. Firstly, I wanted to see how many times the enter_hook method is called by adding a logging message:
    image
    By only doing this small change, the code hangs again. Which leads me to think that sending data to the debugger is also interfering with the execution?

  2. Trying to modify the allow_hook_recursion to 0 is not changing the outcome.

  3. As a workaround, I was thinking about implementing hooks for VirtualProtect and VirtualProtectEx, and disable the ntdll hooks, at least for windows 10 x64.

  4. Ultimately, I am thinking that the regexes used in YaraHarness.c looking for RtlInsertInvertedFunctionTable are not matching this type of execution?

As a side note, looking for the system processes and installed application using WMI, exposes the sandbox-related processes and applications to the malware process. At the moment, these are protected only against listing the processes with CreateToolHelp32Snapshot WIN API. This means that querying WMI is a viable cape sandbox evasion. Implementing behavior-changing hooks against WMI queries should mitigate this.

I will attempt a pull request w.r.t the above observation.

Ultimately, thank you for this great sandbox!

Edit: typos.

@kevoreilly
Copy link
Owner

Thank you @RaduEmanuel92 for reporting this issue. It is the same as issue #49 reported today, now fixed in 65f4e2f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants