Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: Use GetFinalPathNameByHandle for Process.open_files #2190

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jschwartzentruber
Copy link

Summary

Description

This uses GetFinalPathNameByHandle which was added in Vista/Server 2008. This fixes #1967 based on the steps to reproduce there (only tested on Windows 11).

@jschwartzentruber jschwartzentruber marked this pull request as draft January 8, 2023 03:21
@jschwartzentruber
Copy link
Author

I'll try to reproduce the CI failures on other Windows versions.

@giampaolo
Copy link
Owner

giampaolo commented Jan 8, 2023

Interesting. FYI, the precarious situation with open_files() on Windows goes back to 9 years ago and never got fully resolved. Searching for NtQueryObject also gives an idea on the amount troubles caused by this API over the years. Some considerations:

  1. according to this old comment, which is of course unverified, GetFileInformationByHandle would still block, and you should use GetFileInformationByHandleEx instead. Also I see there are some interesting comments to the original blog post, which of course are also unverified.
  2. ProcessHacker is probably the best source of truth here. I looked at ProcessHacker source code and unfortunately it doesn't use GetFinalPathNameByHandle/Ex, it uses NtQueryObject, like we do, so that makes me skeptical. CC @dmex / @wj32 in case they have some comment on this (I hope you won't mind my CC).

Just to make things clear, a proper solution would have to:

  1. NOT hang, and to my understanding the current code should not because we spawn threads
  2. fix this bug [OS] Process.open_files() unsafe API usage can lead to deadlock #1967, which currently affects us

@Pololot64
Copy link

I have been looking into this as well. I normally do pure python but does the deadlock refer to the computer hanging for about half a second every time open_files() is used? It seems to me that NtQuerySystemInformation is what causes the hang even with this fixed code.

@jschwartzentruber
Copy link
Author

according to this old comment, which is of course unverified, GetFileInformationByHandle would still block, and you should use GetFileInformationByHandleEx instead. Also I see there are some interesting comments to the original blog post, which of course are also unverified.

I looked into this, including the upstream bug report, and it was closed by MS as "by design", the description "for certain type of handles", and one of the comments says to use GetFileType to check only disk files. This all leads me to believe the author was not filtering for pipes or directory handles.

it uses NtQueryObject, like we do

I'm also curious about this. Is it possible that NtQueryObject is fine, and psutil should just filter handles differently? Even so, it would be better to use a supported API as long as it doesn't regress functionality or add bugs.

NOT hang, and to my understanding the current code should not because we spawn threads

I will add back the thread wrapper. I removed it optimistically, but I'm sure there would be situations where even a "fixed" API could hang (slow/broken SMB connections, etc.).

@jschwartzentruber
Copy link
Author

does the deadlock refer to the computer hanging for about half a second every time open_files() is used? It seems to me that NtQuerySystemInformation is what causes the hang even with this fixed code.

This is where the call never completes, and is unrecoverable. That's the issue described in #1967 and what is affecting me.

If you're trying my fix, I prematurely removed the threading code, so that may re-introduce #340. I'm going to add the thread back but around the newer API and test that on older platforms.

@giampaolo
Copy link
Owner

giampaolo commented Jan 9, 2023

Is it possible that NtQueryObject is fine, and psutil should just filter handles differently?

Yes, it is possible that psutil's use of GetFileType is the problem (or a problem). ProcessHacker doesn't use it (which is already a red flag). If I'm reading the code right, it uses NtQueryObject with ObjectTypesInformation instead, so I definitively think this is very well worth a try.

Also, there's an interesting comment in PH source code:

    // There are three ways of enumerating handles:
    // * On Windows 8 and later, NtQueryInformationProcess with ProcessHandleInformation is the most efficient method.
    // * On Windows XP and later, NtQuerySystemInformation with SystemExtendedHandleInformation.
    // * Otherwise, NtQuerySystemInformation with SystemHandleInformation can be used.

We use NtQuerySystemInformation with SystemExtendedHandleInformation (second method) whereas ProcessHacker uses NtQueryInformationProcess with ProcessHandleInformation on Windows 8+ (first method). Probably this will only make a difference in terms of performance, but may be also worth a try.

@dmex
Copy link

dmex commented Jan 10, 2023

I looked at ProcessHacker source code and unfortunately it doesn't use GetFinalPathNameByHandle/Ex, it uses NtQueryObject, like we do

@giampaolo

Process Hacker uses the Native API instead of the Win32 API. GetFinalPathNameByHandle/Ex are both Win32 APIs exposing a limited number of information classes from the NtQueryInformationFile and NtQueryDirectoryFile Native APIs.

GetFileInformationByHandleEx is used by Process Hacker but you need to search for NtQueryInformationFile 😉

Yes, it is possible that psutil's use of GetFileType is the problem (or a problem). ProcessHacker doesn't use it (which is already a red flag).

GetFileType is another Win32 API and it's internally calling the NtQueryVolumeInformationFile Native API with the FileFsDeviceInformation information class.

Process Hacker does use GetFileType but you need to search for NtQueryVolumeInformationFile and FileFsDeviceInformation 😉

We've never run into issues with GetFileType/NtQueryVolumeInformationFile(FileFsDeviceInformation) and it's always successfully returned the DeviceType for the handle without deadlocking unlike NtQueryObject and NtQueryInformationFile: https://github.com/winsiderss/systeminformer/blob/f17d0dc7c770fb2453bb604c39a43508bd692e07/SystemInformer/hndlprp.c#L1337-L1374

it uses NtQueryObject, like we do, so that makes me skeptical
This blog post (...) suggests using GetFileInformationByHandleEx

NtQueryInformationFile (GetFinalPathNameByHandle/Ex) deadlocks just like NtQueryObject does especially on Windows 10/11 for named pipes (FILE_DEVICE_NAMED_PIPE), console handles (FILE_DEVICE_CONSOLE) and some other device types.

You can see here how we've had to use the same workaround for NtQueryObject with NtQueryInformationFile here:
https://github.com/winsiderss/systeminformer/blob/f17d0dc7c770fb2453bb604c39a43508bd692e07/phlib/hndlinfo.c#L2314-L2323
and here:
https://github.com/winsiderss/systeminformer/blob/f17d0dc7c770fb2453bb604c39a43508bd692e07/phlib/hndlinfo.c#L2425-L2442
and here:
https://github.com/winsiderss/systeminformer/blob/f17d0dc7c770fb2453bb604c39a43508bd692e07/SystemInformer/hndlprp.c#L1169-L1172

Note: These days we're mostly avoiding issues with NtQueryObject by using our kernel driver to query the name and all this code is only used as fallback when the driver isn't available.

fix this bug [OS] Process.open_files() unsafe API usage can lead to deadlock #1967, which currently affects us
The design of Process.open_files() has a serious design flaw that can lead to a deadlock.

ntdll!LdrpLoaderLock is only used when loading a DLL with functions like LoadLibrary and LdrLoadDll. The psutil_get_filename thread does not call any loader functions so it doesn't make sense how terminating the thread leaks the LdrpLoaderLock?

I tried debugging the psutil code and I think it's very likely the calls to psutil_SetFromNTStatusErr(), PyErr_NoMemory() and PyObject_CallFunction inside the psutil_get_filename function causing issues with thread termination. If these functions attempt to convert the error code to strings (or otherwise use FormatMessageW) then it's using the loader to locate the kernel32/ntdll images and copy the RT_MESSAGETABLE resource and the thread was terminated

If it's not those functions then another likely reason is because the psutil_get_filename function can be terminated while calling HeapAlloc/HeapFree and leaking memory but more importantly leaking the lock for the process default heap. Leaking the heap lock would block the entire process and most system functions from allocating/freeing memory and manifest as unrelated failures elsewhere in the application.

For example: The LoadLibrary function uses worker threads when loading DLLs on Windows 8/10/11 to improve the performance of import table snaps. A single worker thread terminating because it couldn't access the process heap would leak the LdrpLoaderLock and explain the exited thread (worker factory) referencing the LdrpLoaderLock in that other github issue (and obfuscate the real issue which would be using heap functions while terminating the psutil_get_filename thread).

There are also some other issues with the psutil_get_filename function. For reference here is the code for psutil_get_filename:

static int
psutil_get_filename(LPVOID lpvParam) {
HANDLE hFile = *((HANDLE*)lpvParam);
NTSTATUS status;
ULONG bufferSize;
ULONG attempts = 8;
bufferSize = 0x200;
globalFileName = MALLOC_ZERO(bufferSize);
if (globalFileName == NULL) {
PyErr_NoMemory();
goto error;
}
// Note: also this is supposed to hang, hence why we do it in here.
if (GetFileType(hFile) != FILE_TYPE_DISK) {
SetLastError(0);
globalFileName->Length = 0;
return 0;
}
// A loop is needed because the I/O subsystem likes to give us the
// wrong return lengths...
do {
status = NtQueryObject(
hFile,
ObjectNameInformation,
globalFileName,
bufferSize,
&bufferSize
);
if (status == STATUS_BUFFER_OVERFLOW ||
status == STATUS_INFO_LENGTH_MISMATCH ||
status == STATUS_BUFFER_TOO_SMALL)
{
FREE(globalFileName);
globalFileName = MALLOC_ZERO(bufferSize);
if (globalFileName == NULL) {
PyErr_NoMemory();
goto error;
}
}
else {
break;
}
} while (--attempts);
if (! NT_SUCCESS(status)) {
psutil_SetFromNTStatusErr(status, "NtQuerySystemInformation");
FREE(globalFileName);
globalFileName = NULL;
return 1;
}
return 0;
error:
if (globalFileName != NULL) {
FREE(globalFileName);
globalFileName = NULL;
}
return 1;
}

Compare psutil_get_filename to our PhpCommonQueryObjectRoutine:
https://github.com/winsiderss/systeminformer/blob/f17d0dc7c770fb2453bb604c39a43508bd692e07/phlib/hndlinfo.c#L2280C29-L2296

Note how our function doesn't allocate memory for the string and doesn't do anything except call NtQueryObject and pass values directly into a structure? It has to be done this way to safely terminate the thread. The psutil_get_filename function needs to stripped down the exact same way otherwise you'll keep running into problems after terminating the thread.

  • You need to remove HeapAlloc, HeapFree, psutil_SetFromNTStatusErr, PyErr_NoMemory from psutil_get_filename and into psutil_threaded_get_filename to make the psutil_get_filename function safe to terminate.
  • Also move the do { } while (--attempts) loop from psutil_get_filename into psutil_threaded_get_filename (Yes, you need to call CreateThread multiple times for each different memory allocation)
  • Remove SetLastError() from psutil_get_filename. There's no point calling SetLastError and immediately exiting the thread.
  • Remove globalFileName from the global scope and use a structure. You can pass multiple objects via a structure with CreateThread and access those same objects once the thread exits or terminates.
  • Delete EnterCriticalSection(&PSUTIL_CRITICAL_SECTION);... The structure you pass into CreateThread correctly shares the data between those two threads and this critical section is no longer required.

Here's a breakdown of the changes required with examples:


  1. Create a structure for the FileHandle, NTSTATUS, FileName and FileNameSize:
typedef struct _psutil_query_thread
{
  NTSTATUS status;
  HANDLE FileHandle;
  PUNICODE_STRING globalFileName;
  SIZE_T FileNameSize;
} psutil_query_thread;

This is what the psutil_get_filename function should be doing and nothing else:

static int psutil_get_filename(LPVOID lpvParam) 
{
    psutil_query_thread* context = lpvParam;

     context->status = NtQueryObject(
            context->FileHandle,
            ObjectNameInformation,
            context->globalFileName,
            context->FileNameSize,
            &context->FileNameSize
            );

    return 0;
}

Now the psutil_threaded_get_filename function should be written like this:

  • (optional) First check if the handle is a file. Fail if the handle isn't a file.
 if (GetFileType(hFile) != FILE_TYPE_DISK)
{
        return 1;
}
  • Start our query loop...
    do
    {
  • Allocate the structure before calling CreateThread.
psutil_query_thread* threadcontext = malloc()
threadcontext->status = STATUS_UNSUCCESSFUL; // initialize with a failure status
threadcontext->FileHandle = hFile;
threadcontext->globalFileName = malloc(1234);
threadcontext->FileNameSize = 1234;
  • Pass the structure into the psutil_get_filename function using CreateThread:
HANDLE threadHandle = CreateThread((LPTHREAD_START_ROUTINE)psutil_get_filename, threadcontext);
  • Wait for the thread to terminate or timeout:
WaitForSingleObject(threadHandle, THREAD_TIMEOUT);

// Optionally check the thread exit code. It will only return 0 after NtQueryObject completes otherwise STILL_ACTIVE.
GetExitCodeThread() == STILL_ACTIVE

  • Terminate the thread with a failure status (ensures we can safely access the threadcontext)
TerminateThread(threadHandle, 1);
  • The thread was terminated. We can safely access the threadcontext and check the status. If the buffer wasn't large enough then we realloc and call CreateThread again:
        if (threadcontext->status == STATUS_BUFFER_OVERFLOW | threadcontext->status == STATUS_INFO_LENGTH_MISMATCH ||
            threadcontext->status== STATUS_BUFFER_TOO_SMALL)
        {
            free();
            buffer = malloc(); // Increase the buffer and call CreateThread again.
        }
        else
        {
            break; // Success... Break out of our loop.
        }
    } while (--attempts);

If we reach this point we've either made 8 attempts at querying the handle and still failed or we've succeeded and have a valid string.

if (threadcontext->status == STATUS_SUCCESS)
{
    printf("%s", threadcontext->globalFileName->Buffer)
}

free(threadcontext);

use GetFileType to check only disk files. This all leads me to believe the author was not filtering for pipes or directory handles.

We've never run into deadlocks with GetFileType/NtQueryVolumeInformationFile and you won't deadlock querying files or directories only when you query handles for devices like NPFS (FILE_DEVICE_NAMED_PIPE), ConDrv (FILE_DEVICE_CONSOLE) or VolMgrControl.

Process Hacker is designed to show handles so we don't have the luxury of limiting the query. If psutil Process.open_files is only supposed to return files then you should limit the filename query to FILE_DEVICE_DISK and have a separate method like Process.open_handles for handles.

CC @dmex in case they have some comment on this (I hope you won't mind my CC).

Feel free to @ me if you get stuck fixing psutil_get_filename or need more info about what Process Hacker is doing (Btw Process Hacker was renamed System Informer so I'll be referring to SI instead of PH going forward) 😜

There are some other methods that I haven't looked into to avoid deadlocks like process reflection. These processes can't execute and we recently started using reflection to avoid some other types of deadlocks with running processes and querying handles via reflection might also work. Let me know how you go with all the above wall of text spam and if you're still having issues then let me know and I'll see if our snapshotting/reflection stuff also avoids the deadlocks with NtQueryObject.

-dmex

@giampaolo
Copy link
Owner

giampaolo commented Jan 10, 2023

Thanks for the amazing write up @dmex. It's an honor to meet you here. Needless to say, PH / SI inspired a lot of the psutil development on Windows, so I want to take this opportunity to finally say hi and thank you! I believe @wj32 even contributed some code to psutil many (>10) years ago.

Feel free to @ me if you get stuck fixing psutil_get_filename or need more info about what Process Hacker is doing

I appreciate it. It's a lot of info to digest, but the overall suggestions appear clear.

Btw Process Hacker was renamed System Informer so I'll be referring to SI instead of PH going forward

That surprised me. PH is so etched into my brain that it will take me a while to get used to the new name. :)

@giampaolo
Copy link
Owner

giampaolo commented Jan 10, 2023

If psutil Process.open_files is only supposed to return files then you should limit the filename query to FILE_DEVICE_DISK and have a separate method like Process.open_handles for handles.

Yes, we only return files (paths). No handles.

@jschwartzentruber
Copy link
Author

I've implemented the ideas from @dmex, but I still see deadlock. Although less often, it's still fairly easy to reproduce using the steps in #1967 (using 32-bit Python 3.7 on Windows Vista). I can't reproduce on 64-bit Python 3.10 Windows 11, but I couldn't using GetFinalPathNameByHandle either.

!locks still shows LdrpLoaderLock.

@dmex
Copy link

dmex commented Jan 11, 2023

@jschwartzentruber

I was able to resolve the deadlock issues with NtQueryObject by using the ReOpenFile function to clone/recreate the internal file_object referenced by handle.

Try using this function to query the handles with psutil (It's not 100% complete but should be straightforward enough for testing):

BOOLEAN QueryHandleFileName(
    _In_ HANDLE ProcessHandle,
    _In_ HANDLE Handle
    )
{
    BOOLEAN handleNeedsFree = FALSE;
    HANDLE dupHandle = NULL;
    HANDLE safeHandle = NULL;

    if (ProcessHandle == GetCurrentProcess())
    {
        dupHandle = Handle;
    }
    else
    {
        if (!DuplicateHandle(
            ProcessHandle,
            Handle,
            GetCurrentProcess(),
            &dupHandle,
            0,
            FALSE,
            0
            ))
        {
            return FALSE;
        }

        handleNeedsFree = TRUE;
    }

    safeHandle = ReOpenFile(dupHandle,
        0,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        0 // FILE_FLAG_OVERLAPPED
        );

    if (safeHandle != INVALID_HANDLE_VALUE) // && (GetLastError() != ERROR_INVALID_PARAMETER)
    {
        ULONG bufferSize = sizeof(OBJECT_NAME_INFORMATION) + (MAXIMUM_FILENAME_LENGTH * sizeof(WCHAR));
        POBJECT_NAME_INFORMATION buffer = malloc(bufferSize); memset(buffer, 0, bufferSize);

        if (NT_SUCCESS(NtQueryObject(
            safeHandle,
            ObjectNameInformation,
            buffer,
            bufferSize,
            &bufferSize
            )))
        {
            printf("Success: %wZ\n", buffer);
            //return TRUE;
        }

        CloseHandle(safeHandle);
        free(buffer);
    }

    if (handleNeedsFree)
    {
        CloseHandle(dupHandle);
    }

    return FALSE;
}

@jschwartzentruber
Copy link
Author

I was able to resolve the deadlock issues with NtQueryObject

@dmex thanks again! I don't see deadlock either with your change.

@giampaolo giampaolo marked this pull request as ready for review January 11, 2023 16:01
@jschwartzentruber
Copy link
Author

On further testing, this doesn't filter for pipes. I'm not seeing it hang/deadlock, but I do see OSError originated from NtQueryObject.

On Vista I see [WinError 233] No process is on the other end of the pipe and on Windows 11 I see [WinError 50] The request is not supported.

@jschwartzentruber
Copy link
Author

Above is visible by running python -c "import psutil; [proc.as_dict() for proc in psutil.process_iter()]" as administrator. Note that the current commit will say NtQuerySystemInformation in the exception because of copy-paste here, it's actually from NtQueryObject.

@dmex
Copy link

dmex commented Jan 16, 2023

On further testing, this doesn't filter for pipes. I'm not seeing it hang/deadlock, but I do see OSError originated from NtQueryObject.

@jschwartzentruber That would be the expected behavior. Named pipes are created using NtCreateNamedPipeFile (npfs.sys) instead of NtCreateFile and are 'connected' by chaining each RootDirectory to the new handle. The ReOpenFile function creates an entirely new FILE_OBJECT that doesn't have the same RelatedFileObject/FileObjectExtension required for named pipes:

ntdll!_FILE_OBJECT
   +0x000 Type             : Int2B
   +0x002 Size             : Int2B
   +0x008 DeviceObject     : Ptr64 _DEVICE_OBJECT
   +0x010 Vpb              : Ptr64 _VPB
   +0x018 FsContext        : Ptr64 Void
   +0x020 FsContext2       : Ptr64 Void
   +0x028 SectionObjectPointer : Ptr64 _SECTION_OBJECT_POINTERS
   +0x030 PrivateCacheMap  : Ptr64 Void
   +0x038 FinalStatus      : Int4B
   +0x040 RelatedFileObject : Ptr64 _FILE_OBJECT
   +0x048 LockOperation    : UChar
   +0x049 DeletePending    : UChar
   +0x04a ReadAccess       : UChar
   +0x04b WriteAccess      : UChar
   +0x04c DeleteAccess     : UChar
   +0x04d SharedRead       : UChar
   +0x04e SharedWrite      : UChar
   +0x04f SharedDelete     : UChar
   +0x050 Flags            : Uint4B
   +0x058 FileName         : _UNICODE_STRING
   +0x068 CurrentByteOffset : _LARGE_INTEGER
   +0x070 Waiters          : Uint4B
   +0x074 Busy             : Uint4B
   +0x078 LastLock         : Ptr64 Void
   +0x080 Lock             : _KEVENT
   +0x098 Event            : _KEVENT
   +0x0b0 CompletionContext : Ptr64 _IO_COMPLETION_CONTEXT
   +0x0b8 IrpListLock      : Uint8B
   +0x0c0 IrpList          : _LIST_ENTRY
   +0x0d0 FileObjectExtension : Ptr64 Void

If psutil Process.open_files is only supposed to return files then you should limit the filename query to FILE_DEVICE_DISK and have a separate method like Process.open_handles for handles.
Yes, we only return files (paths). No handles.

This is why I asked if Process.open_files was only supposed to return files because the ReOpenFile method won't work for named pipes. There's two options:

  1. GetFileType and exclude pipes (ReOpenFile is doing this already)
  2. NtQueryInformationFile with FilePipeLocalInformation and check NamedPipeEnd==FILE_PIPE_CLIENT_END or FILE_PIPE_SERVER_END. The server side of the pipe blocks when using NtQueryObject and requires using a thread.

Remember that DuplicateHandle continues referencing the existing FILE_OBJECT and NtQueryObject deadlocks because the Lock and Event of the FILE_OBJECT is already inuse by the original process owning the handle. ReOpenFile creates a completely new FILE_OBJECT with a new Lock and Event that isn't shared (or held/inuse) which is why it avoids deadlocking when calling NtQueryObject.

If you need to query FILE_DEVICE_NAMED_PIPE or FILE_DEVICE_CONSOLE then ReOpenFile won't work because those file objects are 'extensions' meaning you'll still have to fallback to the original handle and a thread in those cases.

I've implemented the ideas from @dmex, but I still see deadlock. Although less often,

Can you enable loader snaps and find the DLL responsible for the lock?

  1. Create a IFEO key for python/psutil. For example:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\processhacker.exe]
"GlobalFlag"=dword:00000002

The debugger will show the loader output. DebugView from Sysinternals can help if python doesn't:

image

Looking at you changes here: 3eba2d3

The psutil_get_filename thread is doing absolutely nothing except calling NtQueryObject just like we're doing... NtQueryObject simply does a syscall instruction and exits so there's nothing there that can load a DLL onto that thread and terminating it shouldn't be an issue. There has to be something else going on here to be causing the problem unrelated to psutil_get_filename?

@xenu
Copy link

xenu commented Jul 19, 2023

(I don't use psutil, I found this thread in Google)

We had a similar problem in Perl, we needed a way to tell if a HANDLE is a named pipe without deadlocking. We had no luck with almost all of NtQueryInformationFile and NtQueryVolumeInformationFile classes, but we found one that works: FileVolumeNameInformation. It's a new API, it was added in Win10 and it's undocumented.

https://github.com/Perl/perl5/blob/caef5a7dffddb63ccd397171f834d8f5122356ee/win32/win32.c#L5279

@dmex
Copy link

dmex commented Aug 9, 2023

There's also a third option using the built-in Windows rundown support?

Does not require a kernel driver, opening handles to processes or handle duplication so it's able to support protected processes that would otherwise require a driver. Rundown is included with all versions of Windows and can be used to enumerate handles, threads, images (dlls), processes, heap etc...

image

Also supports stack traces (even stack traces from protected processes/objects/threads):

image

@PolarGoose
Copy link

PolarGoose commented Apr 26, 2024

@dmex,
What is rundown?
Could you please provide more information and a reference code?

Also, I have tried to use ReOpenFile as you proposed in your comment.
It works very well. But it only works for files, not folders.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[OS] Process.open_files() unsafe API usage can lead to deadlock
6 participants