
gh-104324: Reinstate GIL cleanup during interpreter deletion. #104325

Draft · wants to merge 2 commits into main
Conversation

@adr26 (Contributor) commented May 9, 2023

Reinstate GIL cleanup during interpreter deletion, after all threads (including daemons, as per bpo-9901) have been destroyed.

This ensures the GIL is correctly destroyed before being re-initialised.

@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@cpython-cla-bot commented May 9, 2023

All commit authors signed the Contributor License Agreement.

@ericsnowcurrently self-requested a review May 9, 2023 15:49
@adr26 (Contributor, Author) commented May 10, 2023

Hi @ericsnowcurrently,

I was lulled into a false sense of security by the comments on zapthreads() saying that all the threads would be killed by that point, sorry(!)...

I checked the Azure failure and it's in test_threading. I ran it a couple more times and was able to reproduce the failure under windbg:

(32e8.8ca4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlpWaitOnCriticalSection+0xaa:
00007ffe`05823aba ff4024          inc     dword ptr [rax+24h] ds:00000000`00000024=????????
0:006> kn
 # Child-SP          RetAddr               Call Site
00 0000004d`e258f020 00007ffe`058118d4     ntdll!RtlpWaitOnCriticalSection+0xaa
01 0000004d`e258f130 00007ffe`058116c2     ntdll!RtlpEnterCriticalSectionContended+0x204
*** WARNING: Unable to verify checksum for C:\src\cpython\PCbuild\amd64\python312.dll
02 0000004d`e258f1b0 00007ffd`43f0bb0d     ntdll!RtlEnterCriticalSection+0x42
03 (Inline Function) --------`--------     python312!PyMUTEX_LOCK+0xb [C:\src\cpython\Python\condvar.h @ 125] 
04 0000004d`e258f1e0 00007ffd`43f0bd6b     python312!_PyCOND_WAIT_MS+0x3d [C:\src\cpython\Python\condvar.h @ 172] 
05 (Inline Function) --------`--------     python312!PyCOND_TIMEDWAIT+0x1f [C:\src\cpython\Python\condvar.h @ 201] 
06 0000004d`e258f210 00007ffd`43d72179     python312!take_gil+0xab [C:\src\cpython\Python\ceval_gil.c @ 381] 
07 (Inline Function) --------`--------     python312!PyEval_RestoreThread+0x11 [C:\src\cpython\Python\ceval_gil.c @ 711] 

The daemon worker threads are blowing up inside COND_TIMED_WAIT():

    int drop_requested = 0;
    while (_Py_atomic_load_relaxed(&gil->locked)) {
        unsigned long saved_switchnum = gil->switch_number;

        unsigned long interval = (gil->interval >= 1 ? gil->interval : 1);
        int timed_out = 0;
        COND_TIMED_WAIT(gil->cond, gil->mutex, interval, timed_out);

        /* If we timed out and no switch occurred in the meantime, it is time
           to ask the GIL-holding thread to drop it. */
        if (timed_out &&
            _Py_atomic_load_relaxed(&gil->locked) &&
            gil->switch_number == saved_switchnum)
        {
            if (tstate_must_exit(tstate)) {
                MUTEX_UNLOCK(gil->mutex);
                // gh-96387: If the loop requested a drop request in a previous
                // iteration, reset the request. Otherwise, drop_gil() can
                // block forever waiting for the thread which exited. Drop
                // requests made by other threads are also reset: these threads
                // may have to request again a drop request (iterate one more
                // time).
                if (drop_requested) {
                    RESET_GIL_DROP_REQUEST(interp);
                }
                PyThread_exit_thread();
            }
            assert(is_tstate_valid(tstate));

            SET_GIL_DROP_REQUEST(interp);
            drop_requested = 1;
        }
    }

as the main thread has torn down the CS used as the GIL mutex after the worker's been woken, but before it tries to re-acquire the mutex:

0:006> ~0kn
 # Child-SP          RetAddr               Call Site
00 0000004d`e19ef420 00007ffe`058116c2     ntdll!RtlpEnterCriticalSectionContended+0xfa
01 0000004d`e19ef4a0 00007ffe`058f3b0c     ntdll!RtlEnterCriticalSection+0x42
02 0000004d`e19ef4d0 00007ffe`058ad0f0     ntdll!RtlDebugAllocateHeap+0xdc
03 0000004d`e19ef570 00007ffe`0582cd49     ntdll!RtlpAllocateHeap+0x7e4b0
04 0000004d`e19ef7d0 00007ffe`03a8f486     ntdll!RtlpAllocateHeapInternal+0x6c9
05 0000004d`e19ef8d0 00007ffe`03a8f3ab     WS2_32!DTHREAD::CreateDThreadForCurrentThread+0x46
06 0000004d`e19ef910 00007ffe`03a913e2     WS2_32!Prolog_v1+0x5b
07 0000004d`e19ef940 00007ffd`43f32ecb     WS2_32!WSACleanup+0x52
08 (Inline Function) --------`--------     python312!call_ll_exitfuncs+0x2e [C:\src\cpython\Python\pylifecycle.c @ 2971] 
09 0000004d`e19ef9c0 00007ffd`43f3507d     python312!Py_FinalizeEx+0x36b [C:\src\cpython\Python\pylifecycle.c @ 1960] 
0a 0000004d`e19efa30 00007ffd`43f5d9f4     python312!Py_Exit+0xd [C:\src\cpython\Python\pylifecycle.c @ 2981] 
0b (Inline Function) --------`--------     python312!handle_system_exit+0x446 [C:\src\cpython\Python\pythonrun.c @ 756] 
0c 0000004d`e19efa60 00007ffd`43f5ca37     python312!_PyErr_PrintEx+0x464 [C:\src\cpython\Python\pythonrun.c @ 765] 
0d 0000004d`e19efb00 00007ffd`43d3fba1     python312!_PyRun_SimpleFileObject+0x4d7 [C:\src\cpython\Python\pythonrun.c @ 440] 
0e (Inline Function) --------`--------     python312!_PyRun_AnyFileObject+0xa8 [C:\src\cpython\Python\pythonrun.c @ 78] 
0f 0000004d`e19efb90 00007ffd`43d406c1     python312!pymain_run_file_obj+0x291 [C:\src\cpython\Modules\main.c @ 360] 
10 (Inline Function) --------`--------     python312!pymain_run_file+0x7e [C:\src\cpython\Modules\main.c @ 379] 
11 0000004d`e19efc60 00007ffd`43d40a08     python312!pymain_run_python+0x571 [C:\src\cpython\Modules\main.c @ 610] 
12 0000004d`e19efce0 00007ffd`43d40a92     python312!Py_RunMain+0x18 [C:\src\cpython\Modules\main.c @ 691] 
*** WARNING: Unable to verify checksum for python.exe
13 (Inline Function) --------`--------     python312!pymain_main+0x4b [C:\src\cpython\Modules\main.c @ 719] 
14 0000004d`e19efd10 00007ff6`6d9a1230     python312!Py_Main+0x52 [C:\src\cpython\Modules\main.c @ 732] 
15 (Inline Function) --------`--------     python!invoke_main+0x22 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 90] 
16 0000004d`e19efd80 00007ffe`053c26ad     python!__scrt_common_main_seh+0x10c [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
17 0000004d`e19efdc0 00007ffe`0584a9f8     KERNEL32!BaseThreadInitThunk+0x1d
18 0000004d`e19efdf0 00000000`00000000     ntdll!RtlUserThreadStart+0x28
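
For what it's worth, the failure pattern is easy to reproduce outside CPython. The standalone Win32 sketch below (my own illustration, not CPython code, and timing-dependent, so it may take a few runs to fault) deletes a critical section while a second thread is waking from a condition-variable wait and about to re-acquire it, which is exactly the window the worker above hit:

    /* Standalone illustration (not CPython code) of the failure
       pattern: thread B is signalled out of a condition-variable
       wait, and thread A deletes the underlying critical section
       before B can re-acquire it.  Behaviour is undefined; under a
       debugger it faults much like the trace above. */
    #include <windows.h>

    static CRITICAL_SECTION cs;
    static CONDITION_VARIABLE cv;

    static DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        EnterCriticalSection(&cs);
        /* Analogue of COND_TIMED_WAIT(): releases cs while waiting,
           then re-acquires it on wakeup/timeout. */
        SleepConditionVariableCS(&cv, &cs, 100);
        /* If the main thread ran DeleteCriticalSection() between our
           wakeup and the internal re-acquire, we have already crashed
           inside SleepConditionVariableCS(). */
        LeaveCriticalSection(&cs);
        return 0;
    }

    int main(void)
    {
        InitializeCriticalSection(&cs);
        InitializeConditionVariable(&cv);
        HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        Sleep(50);                    /* let the worker block in the wait */
        WakeConditionVariable(&cv);   /* signal it ...                    */
        DeleteCriticalSection(&cs);   /* ... then tear the mutex down     */
        WaitForSingleObject(h, INFINITE);
        return 0;
    }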

I can see how deferring the deletion of the GIL mutex/condition variable to the next Py_Initialize() handles the rundown of daemon threads in the single-interpreter / process-exit case (where there isn't another Py_Initialize()!), but deferring the reinitialisation just kicks the can down the road: it doesn't solve the inherent problem of synchronising with daemon threads.

In the case where you do:

    Py_Initialize();
    <create daemon thread>;
    Py_FinalizeEx();
    Py_Initialize();

Suppose (as in the worker callstack above) a daemon thread is signalled but doesn't get to run until after the main thread has called Py_FinalizeEx(); Py_Initialize();.

Now, setting aside whether destroy_gil() has been called (the question I originally opened this issue for!), the second Py_Initialize() will re-initialise the state of the mutex. But it does so concurrently with, and unsynchronised against, the daemon thread trying to acquire that same mutex.

It seems to me that the mutex and condition variable need to be allocated in some form of context which can be reference-counted. Daemon threads could own a reference to this context and then when they detect (via tstate_must_exit()) that they need to exit, they can release their reference to the context block. The last man standing could then free the underlying OS mutex & condition variable.
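
A rough sketch of what I have in mind, with hypothetical names (none of gil_context / gil_context_* exists in CPython today; this is illustration only):

    /* Hypothetical reference-counted context for the GIL's OS
       primitives. */
    #include <stdlib.h>
    #include <windows.h>

    typedef struct {
        volatile LONG refcnt;
        CRITICAL_SECTION mutex;
        CONDITION_VARIABLE cond;
    } gil_context;

    static gil_context *gil_context_new(void)
    {
        gil_context *ctx = malloc(sizeof(*ctx));
        if (ctx == NULL) {
            return NULL;
        }
        ctx->refcnt = 1;  /* the interpreter's own reference */
        InitializeCriticalSection(&ctx->mutex);
        InitializeConditionVariable(&ctx->cond);
        return ctx;
    }

    /* Each daemon thread takes a reference before waiting on the
       GIL ... */
    static void gil_context_incref(gil_context *ctx)
    {
        InterlockedIncrement(&ctx->refcnt);
    }

    /* ... and drops it once tstate_must_exit() tells it to bail out.
       Py_FinalizeEx() drops the interpreter's reference; whichever
       release comes last frees the OS objects, so teardown can no
       longer race with a thread still blocked on the old mutex. */
    static void gil_context_decref(gil_context *ctx)
    {
        if (InterlockedDecrement(&ctx->refcnt) == 0) {
            DeleteCriticalSection(&ctx->mutex);
            /* Win32 condition variables need no explicit destruction. */
            free(ctx);
        }
    }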

There is also a need to be able to call PyGILState_Ensure() (cf. this comment or bpo-20891) outside the context of a Python thread. In that case, the responsibility for synchronising against Py_FinalizeEx() / Py_Initialize() falls on the caller (cf. PEP 311)...
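
For that embedding case, caller-side synchronisation might look something like this (a sketch only; embedder_lock and runtime_alive are hypothetical application-level state, not CPython API):

    #include <Python.h>

    /* Hypothetical embedder-owned state, created at process startup
       (e.g. embedder_lock = PyThread_allocate_lock()). */
    static PyThread_type_lock embedder_lock;
    static int runtime_alive;  /* toggled around Py_Initialize()/Py_FinalizeEx() */

    void call_python_from_foreign_thread(void)
    {
        /* The embedder, not CPython, guarantees the runtime is not
           finalized while this foreign thread is inside it. */
        PyThread_acquire_lock(embedder_lock, WAIT_LOCK);
        if (runtime_alive) {
            PyGILState_STATE s = PyGILState_Ensure();
            PyRun_SimpleString("print('hello from a foreign thread')");
            PyGILState_Release(s);
        }
        PyThread_release_lock(embedder_lock);
    }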

Do you think this is a viable way forward, in order to resolve this crash?
