
gh-104324: Reinstate GIL cleanup during interpreter deletion. #104325

Draft · wants to merge 2 commits into main
Conversation

@adr26 (Contributor) commented May 9, 2023

Reinstate GIL cleanup during interpreter deletion, after all threads (including daemons, as per bpo-9901) have been destroyed.

This ensures the GIL is correctly destroyed before being re-initialised.

@bedevere-bot

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@cpython-cla-bot commented May 9, 2023

All commit authors signed the Contributor License Agreement.

@ericsnowcurrently self-requested a review May 9, 2023 15:49
@adr26 (Contributor, Author) commented May 10, 2023

Hi @ericsnowcurrently,

I was lulled into a false sense of security by the comments on zapthreads() saying that all the threads would be killed by that point, sorry(!)...

I checked the Azure failure and it's in test_threading. I ran it a couple more times and was able to reproduce the failure under windbg:

(32e8.8ca4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlpWaitOnCriticalSection+0xaa:
00007ffe`05823aba ff4024          inc     dword ptr [rax+24h] ds:00000000`00000024=????????
0:006> kn
 # Child-SP          RetAddr               Call Site
00 0000004d`e258f020 00007ffe`058118d4     ntdll!RtlpWaitOnCriticalSection+0xaa
01 0000004d`e258f130 00007ffe`058116c2     ntdll!RtlpEnterCriticalSectionContended+0x204
*** WARNING: Unable to verify checksum for C:\src\cpython\PCbuild\amd64\python312.dll
02 0000004d`e258f1b0 00007ffd`43f0bb0d     ntdll!RtlEnterCriticalSection+0x42
03 (Inline Function) --------`--------     python312!PyMUTEX_LOCK+0xb [C:\src\cpython\Python\condvar.h @ 125] 
04 0000004d`e258f1e0 00007ffd`43f0bd6b     python312!_PyCOND_WAIT_MS+0x3d [C:\src\cpython\Python\condvar.h @ 172] 
05 (Inline Function) --------`--------     python312!PyCOND_TIMEDWAIT+0x1f [C:\src\cpython\Python\condvar.h @ 201] 
06 0000004d`e258f210 00007ffd`43d72179     python312!take_gil+0xab [C:\src\cpython\Python\ceval_gil.c @ 381] 
07 (Inline Function) --------`--------     python312!PyEval_RestoreThread+0x11 [C:\src\cpython\Python\ceval_gil.c @ 711] 

The daemon worker threads are blowing up inside COND_TIMED_WAIT():

    int drop_requested = 0;
    while (_Py_atomic_load_relaxed(&gil->locked)) {
        unsigned long saved_switchnum = gil->switch_number;

        unsigned long interval = (gil->interval >= 1 ? gil->interval : 1);
        int timed_out = 0;
        COND_TIMED_WAIT(gil->cond, gil->mutex, interval, timed_out);

        /* If we timed out and no switch occurred in the meantime, it is time
           to ask the GIL-holding thread to drop it. */
        if (timed_out &&
            _Py_atomic_load_relaxed(&gil->locked) &&
            gil->switch_number == saved_switchnum)
        {
            if (tstate_must_exit(tstate)) {
                MUTEX_UNLOCK(gil->mutex);
                // gh-96387: If the loop requested a drop request in a previous
                // iteration, reset the request. Otherwise, drop_gil() can
                // block forever waiting for the thread which exited. Drop
                // requests made by other threads are also reset: these threads
                // may have to request again a drop request (iterate one more
                // time).
                if (drop_requested) {
                    RESET_GIL_DROP_REQUEST(interp);
                }
                PyThread_exit_thread();
            }
            assert(is_tstate_valid(tstate));

            SET_GIL_DROP_REQUEST(interp);
            drop_requested = 1;
        }
    }

as the main thread has torn down the CS used as the GIL mutex after the worker's been woken, but before it tries to re-acquire the mutex:

0:006> ~0kn
 # Child-SP          RetAddr               Call Site
00 0000004d`e19ef420 00007ffe`058116c2     ntdll!RtlpEnterCriticalSectionContended+0xfa
01 0000004d`e19ef4a0 00007ffe`058f3b0c     ntdll!RtlEnterCriticalSection+0x42
02 0000004d`e19ef4d0 00007ffe`058ad0f0     ntdll!RtlDebugAllocateHeap+0xdc
03 0000004d`e19ef570 00007ffe`0582cd49     ntdll!RtlpAllocateHeap+0x7e4b0
04 0000004d`e19ef7d0 00007ffe`03a8f486     ntdll!RtlpAllocateHeapInternal+0x6c9
05 0000004d`e19ef8d0 00007ffe`03a8f3ab     WS2_32!DTHREAD::CreateDThreadForCurrentThread+0x46
06 0000004d`e19ef910 00007ffe`03a913e2     WS2_32!Prolog_v1+0x5b
07 0000004d`e19ef940 00007ffd`43f32ecb     WS2_32!WSACleanup+0x52
08 (Inline Function) --------`--------     python312!call_ll_exitfuncs+0x2e [C:\src\cpython\Python\pylifecycle.c @ 2971] 
09 0000004d`e19ef9c0 00007ffd`43f3507d     python312!Py_FinalizeEx+0x36b [C:\src\cpython\Python\pylifecycle.c @ 1960] 
0a 0000004d`e19efa30 00007ffd`43f5d9f4     python312!Py_Exit+0xd [C:\src\cpython\Python\pylifecycle.c @ 2981] 
0b (Inline Function) --------`--------     python312!handle_system_exit+0x446 [C:\src\cpython\Python\pythonrun.c @ 756] 
0c 0000004d`e19efa60 00007ffd`43f5ca37     python312!_PyErr_PrintEx+0x464 [C:\src\cpython\Python\pythonrun.c @ 765] 
0d 0000004d`e19efb00 00007ffd`43d3fba1     python312!_PyRun_SimpleFileObject+0x4d7 [C:\src\cpython\Python\pythonrun.c @ 440] 
0e (Inline Function) --------`--------     python312!_PyRun_AnyFileObject+0xa8 [C:\src\cpython\Python\pythonrun.c @ 78] 
0f 0000004d`e19efb90 00007ffd`43d406c1     python312!pymain_run_file_obj+0x291 [C:\src\cpython\Modules\main.c @ 360] 
10 (Inline Function) --------`--------     python312!pymain_run_file+0x7e [C:\src\cpython\Modules\main.c @ 379] 
11 0000004d`e19efc60 00007ffd`43d40a08     python312!pymain_run_python+0x571 [C:\src\cpython\Modules\main.c @ 610] 
12 0000004d`e19efce0 00007ffd`43d40a92     python312!Py_RunMain+0x18 [C:\src\cpython\Modules\main.c @ 691] 
*** WARNING: Unable to verify checksum for python.exe
13 (Inline Function) --------`--------     python312!pymain_main+0x4b [C:\src\cpython\Modules\main.c @ 719] 
14 0000004d`e19efd10 00007ff6`6d9a1230     python312!Py_Main+0x52 [C:\src\cpython\Modules\main.c @ 732] 
15 (Inline Function) --------`--------     python!invoke_main+0x22 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 90] 
16 0000004d`e19efd80 00007ffe`053c26ad     python!__scrt_common_main_seh+0x10c [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
17 0000004d`e19efdc0 00007ffe`0584a9f8     KERNEL32!BaseThreadInitThunk+0x1d
18 0000004d`e19efdf0 00000000`00000000     ntdll!RtlUserThreadStart+0x28
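
For what it's worth, the failure pattern is easy to reproduce outside CPython. The standalone Win32 sketch below (my own illustration, not CPython code, and timing-dependent, so it may take a few runs to fault) deletes a critical section while a second thread is waking from a condition-variable wait and about to re-acquire it, which is exactly the window the worker above hit:

    /* Standalone illustration (not CPython code) of the failure
       pattern: thread B is signalled out of a condition-variable
       wait, and thread A deletes the underlying critical section
       before B can re-acquire it.  Behaviour is undefined; under a
       debugger it faults much like the trace above. */
    #include <windows.h>

    static CRITICAL_SECTION cs;
    static CONDITION_VARIABLE cv;

    static DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        EnterCriticalSection(&cs);
        /* Analogue of COND_TIMED_WAIT(): releases cs while waiting,
           then re-acquires it on wakeup/timeout. */
        SleepConditionVariableCS(&cv, &cs, 100);
        /* If the main thread ran DeleteCriticalSection() between our
           wakeup and the internal re-acquire, we have already crashed
           inside SleepConditionVariableCS(). */
        LeaveCriticalSection(&cs);
        return 0;
    }

    int main(void)
    {
        InitializeCriticalSection(&cs);
        InitializeConditionVariable(&cv);
        HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        Sleep(50);                    /* let the worker block in the wait */
        WakeConditionVariable(&cv);   /* signal it ...                    */
        DeleteCriticalSection(&cs);   /* ... then tear the mutex down     */
        WaitForSingleObject(h, INFINITE);
        return 0;
    }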

I can see how deferring the deletion of the GIL mutex/condition variable to the next Py_Initialize() handles the rundown of daemon threads in the single-interpreter / process-exit case (where there isn't another Py_Initialize()!), but deferring the reinitialisation just kicks the can down the road: it doesn't solve the inherent problem of synchronising with daemon threads.

In the case where you do:

    Py_Initialize();
    <create daemon thread>;
    Py_FinalizeEx();
    Py_Initialize();

Suppose (as in the worker callstack above) a daemon thread is signalled but doesn't get to run until after the main thread has called Py_FinalizeEx(); Py_Initialize();.

Now, setting aside whether destroy_gil() has been called (the question I originally opened this issue for!), the second Py_Initialize() will re-initialise the state of the mutex. But it does so concurrently with, and unsynchronised against, the daemon thread trying to acquire that same mutex.

It seems to me that the mutex and condition variable need to be allocated in some form of context which can be reference-counted. Daemon threads could own a reference to this context and then when they detect (via tstate_must_exit()) that they need to exit, they can release their reference to the context block. The last man standing could then free the underlying OS mutex & condition variable.
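
A rough sketch of what I have in mind, with hypothetical names (none of gil_context / gil_context_* exists in CPython today; this is illustration only):

    /* Hypothetical reference-counted context for the GIL's OS
       primitives. */
    #include <stdlib.h>
    #include <windows.h>

    typedef struct {
        volatile LONG refcnt;
        CRITICAL_SECTION mutex;
        CONDITION_VARIABLE cond;
    } gil_context;

    static gil_context *gil_context_new(void)
    {
        gil_context *ctx = malloc(sizeof(*ctx));
        if (ctx == NULL) {
            return NULL;
        }
        ctx->refcnt = 1;  /* the interpreter's own reference */
        InitializeCriticalSection(&ctx->mutex);
        InitializeConditionVariable(&ctx->cond);
        return ctx;
    }

    /* Each daemon thread takes a reference before waiting on the
       GIL ... */
    static void gil_context_incref(gil_context *ctx)
    {
        InterlockedIncrement(&ctx->refcnt);
    }

    /* ... and drops it once tstate_must_exit() tells it to bail out.
       Py_FinalizeEx() drops the interpreter's reference; whichever
       release comes last frees the OS objects, so teardown can no
       longer race with a thread still blocked on the old mutex. */
    static void gil_context_decref(gil_context *ctx)
    {
        if (InterlockedDecrement(&ctx->refcnt) == 0) {
            DeleteCriticalSection(&ctx->mutex);
            /* Win32 condition variables need no explicit destruction. */
            free(ctx);
        }
    }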

There is also a need to be able to call PyGILState_Ensure() (cf. this comment or bpo-20891) outside the context of a Python thread. In that case, the responsibility for synchronising against Py_FinalizeEx() / Py_Initialize() falls on the caller (cf. PEP 311)...
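
For that embedding case, caller-side synchronisation might look something like this (a sketch only; embedder_lock and runtime_alive are hypothetical application-level state, not CPython API):

    #include <Python.h>

    /* Hypothetical embedder-owned state, created at process startup
       (e.g. embedder_lock = PyThread_allocate_lock()). */
    static PyThread_type_lock embedder_lock;
    static int runtime_alive;  /* toggled around Py_Initialize()/Py_FinalizeEx() */

    void call_python_from_foreign_thread(void)
    {
        /* The embedder, not CPython, guarantees the runtime is not
           finalized while this foreign thread is inside it. */
        PyThread_acquire_lock(embedder_lock, WAIT_LOCK);
        if (runtime_alive) {
            PyGILState_STATE s = PyGILState_Ensure();
            PyRun_SimpleString("print('hello from a foreign thread')");
            PyGILState_Release(s);
        }
        PyThread_release_lock(embedder_lock);
    }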

Do you think this is a viable way forward, in order to resolve this crash?
