[BUG]: Python API calls in destructors are unsafe outside the main thread #3274

jbms · 2021-09-15T19:19:38Z

Required prerequisites

Make sure you've read the documentation. Your issue may be addressed there.
Search the issue tracker and Discussions to verify that this hasn't already been reported. +1 or comment there if it has.
Consider asking first in the Gitter chat room or in a Discussion.

Problem description

I observed some random crashes while the Python interpreter is exiting normally, when there are threads created from C++ that invoke Python code immediately before the exit (this thread "correctly" uses gil_scoped_acquire while invoking the Python code).

After some debugging, I believe the root cause is the following:

When Python exits the Py_FinalizeEx function is called. After doing some initial work with the Python interpreter still in a valid state, it initiates the real shutdown by calling _PyRuntimeState_SetFinalizing:
https://github.com/python/cpython/blob/09b4ad11f323f8702cde795e345b75e0fbb1a9a5/Python/pylifecycle.c#L1748
At this point, the main thread calling Py_FinalizeEx holds the GIL and therefore no other threads can be performing any Python API calls. However, once _PyRuntimeState_SetFinalizing is called, other threads are are no longer blocked waiting for the GIL. Instead, when they attempt to acquire the GIL, pthread_exit is called.
On Linux with GCC and glibc, pthread_exit appears to behave as if a C++ exception of type __forced_unwind is thrown.
That means that by default, if there are any C++ frames in the call stack, their destructors will run. In particular, if there are any pybind11::object local variables, their destructors will invoke Py_DECREF without holding the GIL, which can lead to a crash. If there are any noexcept functions in the call stack, std::terminate will be called, which terminates the program with an error.

This issue is mentioned in this Python bug report as well:
https://bugs.python.org/issue42969

I'm not sure what happens on platforms other than Linux with glibc and GCC, but I suspect the behavior may be similar.

As I see it, this is really a bug in Python itself, but due to pybind11's use of C++ destructors to call Py_DECREF, the problem is much more apparent when using pybind11 (or any similar C++ RAII wrapper). When using pure C code, the thread will likely exit without doing any cleanup, which may be harmless in most cases.

Unfortunately I don't see an easy workaround. We need to ensure Py_DECREF or any other Python APIs are not called from a destructor while Python is finalizing. There are two ways we could go about that:

a. Checking _Py_IsFinalizing() from the pybind1::object destructor (and just skip calling Py_DECREF in that case) would work I think. Unfortunately _Py_IsFinalizing is not inline, so the cost would not be negligible.
b. Alternatively, we could attempt to block the unwinding itself, and just make the thread hang. We could wrap the call to Python API functions with a try { } catch (...) {sleep_forever();} block. However, there are a very large number of Python C API functions that can potentially lead to a call to pthread_exit. In addition to the API calls that directly acquire the GIL, any API call that can trigger user Python code (including any call to Py_DECREF), could also trigger pthread_exit, because that user Python code may release and then re-acquire the GIL.

I think (a) would definitely be the most practical option.

We would also need to document the fact that users should not invoke any Python API functions from destructors without checking _Py_IsFinalizing.

Reproducible example code

No response

The text was updated successfully, but these errors were encountered:

Skylion007 · 2021-09-20T17:35:31Z

I think it sounds like a.) is the appropriate choice given this info. Unfortunate we need to add a noninline function on such a hot code path, but there doesn't seem to be a elegant way around it.

jbms · 2021-09-20T18:05:57Z

I have been exploring an alternative workaround: introduce a type of lock that can be acquired that blocks Python from exiting. This is implemented by registering an atexit function that sets a global variable indicating Python is exiting, which prevents future attempts to acquire an "exit lock", and then waits until any previously-acquired exit locks are released. This lock can be managed by an RAII type similar to gil_scoped_acquire, except that it has a bool member function that tells you whether the lock could be successfully acquired (i.e. that Python is not exiting). The atexit handlers run while the GIL still operates normally, just after Python waits for all non-daemon threads to exit, before Python goes into the "dangerous" finalization stage.

Then any time you have a C++ thread that wants to acquire the GIL in order to call Python API functions, it should first acquire this exit block. If successful, it can then acquire the GIL and proceed. If it cannot acquire the exit block because Python is exiting, it will have to skip calling the Python APIs. There is a possibility that some code running in the main Python thread is blocked, directly or indirectly, on whatever Python API the C++ thread would have liked to call. To avoid deadlock, an atexit function could be registered that somehow prevents such blocking, e.g. by ensuring an early error return instead.

This prevents C++ code from itself causing crashes. It is still possible for Python code running in a daemon thread to be calling a pybind11-bound function (which has directly or indirectly released the GIL) while Python exits. But that is at least more under the user's control.

In general, pthread_exit is just not very safe and I think it is best to try to avoid it being called at all, rather than trying to deal with the repercussions after it is called. On Windows, Python uses _endthreadex to terminate threads, and that does not call C++ destructors. That avoids crashes due to Py_DECREF calls but there is still the possibility of memory corruption due to references to data stored on the thread's stack.

jbms · 2021-09-24T18:12:54Z

See here for the Python pull request that attempts to address this problem by avoiding pthread_exit, and also adding a "finalize block" mechanism:

python/cpython#28525

rwgk · 2021-12-09T22:24:16Z

I think (a) would definitely be the most practical option.

That's exactly what we're doing in some other contexts (e.g. https://reviews.llvm.org/D114722).

Unfortunate we need to add a noninline function on such a hot code path,

We need performance data. More often than not, overhead from such things is practically undetectable. That's my default assumption here, with conclusion: let's add the _Py_IsFinalizing() check and wait to see if anyone notices.

jbms · 2021-12-09T22:52:49Z

There seems to be at least some agreement towards fixing the problem in Python itself, but my pull request has stalled:
python/cpython#28525

I ended up implementing the "exit block" mechanism In TensorStore as described in my previous comment, similar to the one implemented in the above Python pull request:
google/tensorstore@9ac3aa4

Even if we solve the specific issue of ~object, being pthread_exit-safe requires guarantees about your entire call stack (namely that every function in your call stack is also pthread_exit safe, and not marked noexcept) that are not really practical for library code to ensure.

rwgk · 2021-12-09T23:18:07Z

@jbms do you think we should keep this bug open anyway?

jbms · 2021-12-09T23:57:13Z

Maybe it is fine to close it given there is a python bug filed. Or we could close it once there is a resolution to the Python bug.

rwgk · 2021-12-10T00:04:08Z

Closing because it is not actionable here. It will still show up in searches.

jbms added the triage New bug, unverified label Sep 15, 2021

Skylion007 added bug help wanted labels Sep 19, 2021

rwgk closed this as completed Dec 10, 2021

lixinqi mentioned this issue Mar 16, 2022

Export env to python Oneflow-Inc/oneflow#7792

Merged

rwgk mentioned this issue Feb 10, 2023

FWD pybind11 google/pybind11clif#3274

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: Python API calls in destructors are unsafe outside the main thread #3274

[BUG]: Python API calls in destructors are unsafe outside the main thread #3274

jbms commented Sep 15, 2021

Skylion007 commented Sep 20, 2021

jbms commented Sep 20, 2021

jbms commented Sep 24, 2021

rwgk commented Dec 9, 2021

jbms commented Dec 9, 2021

rwgk commented Dec 9, 2021

jbms commented Dec 9, 2021

rwgk commented Dec 10, 2021

[BUG]: Python API calls in destructors are unsafe outside the main thread #3274

[BUG]: Python API calls in destructors are unsafe outside the main thread #3274

Comments

jbms commented Sep 15, 2021

Required prerequisites

Problem description

Reproducible example code

Skylion007 commented Sep 20, 2021

jbms commented Sep 20, 2021

jbms commented Sep 24, 2021

rwgk commented Dec 9, 2021

jbms commented Dec 9, 2021

rwgk commented Dec 9, 2021

jbms commented Dec 9, 2021

rwgk commented Dec 10, 2021