Show segfault in embed scoped_interpreter test #4500
Interesting ... I don't have good ideas (but I'm also not an embedding expert, although I fixed up a thing here and there recently). We have lots of Another question: did this work before? With an older version of pybind11? If yes, when did it break? (git bisect?) But maybe the first thing to try: run in a debugger and look at the stack trace. Could you please try and post the stack trace here? |
hi,
It was working as of yesterday. Locally, with the following
I get this stack trace
If I try a commit from Dec 1st 2022 it works
|
If I pop off the top two commits it no longer segfaults |
Then I'd say it's pretty much certain, #4459 is the culprit. That one came without a test. I'd say let's roll it back until we understand why it's causing the segfault here. |
Probably the trouble really started with #3744. Apparently that broke atexit, which led to #4459. The question to solve: how can we make this PR and atexit work? We need a volunteer to work on this, ideally this time around with full unit testing, even if it's not easy. @Skylion007 I'll create the rollback PR. |
…ybind#4459 (pybind#4486)" This reverts commit b2c1978. See pybind#4500 for background.
Do you want to merge this, or something similar to prevent these kinds of issues in future? Feel free to close otherwise. |
@PhilipDeegan Absolutely, if you could rebase and this doesn't SEGFAULT, then it would be a nice addition to the test suite (although please add a commit referencing the issue). |
So I did some digging through the stack trace and found out a bit of what's going on, @rwgk. The issue is that detail::get_local_internals() cannot be called safely after Py_Finalize(): pybind11/include/pybind11/embed.h Line 259 in b8f2855
There is some seriously nasty static variable behavior in our code right now. We construct local_internals and store it in a static variable: pybind11/include/pybind11/detail/internals.h Line 548 in b8f2855
The ctor for local_internals makes multiple calls to the Python API, though, which is not safe when the interpreter is not initialized! In practice it seems to be constructed on the first call to get_local_internals(), which with the atexit fix happens after Py_Finalize(). The first thing it does is call get_internals(), which tries to acquire the GIL before making any Python calls. It can't do this, though, because the interpreter isn't initialized! Hence the segfault: pybind11/include/pybind11/detail/internals.h Line 438 in b8f2855
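To make the ordering problem concrete, here is a minimal sketch in plain C++ that does not use pybind11 or the Python C API at all. Every name in it (`g_interpreter_alive`, `FakeLocalInternals`, `fake_get_local_internals`, and so on) is a hypothetical stand-in invented for illustration. It only shows that a function-local static is constructed on its first call, so if that first call comes after the "finalize" step, the constructor runs with no interpreter available.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for "is the interpreter initialized?". NOT the real Python C API.
static bool g_interpreter_alive = false;
static std::vector<std::string> g_log;

void fake_initialize() { g_interpreter_alive = true; }   // plays the role of Py_Initialize()
void fake_finalize() { g_interpreter_alive = false; }    // plays the role of Py_Finalize()

// Hypothetical analogue of local_internals: its constructor needs a live interpreter.
struct FakeLocalInternals {
    bool constructed_safely;
    FakeLocalInternals() : constructed_safely(g_interpreter_alive) {
        g_log.push_back(constructed_safely ? "ctor: interpreter alive"
                                           : "ctor: interpreter already finalized");
    }
};

// Function-local static: constructed on the FIRST call, whenever that happens.
FakeLocalInternals &fake_get_local_internals() {
    static FakeLocalInternals instance;
    return instance;
}

// Mimics the failing order: nothing touches the internals until after finalize.
bool first_use_after_finalize() {
    fake_initialize();
    fake_finalize();
    // First call happens here, so the constructor sees a dead interpreter.
    return fake_get_local_internals().constructed_safely;
}
```

In this toy version the constructor merely records the bad state; in the real code the equivalent constructor goes on to call into Python, which is where the crash described above comes from.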
The real issue is that the initial construction of local_internals is unsafe (UB), and this change is exposing it. The hotfix is to put the previous change back in, but cache a reference to the result of get_local_internals() before Py_Finalize() is called, then clear the C++ type_info after Py_Finalize(). Or, safer yet, cache references to the C++ maps themselves and only clear them. |
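A rough sketch of the "cache references to the C++ maps, then only clear them" idea, again in plain C++ with hypothetical stand-ins rather than pybind11's actual types. `FakeScopedInterpreter`, `registered_local_types`, and `run_lifecycle` are assumptions made up for this example, and this is not necessarily how the eventual fix is written. The point is only the ordering: take the reference while the interpreter is still alive, finalize, then touch nothing but the C++ container.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for interpreter state. NOT the real Python C API.
static bool g_interpreter_alive = false;

// Hypothetical analogue of local_internals holding a pure C++ map.
struct FakeLocalInternals {
    std::map<std::string, int> registered_local_types;
    bool constructed_safely;
    FakeLocalInternals() : constructed_safely(g_interpreter_alive) {}
};

FakeLocalInternals &fake_get_local_internals() {
    static FakeLocalInternals instance;  // constructed on first call
    return instance;
}

// Hypothetical analogue of scoped_interpreter.
struct FakeScopedInterpreter {
    FakeScopedInterpreter() { g_interpreter_alive = true; }
    ~FakeScopedInterpreter() {
        // 1. Cache a reference to the map BEFORE finalizing, so any lazy
        //    construction happens while the interpreter is still alive.
        auto &types = fake_get_local_internals().registered_local_types;
        // 2. Stand-in for Py_Finalize().
        g_interpreter_alive = false;
        // 3. Only touch the cached C++ container afterwards.
        types.clear();
    }
};

// Runs one full lifecycle, optionally registering a type inside the scope.
bool run_lifecycle(bool register_type) {
    {
        FakeScopedInterpreter guard;
        if (register_type) {
            fake_get_local_internals().registered_local_types["MyType"] = 1;
        }
    }
    auto &internals = fake_get_local_internals();
    return internals.constructed_safely && internals.registered_local_types.empty();
}
```

Calling `run_lifecycle(false)` corresponds to the case from the report, where a full interpreter lifecycle happens before anything else has touched the internals.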
Actually @PhilipDeegan @rwgk, I introduced your test case into our test suite and fixed the atexit bug in #4505. |
Hmm, I spoke too soon. Seems like some compilers are exploiting the UB for an optimization which is causing them to still be broken. |
Hey, we've noticed we're getting a segfault in the destructor of the scoped_interpreter.
I've updated a test of yours to force a full lifecycle of the scoped_interpreter class before anything else.
I've run this on my own fork and it does segfault.
It's possible I have misunderstood something so please let me know if this is not a smart thing to be doing.