-
-
Notifications
You must be signed in to change notification settings - Fork 30.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python/pystate.c:2218: _PyThreadState_PopFrame: Assertion `tstate->datastack_top >= base' failed. #93252
Comments
Spawning and killing lots of threads does suggest a race condition. Do you have the values of The question I want to answer is this: Has one of the internal pointers exceeded its bounds, or is it pointing to the wrong piece of memory? Thanks |
What's strange is that this always seems to happen at around the same time. I'm not entirely sure it's always the same number of repetitions, but it certainly always seems to be 50 ± 5 so far (spawning 3 threads each, from what I can see in gdb). Might well always be 47 times, it's somewhat hard to count since the process sending the commands runs independently.
Also,
Thanks for taking a look! Let me know if you need more information. Appreciate you taking a look, despite the lack of a simpler reproducer. |
Sorry, I didn't make myself clear. Could I get all the addresses, so I can see how they are inconsistent.
Thanks for helping. |
Sure:
>>> datastack_chunk = 0x7fffe4c06000
>>> base = 0x7ffff7821f10
>>> datastack_top = 0x7fffe4c06098
>>> datastack_limit = 0x7fffe4c0a000
>>>
>>> datastack_chunk < base
True
>>> base < datastack_top
False
>>> datastack_top < datastack_limit
True
|
It looks like the frame and thread's frame stack are both consistent, but that the frame belongs to another threadstate. Could you check whether the tstate's frame is the same as the frame. |
(same run as before, keeping it open as long as I don't need to reboot)
It is:
Yep!
|
@markshannon You (or someone who knows their way around this code more than I do) don't happen to be at PyConIT by any chance (or perhaps Europython, though that's still a bit farther away)? Would be happy to go for a debugging session there if that's an option, might be easier, given that the issue seems rather tricky to track down. |
I will be at EuroPython. Hopefully we will have fixed this before then |
Unfortunately I had to cancel my Europython attendance thanks to COVID, so I won't be available for any in-person debugging there 😢. If you'd be interested in a video call or something to debug this together on my machine, I'd be in for that, though. Or I can just continue to provide information here if you tell me what to look at - I have no idea what the involved code even does exactly, so I'm afraid I probably won't be able to find out anything interesting alone. |
Have you tested the current main branch? Have you tried tuning the gc thresholds to see if it makes it easier to trigger? |
The mian branch fails in the same way:
I don't think it's related to the gc in any way. I tried a |
@pablogsal should this perhaps be a deferred/release blocker? I realize I'm currently the only reporter seeing this, but it also happens with a Python 3.11 release build and is a crash happening during normal user usage - thus making qutebrowser unusable for daily usage on Python 3.11. |
Without a reproducer only involving CPython we sadly don't know if this is something on the dependencies or an extension or similar, unfortunately. If you or someone else can provide a simpler reproducer only using CPython code, that would help a lot to estimate the severity of this. |
Makes sense. Unfortunately I have no idea on where to even start with this (other than the bisecting/debugging above). All I can gather so far is that it's something about Python's internal frame objects and perhaps multiple threads... |
I understand, and thanks a lot for reporting this and the effort bisecting it. Sadly is very difficult for me as Release Manager to block on this if we cannot even reproduce it ourselves easily because this means that even if there is enough suspicion that something is missing it would be blocked indefinitely as we don't even know where the problem could be. Another possibility is to wait until other simpler code triggers the same kind of error. |
Based on the reproducer it may be that something of the thread state management is not protected by the GIL and is causing thread states to be mixed or re-used. But this is just a wild guess. |
The good news: I arrived at a much simpler reproducer after lots of tinkering. The bad news: It still requires either PyQt5 or PyQt6 to reproduce.... Here it is: from PyQt5.QtCore import QObject
def run():
obj = QObject()
for _ in range(202):
obj.destroyed.connect(lambda: None)
run() Stacktrace:
Some observations:
|
I'm marking this as a release blocker because ae0a2b7 should not affect C extensions and that's suspicious. Note that I'm still not sure this is not just some latent bug in PyQt that was just triggered by these changes, but it would be good to double check as this could be a more general issue. |
Btw thanks a lot @The-Compiler for working on a simpler reproducer! |
Much appreciated, thanks! I've been trying to figure out more, and there's one more thing that caught my eye: When adjusting the reproducer to print the Python stack: from PyQt5.QtCore import QObject
import traceback
def fun():
traceback.print_stack()
print("========")
def run():
obj = QObject()
for _ in range(169):
obj.destroyed.connect(fun)
run() then:
From what I understand, File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 13, in <module>
run()
File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 5, in fun
traceback.print_stack() but with Python 3.11, the stacktrace seems to point to the File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 13, in <module>
run()
File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 10, in run
for _ in range(169):
File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 5, in fun
traceback.print_stack() however, that might also be a red herring perhaps? When I check out ae0a2b7 (the first commit with the issue), I do see the crash as soon as I use |
Can you show the C++ stack at the time I'm having trouble installing PyQt5 in my M1 mac laptop unfortunately, so so far I cannot reproduce. I'm getting this error, for the curious:
|
By pure accident, I found a Python-only reproducer finally! Fingers crossed that it's in fact the same issue, but it does sure look like it: class Obj:
def __del__(): # sic!
pass
def run():
for _ in range(202):
obj = Obj()
run() As before, 201 works fine, 202 fails. And the traceback does again point to the for-loop, though I suppose it actually makes sense here, as that's where the old Exception ignored in: <function Obj.__del__ at 0x7f3881485b20>
Traceback (most recent call last):
File "/home/florian/tmp/python-crash/qutebrowser/repro.py", line 7, in run
for _ in range(202):
^^^^^^^^^^^^^^^^^^^^
TypeError: Obj.__del__() takes 0 positional arguments but 1 was given
python: Python/pystate.c:2201: _PyThreadState_PopFrame: Assertion `tstate->datastack_top >= base' failed. @pablogsal I believe PyQt5 does not have wheels for M1 macs. You should however be able to simply replace PyQt5 by PyQt6 in the imports and run again with C++ stack in
|
Fantastic, with this reproducer we can start working! This is indeed a release blocker, and is a bit scary how simple the code that crases is. Fantastic work producing the reproducer The-Compiler 🥇 |
Yeah, something’s wrong with the frame stack logic. Our stack of frames shouldn’t be growing at all, but it is, deeper and deeper, until it’s time to allocate a new chunk. Once we try to pop the next frame off after allocating that new chunk, It appears that we don't pop off the new def f():
pass
for _ in range(203):
try:
f(None)
except:
pass |
(So, in other words, this is really serious.) |
I think I found it. |
That seems correct. diff --git a/Python/ceval.c b/Python/ceval.c
index 0176002432..b4e2fee0a4 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -6410,7 +6410,7 @@ _PyEvalFramePushAndInit(PyThreadState *tstate, PyFunctionObject *func,
}
if (initialize_locals(tstate, func, localsarray, args, argcount, kwnames)) {
assert(frame->owner != FRAME_OWNED_BY_GENERATOR);
- _PyFrame_Clear(frame);
+ _PyEvalFrameClearAndPop(tstate, frame);
return NULL;
}
return frame;
|
@pablogsal, what if you crank up |
I tried up to 20000 and I also tried to decrement the recursion limit with no luck :( |
I wonder if your compiler can prove that the comparison is undefined for those two (different) buffers and skips it. Or something. |
Ugh, is quite unfortunate that we didn't catched this before :( I suppose there aren't many reason for the locals allocation to fail (on normal, working code that's passing the correct parameters to functions) other than funky destructors, but is a bit worrisome that our entire test suite never touched this path |
Thankfully it's easy to write a test for now. I'm testing out fixes on both branches. |
It's actually a bit tricky to hit. You need make N failing calls (enough to overflow the current 16KB stack chunk) and recover gracefully from all of them without exiting (successfully or otherwise) from the oldest frame where a failed call occurred. And even then, it doesn't actually crash until that oldest frame finally exits. |
Thanks @brandtbucher, you rock 🤘 |
One thing that's still unclear to me: How does the locals allocation fail in my original reproducer using FWIW I'm currently trying out the fix proposed by @kumaraditya303 above on top of b3, and it looks like the qutebrowser testsuite looks much better with it! ✨ (I also saw a weird SIGIOT when trying to run pytest with the latest 3.11 branch, but that's probably something for a new issue, if none exists yet...) |
Likely there is some C call that's failing and is raising unraisable exceptions or is cleaning the error indicator. These calls consume stacks that are never pop-ed but they are cleaned. If the original error doesn't happen again, it has to be something like this (fail to allocate the local stack) because otherwise that would indicate that we are failing to pop in other places. |
@pablogsal Out of curiosity, I tried to build with I will give #94693 a try somewhen over the next couple of days, I need to take a break after hours of reproducing bugs 😅. Finally, the pytest issue I have encountered while looking into this one: |
…94693) (cherry picked from commit 8a285df) Co-authored-by: Brandt Bucher <[email protected]>
…thonGH-94693) (cherry picked from commit 8a285df) Co-authored-by: Brandt Bucher <[email protected]>
The bug has been fixed in main and 3.11 branch. Thank you @brandtbucher and @kumaraditya303 |
I just tested the 3.11 fix, and can confirm everything seems to work properly now, including the qutebrowser testsuite (minus a few issues which are probably not CPython bugs). Thanks a lot, @markshannon @pablogsal @kumaraditya303 @brandtbucher @tiran! ❤️ |
Crash report
When running my application (qutebrowser) in a way it spawns/kills a lot of threads, after around 140 (?) threads spawned/killed, I get (depending on which commit I'm testing) either of:
Referring to:
cpython/Python/pystate.c
Line 2218 in 8d32a5c
I've spent hours on trying to find a more minimal reproducer, but unfortunately, the best I can offer without some help about what I could try next is this:
git clone https://github.com/qutebrowser/qutebrowser.git
cd qutebrowser
python3.11 -m venv .venv
.venv/bin/pip install -r requirements.txt -r misc/requirements/requirements-pyqt.txt
.venv/bin/python3 -m qutebrowser --temp-basedir -s tabs.last_close blank -s qt.chromium.sandboxing disable-seccomp-bpf
d
which will close the current tabu
which will reopen itcrasher.py
and make it executable.venv/bin/python3 -m qutebrowser --temp-basedir -s tabs.last_close blank -s qt.chromium.sandboxing disable-seccomp-bpf ":later 1000 spawn -u $PWD/crasher.py"
After some time (due to having to skip most commits because of #92112), I was able to bisect this to:
ae0a2b7 ("bpo-44590: Lazily allocate frame objects (GH-27077)", @markshannon)
Error messages
Stacktrace:
Debugging:
Happy to try more or report some debugging information, but I'm afraid I'm stuck at this point.
Your environment
The text was updated successfully, but these errors were encountered: