-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python crash related to thread race #671
Comments
Stack trace:
|
But the stack trace is not always the same; here is another one.
|
The second, long stack trace is on thread 1. The first, short stack is from thread 19. I'm thinking that python is (accidentally) being called from multiple threads, without a lock. |
Adding additional locks to the PythonEval ctor and dtor has no effect. I reviewed all locks, and they seem OK to me. |
Googling this error message suggests to not call |
@inflector if you can review this, that would be great. If I stub out the I don't understand why it did not dead-lock immediately. Stubbing out |
@linas I haven't played with ROS at all so far. It may take me a bit to get it setup. Is rospy written entirely in Python? Does it use any C/C++ or Cython that we wrote that's being linked in dynamically? Or it is just a Python wrapper for an ROS library? (EDIT: No need to answer these now as I've already figured this out.) |
One thing that does come to mind is that you need to call PyEval_InitThreads(); in every thread that's not a Python thread that calls into Python's C API. Failing to do that will give you the tstate crash. Who is creating the threads in rospy? Are some of them non-Python threads that ROS generates via C++ for example? |
And the suggestion to not release the GIL lock only applies to Python module definitions, i.e. extending Python with C++, not to the case of initializing threads in C++ for a system that will be using the Python C API to embed Python. |
It's also possible that in OpenCog we're running an atypical case with rospy, one that may not even work reliably. That is an embedded Python interpreter importing and running rospy rather than a Python program doing so. |
You realize I'm talking about https://github.com/opencog/atomspace/blob/master/opencog/cython/PythonEval.cc right? It does seem to call I doubt that the bug has anything to directly to do with ROS -- there is no mention of ROS in the stack frames. I suspect that the error can be reproduced with other python code... |
Hmmm, it doesn't appear that this path (i.e. guile from a linux command prompt loading PythonEval via an SCM module) provides a place for our PythonEval system to get initialized. This use of Python from guile outside of a CogServer appears to be new in the last year. I don't recall that being possible before. In particular, if you call guile from the unix prompt, rather than inside of the CogServer, for instance, then The comment here in PythonSCM.cc hints at the problem. So the guile/scheme module initialization code that gets called when doing:
probably:
needs to call While you are fixing that, I'd remove the incorrect comments inside
to:
or something like that. |
I'd fix this myself, but I'm not sure which of the scheme module initialization routines is the right one. |
Its also a heisenbug: if, after the |
I take that back ... it didn't crash, because I had a syntax error (tabs not spaces) Once I fixed the whitespace, it starts crashing again. |
Anyway, this has nothing to do with scheme. I assume it's reproducible just by called PythonEval::insance() from a C++ main() and moving forward from there. |
Here, the below crashes, with no guile in it, at all: its just pure opencog-python:
It crashes with the same collection of error messages, and the same typical stack traces (except, of course, the guile part is not in the stack trace any more.) |
Yes, I was focusing on what was different about this test case path. The first thing that stood out was rospy.
It was the missing However, I just noticed that you added a call to |
I can't reproduce the error by using something other than rospy, although I did not try very hard, because I don't have any complex python code laying around that I could experiment with. It is somehow related to the execfile. Without the execfile, there is no problem. For example, the following works (if the cogserver is started):
|
Replacing |
I'm betting that it's the interaction between |
I believe the problem is related to the thread spawning in
this appears to be something you can't do while you have the GIL lock which you must have in order to invoke the interpreter. I've simplified this down by going straight to the Python C api calls required to call the rospy code. So this reproduces the bug without any OpenCog-specific code whatsoever, and no external files:
The above duplicates the problem. I believe it comes down to spawning a new thread while already holding the GIL and then not returning to the same Python thread that was active when the gill was being held. I tried putting I'm not at all sure why you can call the same code from the python prompt and it works fine. |
I've posted a question to "answers.ros.org": http://answers.ros.org/question/227691/crash-when-calling-init_node-while-holding-gil/ Let's see if anyone on the ros team has seen this before. |
If I comment out the PyEval_ReleaseLock(); in the above, then it doesn't crash! |
... the example doesn't crash, but in the 'regular' code, it results in a deadlock... |
Replacing |
surrounding PySys_SetArgvEx() byt GLI state ensure moves the crash down the line, to |
woo-hoo its fixed! Two fixes: one to use PyEval_SaveThread(); instead of PyEval_ReleaseLock() in the atomspace, and a second to surround the opencog module with a GIL |
This is part 2 of fix to bug opencog/atomspace#671
Using PyEval_SaveThread(); instead of PyEval_ReleaseLock() uncovered multiple locations where GIL locks were missing. I think I got them all, now. All pull requests have merged. The code that originally provoked this now works! Yayyyyy! |
p.s. thanks inflector, this would have remained unsolved without the shortened version! |
I posted a note on ROS Answers. There are two additional issues one needs to worry about.
http://wiki.ros.org/rospy/Overview/Initialization%20and%20Shutdown#Testing_for_shutdown
I can take care of 2), but doing 1) will require some sort of exit hook function registration system analogous to rospy's |
The funny thing is that I had tried Your finding it to work for your case gave me the impetus to try it again and work through the other issues with the shortened sample code. |
I did an unclean hack for #2 already, it seems to work fine. For #1 I think its OK to punt. The current behavior node does a while not rospy.is_shutdown() i.e effectively runs an infinite loop in the atomspace. If I exit guile or kill the cogserver, it all comes crashing down, which is maybe not pretty but its acceptable. Clean shutdowns need to be deferred until we gt an actual agent deisgn, instead of the current kludge. |
Calling |
I personally have an issue with crashing as well but when I am embedding the interpreter when I try to convert the int
#ifdef _WIN32
wmain(int argc, wchar_t *argv[])
#else
main(int argc, char *argv[])
#endif
{
int err;
wchar_t *program = Py_DecodeLocale(argv[0], NULL);
if (program == NULL) {
fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
exit(1);
}
Py_SetProgramName(program); /* optional but recommended */
Py_Initialize();
int initialized = Py_IsInitialized();
if (initialized != 0) {
#ifdef _WIN32
/*
allows use of sys.argv to be possible
with no tracebacks.
Note: This is not possible to use in any other
platform for now until someone helps patch
this for me (I am not sure what to do to
make it work like this on them).
*/
PySys_SetArgvEx(argc, argv, 0);
// code that reads the script from the PE resource section and runs it.
#else
//crash here when I try to use it the wchar_t * version of argv on platforms not Windows.
#endif
//snip. |
Hi @AraHaan are you sure you are in the right place? I'm pretty sure that OpenCog does not compile on Windows, and I know that we do not initialize python in main, -- it is a loadable module that gets loaded only if you want to use python. Also, none of our code has been ported to python 3.5 yet -- we need someone knowledgable in python to do this. And finally, we do not use wchar anywhere, we use UTF8 exclusively for all strings. |
Well for python 3.5+ |
We don't call |
Here is a bug is bugging me. It requires three files to reproduce.
File
f.scm
contains:file
b.py
contains:file
rrr.py
contains:The, do this:
guile -l f.scm
and it crashes with:or with
or with
but usually it just crashes with no message at all.
The text was updated successfully, but these errors were encountered: