Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python crash related to thread race #671

Closed
linas opened this issue Feb 26, 2016 · 38 comments
Closed

Python crash related to thread race #671

linas opened this issue Feb 26, 2016 · 38 comments
Labels

Comments

@linas
Copy link
Member

linas commented Feb 26, 2016

Here is a bug is bugging me. It requires three files to reproduce.

File f.scm contains:

(use-modules (opencog) (opencog python))
(python-eval "execfile('b.py')")

file b.py contains:

from rrr import Grrr
Grrr()

file rrr.py contains:

import rospy
class Grrr():
   def __init__(self):
      # The below will hang until roscore is started!
      rospy.init_node("OpenCog_Eva")

The, do this: guile -l f.scm and it crashes with:

Fatal Python error: GC object already tracked

or with

Fatal Python error: ceval: tstate mix-up

or with

Fatal Python error: PyEval_AcquireThread: non-NULL old thread state

but usually it just crashes with no message at all.

@linas linas changed the title Python crash heisenbug Python crash Feb 26, 2016
@linas
Copy link
Member Author

linas commented Feb 26, 2016

Stack trace:

(gdb) bt
#0  PyFrame_New (tstate=tstate@entry=0x0, code=code@entry=0x7fffe1f1a930, 
    globals=0x7fffe1f136e0, locals=locals@entry=0x0)
    at ../Objects/frameobject.c:627
#1  0x00007fffe351ddb4 in PyEval_EvalCodeEx (co=0x7fffe1f1a930, 
    globals=<optimized out>, locals=locals@entry=0x0, args=0x7fffe119ace8, 
    argcount=argcount@entry=1, kws=0x7fffe119acf0, kwcount=0, defs=0x0, 
    defcount=0, closure=closure@entry=0x0) at ../Python/ceval.c:3033
#2  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=1, n=1, 
    pp_stack=0x7fffe1182a80, func=0x7fffe157e320) at ../Python/ceval.c:4116
#3  call_function (oparg=<optimized out>, pp_stack=0x7fffe1182a80)
    at ../Python/ceval.c:4041
#4  PyEval_EvalFrameEx (f=f@entry=0x7fffe119ab60, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#5  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, 
    args=args@entry=0x7fffe3050ea8, argcount=1, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=0x0) at ../Python/ceval.c:3252
#6  0x00007fffe35536d0 in function_call (func=0x7fffe157e230, 
    arg=0x7fffe3050e90, kw=0x0) at ../Objects/funcobject.c:526
#7  0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe157e230, 
    arg=arg@entry=0x7fffe3050e90, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#8  0x00007fffe344b7bd in instancemethod_call.8988 (func=0x7fffe157e230, 
---Type <return> to continue, or q <return> to quit--- 
    arg=0x7fffe3050e90, kw=0x0) at ../Objects/classobject.c:2602
#9  0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe1190050, 
    arg=arg@entry=0x7fffe314d050, kw=<optimized out>)
    at ../Objects/abstract.c:2529
#10 0x00007fffe3538577 in PyEval_CallObjectWithKeywords (func=0x7fffe1190050, 
    arg=0x7fffe314d050, kw=<optimized out>) at ../Python/ceval.c:3889
#11 0x00007fffe34a0f92 in t_bootstrap.73787 (boot_raw=0x1091710)
    at ../Modules/threadmodule.c:614
#12 0x00007ffff789b182 in start_thread (arg=0x7fffe1183700)
    at pthread_create.c:312
#13 0x00007ffff75c847d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@linas
Copy link
Member Author

linas commented Feb 26, 2016

But the stack trace is not always the same; here is another one.

Fatal Python error: ceval: tstate mix-up

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe3518538 in PyEval_EvalFrameEx (f=f@entry=0x7fffe12286f0, 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:874
874 ../Python/ceval.c: No such file or directory.
(gdb) bt
#0  0x00007fffe3518538 in PyEval_EvalFrameEx (f=f@entry=0x7fffe12286f0, 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:874
#1  0x00007fffe351d059 in fast_function (nk=<optimized out>, na=0, n=0, 
    pp_stack=0x7fffffffc1b0, func=0x7fffe120c050) at ../Python/ceval.c:4106
#2  call_function (oparg=<optimized out>, pp_stack=0x7fffffffc1b0)
    at ../Python/ceval.c:4041
#3  PyEval_EvalFrameEx (f=f@entry=0x7fffe118c450, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#4  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, 
    argcount=argcount@entry=2, kws=0x141a710, kwcount=2, defs=0x7fffe1243888, 
    defcount=3, closure=closure@entry=0x0) at ../Python/ceval.c:3252
#5  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=2, n=6, 
    pp_stack=0x7fffffffc3d0, func=0x7fffe11c99b0) at ../Python/ceval.c:4116
#6  call_function (oparg=<optimized out>, pp_stack=0x7fffffffc3d0)
    at ../Python/ceval.c:4041
#7  PyEval_EvalFrameEx (f=f@entry=0x141a510, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#8  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, 
    argcount=argcount@entry=1, kws=0x7fffe31311e0, kwcount=0, 
    defs=0x7fffe11f13b0, defcount=8, closure=closure@entry=0x0)
    at ../Python/ceval.c:3252
#9  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=1, n=1, 
---Type <return> to continue, or q <return> to quit---
    pp_stack=0x7fffffffc5f0, func=0x7fffe11f1cf8) at ../Python/ceval.c:4116
#10 call_function (oparg=<optimized out>, pp_stack=0x7fffffffc5f0)
    at ../Python/ceval.c:4041
#11 PyEval_EvalFrameEx (f=f@entry=0x7fffe3131050, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#12 0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, 
    args=args@entry=0x7fffe3050de8, argcount=1, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=0x0) at ../Python/ceval.c:3252
#13 0x00007fffe35536d0 in function_call (func=0x7fffe11f3c80, 
    arg=0x7fffe3050dd0, kw=0x0) at ../Objects/funcobject.c:526
#14 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe11f3c80, 
    arg=arg@entry=0x7fffe3050dd0, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#15 0x00007fffe344b7bd in instancemethod_call.8988 (func=0x7fffe11f3c80, 
    arg=0x7fffe3050dd0, kw=0x0) at ../Objects/classobject.c:2602
#16 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe1a3adc0, 
    arg=arg@entry=0x7fffe314d050, kw=<optimized out>)
    at ../Objects/abstract.c:2529
#17 0x00007fffe3538577 in PyEval_CallObjectWithKeywords (
    func=func@entry=0x7fffe1a3adc0, arg=arg@entry=0x7fffe314d050, 
    kw=kw@entry=0x0) at ../Python/ceval.c:3889
#18 0x00007fffe3474988 in PyInstance_New (klass=<optimized out>, 
---Type <return> to continue, or q <return> to quit---
    arg=0x7fffe314d050, kw=0x0) at ../Objects/classobject.c:581
#19 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe302dc80, 
    arg=arg@entry=0x7fffe314d050, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#20 0x00007fffe351a816 in do_call (nk=<optimized out>, na=0, 
    pp_stack=0x7fffffffcc00, func=0x7fffe302dc80) at ../Python/ceval.c:4238
#21 call_function (oparg=<optimized out>, pp_stack=0x7fffffffcc00)
    at ../Python/ceval.c:4043
#22 PyEval_EvalFrameEx (f=f@entry=0x7fffe305b200, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#23 0x00007fffe351e54d in PyEval_EvalCodeEx (co=co@entry=0x7fffe30517b0, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70, 
    args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=closure@entry=0x0) at ../Python/ceval.c:3252
#24 0x00007fffe351e682 in PyEval_EvalCode (co=co@entry=0x7fffe30517b0, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70)
    at ../Python/ceval.c:667
#25 0x00007fffe3553882 in run_mod (arena=0x104c0a0, flags=0x0, 
    locals=0x7fffe311bd70, globals=0x7fffe311bd70, filename=<optimized out>, 
    mod=<optimized out>) at ../Python/pythonrun.c:1370
#26 PyRun_FileExFlags (fp=fp@entry=0x1028d80, filename=<optimized out>, 
    start=start@entry=257, globals=0x7fffe311bd70, locals=0x7fffe311bd70, 
    closeit=closeit@entry=1, flags=0x0) at ../Python/pythonrun.c:1356
---Type <return> to continue, or q <return> to quit---
#27 0x00007fffe3553a59 in builtin_execfile (self=<optimized out>, 
    args=<optimized out>) at ../Python/bltinmodule.c:803
#28 0x00007fffe351d0d4 in call_function (oparg=<optimized out>, 
    pp_stack=0x7fffffffcfa0) at ../Python/ceval.c:4020
#29 PyEval_EvalFrameEx (f=f@entry=0x7fffe305b050, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#30 0x00007fffe351e54d in PyEval_EvalCodeEx (co=co@entry=0x7fffe3051730, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70, 
    args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=closure@entry=0x0) at ../Python/ceval.c:3252
#31 0x00007fffe351e682 in PyEval_EvalCode (co=co@entry=0x7fffe3051730, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70)
    at ../Python/ceval.c:667
#32 0x00007fffe35184b9 in run_mod (arena=0x109a720, flags=<optimized out>, 
    locals=0x7fffe311bd70, globals=0x7fffe311bd70, 
    filename=0x7fffe35e9f2b "<string>", mod=<optimized out>)
    at ../Python/pythonrun.c:1370
#33 PyRun_StringFlags (str=<optimized out>, start=257, globals=0x7fffe311bd70, 
    locals=0x7fffe311bd70, flags=<optimized out>) at ../Python/pythonrun.c:1333
#34 0x00007fffe3935f63 in opencog::PythonEval::execute_string (
    this=<optimized out>, command=0x10307e8 "execfile('b.py')\n")
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:618
---Type <return> to continue, or q <return> to quit---
#35 0x00007fffe393370c in opencog::PythonEval::apply_script (
    this=this@entry=0xb4d970, script=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:938
#36 0x00007fffe3933c18 in opencog::PythonEval::eval_expr (this=0xb4d970, 
    partial_expr=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:1279
#37 0x00007fffe3b421e1 in eval (expr=..., this=0xb4d970)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.h:172
#38 opencog::PythonSCM::eval (this=<optimized out>, pystr=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonSCM.cc:112
#39 0x00007fffe3b3ecbb in opencog::SchemePrimitive<opencog::PythonSCM>::invoke(scm_unused_struct*) [clone .local.0] (this=0xb4d590, args=0xf91070)
    at /home/linas/src/novamente/src/atomspace-git/opencog/guile/SchemePrimitive.h:440
#40 0x00007fffeefc1694 in opencog::PrimitiveEnviron::do_call (sfe=0xf8ab60, 
    arglist=0xf91070)
    at /home/linas/src/novamente/src/atomspace-git/opencog/guile/SchemePrimitive.cc:146
#41 0x00007ffff7b6df53 in vm_debug_engine (thread=0x7fffe12286f0, vp=0x790d80, 
    registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=0)
---Type <return> to continue, or q <return> to quit---
    at ../../libguile/vm-engine.c:800
#42 0x00007ffff7b73969 in scm_call_n (proc=0xa564a0, argv=0x0, nargs=0)
    at ../../libguile/vm.c:1236
#43 0x00007ffff7b6df53 in vm_debug_engine (thread=0x7fffe12286f0, vp=0x790d80, 
    registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=2)
    at ../../libguile/vm-engine.c:800
#44 0x00007ffff7b73969 in scm_call_n (proc=0x7ffff7fe0030, 
    argv=argv@entry=0x7fffffffd738, nargs=nargs@entry=1)
    at ../../libguile/vm.c:1236
#45 0x00007ffff7afddf7 in scm_primitive_eval (exp=exp@entry=0x842910)
    at ../../libguile/eval.c:654
#46 0x00007ffff7afde53 in scm_eval (exp=0x842910, 
    module_or_state=module_or_state@entry=0x8242d0)
    at ../../libguile/eval.c:688
#47 0x00007ffff7b450dd in scm_shell (argc=3, argv=0x7fffffffdd78)
    at ../../libguile/script.c:454
#48 0x00007ffff7b1543d in invoke_main_func (body_data=0x7fffffffdc20)
    at ../../libguile/init.c:338
#49 0x00007ffff7af7e1a in c_body (d=0x7fffffffdb60)
    at ../../libguile/continuations.c:429
#50 0x00007ffff7b70fc2 in vm_regular_engine (thread=0x7fffe12286f0, 
    vp=0x790d80, registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=10)
    at ../../libguile/vm-engine.c:800
#51 0x00007ffff7b73969 in scm_call_n (proc=proc@entry=0x836ca0, 
---Type <return> to continue, or q <return> to quit---
    argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1236
#52 0x00007ffff7afcc69 in scm_call_0 (proc=proc@entry=0x836ca0)
    at ../../libguile/eval.c:473
#53 0x00007ffff7b65212 in catch (tag=tag@entry=0x404, thunk=0x836ca0, 
    handler=0x836c80, pre_unwind_handler=0x836c60)
    at ../../libguile/throw.c:135
#54 0x00007ffff7b65645 in scm_catch_with_pre_unwind_handler (
    key=key@entry=0x404, thunk=<optimized out>, handler=<optimized out>, 
    pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:249
#55 0x00007ffff7b656ff in scm_c_catch (tag=tag@entry=0x404, 
    body=body@entry=0x7ffff7af7e10 <c_body>, 
    body_data=body_data@entry=0x7fffffffdb60, 
    handler=handler@entry=0x7ffff7af81a0 <c_handler>, 
    handler_data=handler_data@entry=0x7fffffffdb60, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7ffff7af7f50 <pre_unwind_handler>, pre_unwind_handler_data=pre_unwind_handler_data@entry=0x7f14a0)
    at ../../libguile/throw.c:370
#56 0x00007ffff7af84ef in scm_i_with_continuation_barrier (
    body=body@entry=0x7ffff7af7e10 <c_body>, 
    body_data=body_data@entry=0x7fffffffdb60, 
    handler=handler@entry=0x7ffff7af81a0 <c_handler>, 
    handler_data=handler_data@entry=0x7fffffffdb60, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7ffff7af7f50 <pre_unwind_handler>, pre_unwind_handler_data=0x7f14a0) at ../../libguile/continuations.c:367
---Type <return> to continue, or q <return> to quit---
#57 0x00007ffff7af8585 in scm_c_with_continuation_barrier (
    func=<optimized out>, data=<optimized out>)
    at ../../libguile/continuations.c:463
#58 0x00007ffff7b631ac in with_guile_and_parent (
    base=base@entry=0x7fffffffdbc0, data=data@entry=0x7fffffffdbf0)
    at ../../libguile/threads.c:786
#59 0x00007ffff7185812 in GC_call_with_stack_base (
    fn=fn@entry=0x7ffff7b63160 <with_guile_and_parent>, 
    arg=arg@entry=0x7fffffffdbf0) at misc.c:1925
#60 0x00007ffff7b63668 in scm_i_with_guile_and_parent (parent=<optimized out>, 
    data=data@entry=0x7fffffffdbf0, 
    func=func@entry=0x7ffff7b15420 <invoke_main_func>)
    at ../../libguile/threads.c:829
#61 scm_with_guile (func=func@entry=0x7ffff7b15420 <invoke_main_func>, 
    data=data@entry=0x7fffffffdc20) at ../../libguile/threads.c:835
#62 0x00007ffff7b155f2 in scm_boot_guile (argc=argc@entry=3, 
    argv=argv@entry=0x7fffffffdd78, 
    main_func=main_func@entry=0x400a60 <inner_main>, closure=closure@entry=0x0)
    at ../../libguile/init.c:321
#63 0x0000000000400910 in main (argc=3, argv=0x7fffffffdd78)
    at ../../libguile/guile.c:101
(gdb) 

@linas
Copy link
Member Author

linas commented Feb 26, 2016

The second, long stack trace is on thread 1. The first, short stack is from thread 19. I'm thinking that python is (accidentally) being called from multiple threads, without a lock.

@linas linas changed the title Python crash Python crash related to thread race Feb 26, 2016
@linas
Copy link
Member Author

linas commented Feb 26, 2016

Adding additional locks to the PythonEval ctor and dtor has no effect. I reviewed all locks, and they seem OK to me.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

Googling this error message suggests to not call PyEval_ReleaseLock()
http://stackoverflow.com/questions/8451334/why-is-pygilstate-release-throwing-fatal-python-errors

@linas
Copy link
Member Author

linas commented Feb 26, 2016

@inflector if you can review this, that would be great. If I stub out the PyEval_ReleaseLock the test case above works. A more complex test sort-of works ... a bunch of python does get executed, some rather complex files load, and they do all kids of python ROS stuff, and python prints do print out various messages, just fine, .. .for a while. But after a few seconds it deadlocks in opencog::PythonEval hangs, waiting on some python lock.

I don't understand why it did not dead-lock immediately. Stubbing out PyEval_ReleaseLock() does not deadlock python right away, it only does it after a while ...

@linas linas added the python label Feb 26, 2016
@inflector
Copy link
Member

@linas I haven't played with ROS at all so far. It may take me a bit to get it setup.

Is rospy written entirely in Python? Does it use any C/C++ or Cython that we wrote that's being linked in dynamically? Or it is just a Python wrapper for an ROS library?

(EDIT: No need to answer these now as I've already figured this out.)

@inflector
Copy link
Member

One thing that does come to mind is that you need to call PyEval_InitThreads(); in every thread that's not a Python thread that calls into Python's C API. Failing to do that will give you the tstate crash. Who is creating the threads in rospy? Are some of them non-Python threads that ROS generates via C++ for example?

@inflector
Copy link
Member

And the suggestion to not release the GIL lock only applies to Python module definitions, i.e. extending Python with C++, not to the case of initializing threads in C++ for a system that will be using the Python C API to embed Python.

@inflector
Copy link
Member

It's also possible that in OpenCog we're running an atypical case with rospy, one that may not even work reliably. That is an embedded Python interpreter importing and running rospy rather than a Python program doing so.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

You realize I'm talking about https://github.com/opencog/atomspace/blob/master/opencog/cython/PythonEval.cc right? It does seem to call PyEval_InitThreads() and PyGILState_Ensure() in all the right places.

I doubt that the bug has anything to directly to do with ROS -- there is no mention of ROS in the stack frames. I suspect that the error can be reproduced with other python code...

@inflector
Copy link
Member

Hmmm, it doesn't appear that this path (i.e. guile from a linux command prompt loading PythonEval via an SCM module) provides a place for our PythonEval system to get initialized. This use of Python from guile outside of a CogServer appears to be new in the last year. I don't recall that being possible before.

In particular, if you call guile from the unix prompt, rather than inside of the CogServer, for instance, then global_python_initialize() will never get called so when you use a PythonEval object it won't have been properly initialized. Our previous usages of Python via a guile prompt were all done inside the CogServer where global_python_initialize() is being called during initialization.

The comment here in PythonSCM.cc hints at the problem.

So the guile/scheme module initialization code that gets called when doing:

(use-modules (opencog) (opencog python))

probably:

void PythonSCM::init(void)

needs to call global_python_initialize().

While you are fixing that, I'd remove the incorrect comments inside global_python_initialize() in PythonEval.cc, i.e change:

    // Throw an exception if this is called more than once.
    if (already_initialized) {
        return;
        throw opencog::RuntimeException(TRACE_INFO,
            "Python initializer global_python_init() called twice.");
    }

to:

    // If we've already initialized then there is nothing to do...
    if (already_initialized)
        return;

or something like that.

@inflector
Copy link
Member

I'd fix this myself, but I'm not sure which of the scheme module initialization routines is the right one.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

Its also a heisenbug: if, after the rospy.init call, I call os.system("ls") then the above example does not crash. Something about the additional time delay? Or something else?

@linas
Copy link
Member Author

linas commented Feb 26, 2016

I take that back ... it didn't crash, because I had a syntax error (tabs not spaces) Once I fixed the whitespace, it starts crashing again.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

Anyway, this has nothing to do with scheme. I assume it's reproducible just by called PythonEval::insance() from a C++ main() and moving forward from there.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

Here, the below crashes, with no guile in it, at all: its just pure opencog-python:

#include <string>
#include <opencog/cython/PythonEval.h>
using namespace opencog;

int main(int argc, char* argv[])
{
   PythonEval& pyev = PythonEval::instance();
   std::string rv;

   std::string pystr = "execfile('b.py')";
   rv = pyev.eval(pystr);
   printf ("its %s\n", rv.c_str());
   return 0;
}

It crashes with the same collection of error messages, and the same typical stack traces (except, of course, the guile part is not in the stack trace any more.)

@inflector
Copy link
Member

You realize I'm talking about PythonEval.cc right?

Yes, I was focusing on what was different about this test case path. The first thing that stood out was rospy.

this has nothing to do with scheme

It was the missing global_python_initialize that I thought was the problem, not guile per se.

However, I just noticed that you added a call to global_python_initialize inside the PythonEval constructor last May. The presence of that call changes things since the creation of the singleton should now also call global_python_initialize during the constructor for the singleton.

@linas
Copy link
Member Author

linas commented Feb 26, 2016

I can't reproduce the error by using something other than rospy, although I did not try very hard, because I don't have any complex python code laying around that I could experiment with.

It is somehow related to the execfile. Without the execfile, there is no problem. For example, the following works (if the cogserver is started):

echo "py" | cat - b.py |netcat localhost 17001

@linas
Copy link
Member Author

linas commented Feb 26, 2016

Replacing std::string pystr = "execfile('b.py')"; by std::string pystr = "import b"; results in the same crash.

@inflector
Copy link
Member

I'm betting that it's the interaction between execfile and rospy spawning some python threads to do the connections. I've got rospy installed and am reproducing the problem on this end now too, and trying to narrow down the test case.

@inflector
Copy link
Member

I believe the problem is related to the thread spawning in rospy.impl.init start_node which calls:

    node = rosgraph.xmlrpc.XmlRpcNode(port, handler, on_run_error=_node_run_error)
    node.start()

node.start in turn calls:

    def start(self):
        """
        Initiate a thread to run the XML RPC server. Uses thread.start_new_thread.
        """
        _thread.start_new_thread(self.run, ())

this appears to be something you can't do while you have the GIL lock which you must have in order to invoke the interpreter.

I've simplified this down by going straight to the Python C api calls required to call the rospy code. So this reproduces the bug without any OpenCog-specific code whatsoever, and no external files:

#include <Python.h>

int main(int argc, char* argv[])
{
    Py_InitializeEx(0);
    PyEval_InitThreads();

    static const char *argv0 = "test";
    PySys_SetArgv(1, (char **) &argv0);

    PyEval_ReleaseLock();

    PyGILState_STATE gstate;
    gstate = PyGILState_Ensure();

    PyRun_SimpleString("import rospy");
    printf ("import rospy - successful\n");

    PyRun_SimpleString("rospy.init_node('OpenCog_Eva')");
    PyGILState_Release(gstate);

    printf ("finished\n");
    return 0;
}

The above duplicates the problem.

I believe it comes down to spawning a new thread while already holding the GIL and then not returning to the same Python thread that was active when the gill was being held. I tried putting rospy.spin() just after the init_node call but it didn't make a difference, which leads me to believe that the problem is not that the call is returning before all the threads have finished.

I'm not at all sure why you can call the same code from the python prompt and it works fine.

@inflector
Copy link
Member

I've posted a question to "answers.ros.org":

http://answers.ros.org/question/227691/crash-when-calling-init_node-while-holding-gil/

Let's see if anyone on the ros team has seen this before.

@linas
Copy link
Member Author

linas commented Feb 27, 2016

If I comment out the PyEval_ReleaseLock(); in the above, then it doesn't crash!

@linas
Copy link
Member Author

linas commented Feb 27, 2016

... the example doesn't crash, but in the 'regular' code, it results in a deadlock...

@linas
Copy link
Member Author

linas commented Feb 27, 2016

Replacing PyEval_ReleaseLock() by PyEval_SaveThread(); allows the demo to finish, but causes the cogserver to crash, during cogserver init, in PySys_SetArgvEx() ...

@linas
Copy link
Member Author

linas commented Feb 27, 2016

surrounding PySys_SetArgvEx() byt GLI state ensure moves the crash down the line, to import_opencog__agent_finder ... feels like forward progress.

@linas
Copy link
Member Author

linas commented Feb 27, 2016

woo-hoo its fixed! Two fixes: one to use PyEval_SaveThread(); instead of PyEval_ReleaseLock() in the atomspace, and a second to surround the opencog module with a GIL

linas added a commit to linas/atomspace that referenced this issue Feb 27, 2016
linas added a commit to linas/opencog that referenced this issue Feb 27, 2016
linas added a commit that referenced this issue Feb 27, 2016
@linas
Copy link
Member Author

linas commented Feb 27, 2016

Using PyEval_SaveThread(); instead of PyEval_ReleaseLock() uncovered multiple locations where GIL locks were missing. I think I got them all, now. All pull requests have merged. The code that originally provoked this now works! Yayyyyy!

@linas linas closed this as completed Feb 27, 2016
@linas
Copy link
Member Author

linas commented Feb 27, 2016

p.s. thanks inflector, this would have remained unsolved without the shortened version!

@inflector
Copy link
Member

I posted a note on ROS Answers. There are two additional issues one needs to worry about.

  1. That the rospy threads terminate correctly. I'm not sure how to do this in the context of your use case. You can't exit the main Python thread, i.e. call Py_Finalize() until rospy is shutdown. The simplest way to do this is to call rospy.spin(), but I'm not sure where you'd do this in your use case. See:

http://wiki.ros.org/rospy/Overview/Initialization%20and%20Shutdown#Testing_for_shutdown

  1. We need to restore the thread state saved by PyEval_SaveThread() using PyThreadState_Swap(saved_thread_state) just before the call to Py_Finalize().

I can take care of 2), but doing 1) will require some sort of exit hook function registration system analogous to rospy's on_shutdown() hook. I imagine a list of python functions that gets iterated over prior to opencog shutdown. The code that initializes rospy would have to register one of these new functions.

@inflector
Copy link
Member

The funny thing is that I had tried PyEval_SaveThread() several times in various permutations but was running into crashes from the other issues and assumed the problem wasn't PyEval_SaveThread().

Your finding it to work for your case gave me the impetus to try it again and work through the other issues with the shortened sample code.

@linas
Copy link
Member Author

linas commented Feb 27, 2016

I did an unclean hack for #2 already, it seems to work fine. For #1 I think its OK to punt. The current behavior node does a while not rospy.is_shutdown() i.e effectively runs an infinite loop in the atomspace. If I exit guile or kill the cogserver, it all comes crashing down, which is maybe not pretty but its acceptable. Clean shutdowns need to be deferred until we gt an actual agent deisgn, instead of the current kludge.

@inflector
Copy link
Member

Calling while not rospy.is_shutdown() is actually one of the recommended ways to ensure that rospy shuts down before the threads terminate. That's pretty much all that rospy.spin() does.

@AraHaan
Copy link

AraHaan commented Jun 8, 2017

I personally have an issue with crashing as well but when I am embedding the interpreter when I try to convert the char *argv[] param in main to wchar_t * using Py_DecodeLocale, and then passing that wchar_t * return data to PySys_SetArgvEx. This is code I basically use (Python 3.5+ only):

int
#ifdef _WIN32
wmain(int argc, wchar_t *argv[])
#else
main(int argc, char *argv[])
#endif
{
  int err;
  wchar_t *program = Py_DecodeLocale(argv[0], NULL);
  if (program == NULL) {
    fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
    exit(1);
  }
  Py_SetProgramName(program);  /* optional but recommended */
  Py_Initialize();
  int initialized = Py_IsInitialized();
  if (initialized != 0) {
#ifdef _WIN32
    /*
      allows use of sys.argv to be possible
      with no tracebacks.

      Note: This is not possible to use in any other
      platform for now until someone helps patch
      this for me (I am not sure what to do to
      make it work like this on them).
    */
    PySys_SetArgvEx(argc, argv, 0);
    // code that reads the script from the PE resource section and runs it.
#else
    //crash here when I try to use it the wchar_t * version of argv on platforms not Windows.
#endif
//snip.

@linas
Copy link
Member Author

linas commented Jun 8, 2017

Hi @AraHaan are you sure you are in the right place? I'm pretty sure that OpenCog does not compile on Windows, and I know that we do not initialize python in main, -- it is a loadable module that gets loaded only if you want to use python. Also, none of our code has been ported to python 3.5 yet -- we need someone knowledgable in python to do this. And finally, we do not use wchar anywhere, we use UTF8 exclusively for all strings.

@AraHaan
Copy link

AraHaan commented Jun 9, 2017

Well for python 3.5+ PySys_SetArgvEx requires that argv is of wchar_t * to use and that it says to use Py_DecodeLocale with it. Problem on my end is that if I do it on non windows (and even on Windows before) the code somehow crashes. Also this was the only place where google given me results about this other than stack overflow (which I cant post questions or answer any anymore on) just to get the fixes I need to the crashes. Because of such I have temporarily put in an #error for non Windows systems until it is fixed.

@linas
Copy link
Member Author

linas commented Jun 9, 2017

We don't call PySys_SetArgvEx anywhere; I just grepped the code. We also do not call Py_DecodeLocale, and we don't support python 3.5, and we don't support Windows, and we don't support Mac OSX or Android. So I believe you are mistakenly posting to the wrong place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants