Python crash related to thread race #671

linas · 2016-02-26T06:23:37Z

Here is a bug is bugging me. It requires three files to reproduce.

File f.scm contains:

(use-modules (opencog) (opencog python))
(python-eval "execfile('b.py')")

file b.py contains:

from rrr import Grrr
Grrr()

file rrr.py contains:

import rospy
class Grrr():
   def __init__(self):
      # The below will hang until roscore is started!
      rospy.init_node("OpenCog_Eva")

The, do this: guile -l f.scm and it crashes with:

Fatal Python error: GC object already tracked

or with

Fatal Python error: ceval: tstate mix-up

or with

Fatal Python error: PyEval_AcquireThread: non-NULL old thread state

but usually it just crashes with no message at all.

The text was updated successfully, but these errors were encountered:

linas · 2016-02-26T06:26:21Z

Stack trace:

(gdb) bt
#0  PyFrame_New (tstate=tstate@entry=0x0, code=code@entry=0x7fffe1f1a930, 
    globals=0x7fffe1f136e0, locals=locals@entry=0x0)
    at ../Objects/frameobject.c:627
#1  0x00007fffe351ddb4 in PyEval_EvalCodeEx (co=0x7fffe1f1a930, 
    globals=<optimized out>, locals=locals@entry=0x0, args=0x7fffe119ace8, 
    argcount=argcount@entry=1, kws=0x7fffe119acf0, kwcount=0, defs=0x0, 
    defcount=0, closure=closure@entry=0x0) at ../Python/ceval.c:3033
#2  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=1, n=1, 
    pp_stack=0x7fffe1182a80, func=0x7fffe157e320) at ../Python/ceval.c:4116
#3  call_function (oparg=<optimized out>, pp_stack=0x7fffe1182a80)
    at ../Python/ceval.c:4041
#4  PyEval_EvalFrameEx (f=f@entry=0x7fffe119ab60, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#5  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, 
    args=args@entry=0x7fffe3050ea8, argcount=1, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=0x0) at ../Python/ceval.c:3252
#6  0x00007fffe35536d0 in function_call (func=0x7fffe157e230, 
    arg=0x7fffe3050e90, kw=0x0) at ../Objects/funcobject.c:526
#7  0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe157e230, 
    arg=arg@entry=0x7fffe3050e90, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#8  0x00007fffe344b7bd in instancemethod_call.8988 (func=0x7fffe157e230, 
---Type <return> to continue, or q <return> to quit--- 
    arg=0x7fffe3050e90, kw=0x0) at ../Objects/classobject.c:2602
#9  0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe1190050, 
    arg=arg@entry=0x7fffe314d050, kw=<optimized out>)
    at ../Objects/abstract.c:2529
#10 0x00007fffe3538577 in PyEval_CallObjectWithKeywords (func=0x7fffe1190050, 
    arg=0x7fffe314d050, kw=<optimized out>) at ../Python/ceval.c:3889
#11 0x00007fffe34a0f92 in t_bootstrap.73787 (boot_raw=0x1091710)
    at ../Modules/threadmodule.c:614
#12 0x00007ffff789b182 in start_thread (arg=0x7fffe1183700)
    at pthread_create.c:312
#13 0x00007ffff75c847d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

linas · 2016-02-26T06:30:06Z

But the stack trace is not always the same; here is another one.

Fatal Python error: ceval: tstate mix-up

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe3518538 in PyEval_EvalFrameEx (f=f@entry=0x7fffe12286f0, 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:874
874 ../Python/ceval.c: No such file or directory.
(gdb) bt
#0  0x00007fffe3518538 in PyEval_EvalFrameEx (f=f@entry=0x7fffe12286f0, 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:874
#1  0x00007fffe351d059 in fast_function (nk=<optimized out>, na=0, n=0, 
    pp_stack=0x7fffffffc1b0, func=0x7fffe120c050) at ../Python/ceval.c:4106
#2  call_function (oparg=<optimized out>, pp_stack=0x7fffffffc1b0)
    at ../Python/ceval.c:4041
#3  PyEval_EvalFrameEx (f=f@entry=0x7fffe118c450, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#4  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, 
    argcount=argcount@entry=2, kws=0x141a710, kwcount=2, defs=0x7fffe1243888, 
    defcount=3, closure=closure@entry=0x0) at ../Python/ceval.c:3252
#5  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=2, n=6, 
    pp_stack=0x7fffffffc3d0, func=0x7fffe11c99b0) at ../Python/ceval.c:4116
#6  call_function (oparg=<optimized out>, pp_stack=0x7fffffffc3d0)
    at ../Python/ceval.c:4041
#7  PyEval_EvalFrameEx (f=f@entry=0x141a510, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#8  0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, 
    argcount=argcount@entry=1, kws=0x7fffe31311e0, kwcount=0, 
    defs=0x7fffe11f13b0, defcount=8, closure=closure@entry=0x0)
    at ../Python/ceval.c:3252
#9  0x00007fffe351cdd8 in fast_function (nk=<optimized out>, na=1, n=1, 
---Type <return> to continue, or q <return> to quit---
    pp_stack=0x7fffffffc5f0, func=0x7fffe11f1cf8) at ../Python/ceval.c:4116
#10 call_function (oparg=<optimized out>, pp_stack=0x7fffffffc5f0)
    at ../Python/ceval.c:4041
#11 PyEval_EvalFrameEx (f=f@entry=0x7fffe3131050, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#12 0x00007fffe351e54d in PyEval_EvalCodeEx (co=<optimized out>, 
    globals=<optimized out>, locals=locals@entry=0x0, 
    args=args@entry=0x7fffe3050de8, argcount=1, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=0x0) at ../Python/ceval.c:3252
#13 0x00007fffe35536d0 in function_call (func=0x7fffe11f3c80, 
    arg=0x7fffe3050dd0, kw=0x0) at ../Objects/funcobject.c:526
#14 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe11f3c80, 
    arg=arg@entry=0x7fffe3050dd0, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#15 0x00007fffe344b7bd in instancemethod_call.8988 (func=0x7fffe11f3c80, 
    arg=0x7fffe3050dd0, kw=0x0) at ../Objects/classobject.c:2602
#16 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe1a3adc0, 
    arg=arg@entry=0x7fffe314d050, kw=<optimized out>)
    at ../Objects/abstract.c:2529
#17 0x00007fffe3538577 in PyEval_CallObjectWithKeywords (
    func=func@entry=0x7fffe1a3adc0, arg=arg@entry=0x7fffe314d050, 
    kw=kw@entry=0x0) at ../Python/ceval.c:3889
#18 0x00007fffe3474988 in PyInstance_New (klass=<optimized out>, 
---Type <return> to continue, or q <return> to quit---
    arg=0x7fffe314d050, kw=0x0) at ../Objects/classobject.c:581
#19 0x00007fffe34bfd43 in PyObject_Call (func=func@entry=0x7fffe302dc80, 
    arg=arg@entry=0x7fffe314d050, kw=kw@entry=0x0)
    at ../Objects/abstract.c:2529
#20 0x00007fffe351a816 in do_call (nk=<optimized out>, na=0, 
    pp_stack=0x7fffffffcc00, func=0x7fffe302dc80) at ../Python/ceval.c:4238
#21 call_function (oparg=<optimized out>, pp_stack=0x7fffffffcc00)
    at ../Python/ceval.c:4043
#22 PyEval_EvalFrameEx (f=f@entry=0x7fffe305b200, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#23 0x00007fffe351e54d in PyEval_EvalCodeEx (co=co@entry=0x7fffe30517b0, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70, 
    args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=closure@entry=0x0) at ../Python/ceval.c:3252
#24 0x00007fffe351e682 in PyEval_EvalCode (co=co@entry=0x7fffe30517b0, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70)
    at ../Python/ceval.c:667
#25 0x00007fffe3553882 in run_mod (arena=0x104c0a0, flags=0x0, 
    locals=0x7fffe311bd70, globals=0x7fffe311bd70, filename=<optimized out>, 
    mod=<optimized out>) at ../Python/pythonrun.c:1370
#26 PyRun_FileExFlags (fp=fp@entry=0x1028d80, filename=<optimized out>, 
    start=start@entry=257, globals=0x7fffe311bd70, locals=0x7fffe311bd70, 
    closeit=closeit@entry=1, flags=0x0) at ../Python/pythonrun.c:1356
---Type <return> to continue, or q <return> to quit---
#27 0x00007fffe3553a59 in builtin_execfile (self=<optimized out>, 
    args=<optimized out>) at ../Python/bltinmodule.c:803
#28 0x00007fffe351d0d4 in call_function (oparg=<optimized out>, 
    pp_stack=0x7fffffffcfa0) at ../Python/ceval.c:4020
#29 PyEval_EvalFrameEx (f=f@entry=0x7fffe305b050, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#30 0x00007fffe351e54d in PyEval_EvalCodeEx (co=co@entry=0x7fffe3051730, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70, 
    args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, 
    kwcount=kwcount@entry=0, defs=defs@entry=0x0, defcount=defcount@entry=0, 
    closure=closure@entry=0x0) at ../Python/ceval.c:3252
#31 0x00007fffe351e682 in PyEval_EvalCode (co=co@entry=0x7fffe3051730, 
    globals=globals@entry=0x7fffe311bd70, locals=locals@entry=0x7fffe311bd70)
    at ../Python/ceval.c:667
#32 0x00007fffe35184b9 in run_mod (arena=0x109a720, flags=<optimized out>, 
    locals=0x7fffe311bd70, globals=0x7fffe311bd70, 
    filename=0x7fffe35e9f2b "<string>", mod=<optimized out>)
    at ../Python/pythonrun.c:1370
#33 PyRun_StringFlags (str=<optimized out>, start=257, globals=0x7fffe311bd70, 
    locals=0x7fffe311bd70, flags=<optimized out>) at ../Python/pythonrun.c:1333
#34 0x00007fffe3935f63 in opencog::PythonEval::execute_string (
    this=<optimized out>, command=0x10307e8 "execfile('b.py')\n")
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:618
---Type <return> to continue, or q <return> to quit---
#35 0x00007fffe393370c in opencog::PythonEval::apply_script (
    this=this@entry=0xb4d970, script=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:938
#36 0x00007fffe3933c18 in opencog::PythonEval::eval_expr (this=0xb4d970, 
    partial_expr=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.cc:1279
#37 0x00007fffe3b421e1 in eval (expr=..., this=0xb4d970)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonEval.h:172
#38 opencog::PythonSCM::eval (this=<optimized out>, pystr=...)
    at /home/linas/src/novamente/src/atomspace-git/opencog/cython/PythonSCM.cc:112
#39 0x00007fffe3b3ecbb in opencog::SchemePrimitive<opencog::PythonSCM>::invoke(scm_unused_struct*) [clone .local.0] (this=0xb4d590, args=0xf91070)
    at /home/linas/src/novamente/src/atomspace-git/opencog/guile/SchemePrimitive.h:440
#40 0x00007fffeefc1694 in opencog::PrimitiveEnviron::do_call (sfe=0xf8ab60, 
    arglist=0xf91070)
    at /home/linas/src/novamente/src/atomspace-git/opencog/guile/SchemePrimitive.cc:146
#41 0x00007ffff7b6df53 in vm_debug_engine (thread=0x7fffe12286f0, vp=0x790d80, 
    registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=0)
---Type <return> to continue, or q <return> to quit---
    at ../../libguile/vm-engine.c:800
#42 0x00007ffff7b73969 in scm_call_n (proc=0xa564a0, argv=0x0, nargs=0)
    at ../../libguile/vm.c:1236
#43 0x00007ffff7b6df53 in vm_debug_engine (thread=0x7fffe12286f0, vp=0x790d80, 
    registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=2)
    at ../../libguile/vm-engine.c:800
#44 0x00007ffff7b73969 in scm_call_n (proc=0x7ffff7fe0030, 
    argv=argv@entry=0x7fffffffd738, nargs=nargs@entry=1)
    at ../../libguile/vm.c:1236
#45 0x00007ffff7afddf7 in scm_primitive_eval (exp=exp@entry=0x842910)
    at ../../libguile/eval.c:654
#46 0x00007ffff7afde53 in scm_eval (exp=0x842910, 
    module_or_state=module_or_state@entry=0x8242d0)
    at ../../libguile/eval.c:688
#47 0x00007ffff7b450dd in scm_shell (argc=3, argv=0x7fffffffdd78)
    at ../../libguile/script.c:454
#48 0x00007ffff7b1543d in invoke_main_func (body_data=0x7fffffffdc20)
    at ../../libguile/init.c:338
#49 0x00007ffff7af7e1a in c_body (d=0x7fffffffdb60)
    at ../../libguile/continuations.c:429
#50 0x00007ffff7b70fc2 in vm_regular_engine (thread=0x7fffe12286f0, 
    vp=0x790d80, registers=0x7fffe38ad440 <_Py_CheckRecursionLimit>, resume=10)
    at ../../libguile/vm-engine.c:800
#51 0x00007ffff7b73969 in scm_call_n (proc=proc@entry=0x836ca0, 
---Type <return> to continue, or q <return> to quit---
    argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1236
#52 0x00007ffff7afcc69 in scm_call_0 (proc=proc@entry=0x836ca0)
    at ../../libguile/eval.c:473
#53 0x00007ffff7b65212 in catch (tag=tag@entry=0x404, thunk=0x836ca0, 
    handler=0x836c80, pre_unwind_handler=0x836c60)
    at ../../libguile/throw.c:135
#54 0x00007ffff7b65645 in scm_catch_with_pre_unwind_handler (
    key=key@entry=0x404, thunk=<optimized out>, handler=<optimized out>, 
    pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:249
#55 0x00007ffff7b656ff in scm_c_catch (tag=tag@entry=0x404, 
    body=body@entry=0x7ffff7af7e10 <c_body>, 
    body_data=body_data@entry=0x7fffffffdb60, 
    handler=handler@entry=0x7ffff7af81a0 <c_handler>, 
    handler_data=handler_data@entry=0x7fffffffdb60, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7ffff7af7f50 <pre_unwind_handler>, pre_unwind_handler_data=pre_unwind_handler_data@entry=0x7f14a0)
    at ../../libguile/throw.c:370
#56 0x00007ffff7af84ef in scm_i_with_continuation_barrier (
    body=body@entry=0x7ffff7af7e10 <c_body>, 
    body_data=body_data@entry=0x7fffffffdb60, 
    handler=handler@entry=0x7ffff7af81a0 <c_handler>, 
    handler_data=handler_data@entry=0x7fffffffdb60, 
    pre_unwind_handler=pre_unwind_handler@entry=0x7ffff7af7f50 <pre_unwind_handler>, pre_unwind_handler_data=0x7f14a0) at ../../libguile/continuations.c:367
---Type <return> to continue, or q <return> to quit---
#57 0x00007ffff7af8585 in scm_c_with_continuation_barrier (
    func=<optimized out>, data=<optimized out>)
    at ../../libguile/continuations.c:463
#58 0x00007ffff7b631ac in with_guile_and_parent (
    base=base@entry=0x7fffffffdbc0, data=data@entry=0x7fffffffdbf0)
    at ../../libguile/threads.c:786
#59 0x00007ffff7185812 in GC_call_with_stack_base (
    fn=fn@entry=0x7ffff7b63160 <with_guile_and_parent>, 
    arg=arg@entry=0x7fffffffdbf0) at misc.c:1925
#60 0x00007ffff7b63668 in scm_i_with_guile_and_parent (parent=<optimized out>, 
    data=data@entry=0x7fffffffdbf0, 
    func=func@entry=0x7ffff7b15420 <invoke_main_func>)
    at ../../libguile/threads.c:829
#61 scm_with_guile (func=func@entry=0x7ffff7b15420 <invoke_main_func>, 
    data=data@entry=0x7fffffffdc20) at ../../libguile/threads.c:835
#62 0x00007ffff7b155f2 in scm_boot_guile (argc=argc@entry=3, 
    argv=argv@entry=0x7fffffffdd78, 
    main_func=main_func@entry=0x400a60 <inner_main>, closure=closure@entry=0x0)
    at ../../libguile/init.c:321
#63 0x0000000000400910 in main (argc=3, argv=0x7fffffffdd78)
    at ../../libguile/guile.c:101
(gdb)

linas · 2016-02-26T06:34:40Z

The second, long stack trace is on thread 1. The first, short stack is from thread 19. I'm thinking that python is (accidentally) being called from multiple threads, without a lock.

linas · 2016-02-26T06:53:00Z

Adding additional locks to the PythonEval ctor and dtor has no effect. I reviewed all locks, and they seem OK to me.

linas · 2016-02-26T07:04:14Z

Googling this error message suggests to not call PyEval_ReleaseLock()
http://stackoverflow.com/questions/8451334/why-is-pygilstate-release-throwing-fatal-python-errors

linas · 2016-02-26T07:28:25Z

@inflector if you can review this, that would be great. If I stub out the PyEval_ReleaseLock the test case above works. A more complex test sort-of works ... a bunch of python does get executed, some rather complex files load, and they do all kids of python ROS stuff, and python prints do print out various messages, just fine, .. .for a while. But after a few seconds it deadlocks in opencog::PythonEval hangs, waiting on some python lock.

I don't understand why it did not dead-lock immediately. Stubbing out PyEval_ReleaseLock() does not deadlock python right away, it only does it after a while ...

inflector · 2016-02-26T14:24:46Z

@linas I haven't played with ROS at all so far. It may take me a bit to get it setup.

Is rospy written entirely in Python? Does it use any C/C++ or Cython that we wrote that's being linked in dynamically? Or it is just a Python wrapper for an ROS library?

(EDIT: No need to answer these now as I've already figured this out.)

inflector · 2016-02-26T14:42:28Z

One thing that does come to mind is that you need to call PyEval_InitThreads(); in every thread that's not a Python thread that calls into Python's C API. Failing to do that will give you the tstate crash. Who is creating the threads in rospy? Are some of them non-Python threads that ROS generates via C++ for example?

inflector · 2016-02-26T14:47:26Z

And the suggestion to not release the GIL lock only applies to Python module definitions, i.e. extending Python with C++, not to the case of initializing threads in C++ for a system that will be using the Python C API to embed Python.

inflector · 2016-02-26T14:58:23Z

It's also possible that in OpenCog we're running an atypical case with rospy, one that may not even work reliably. That is an embedded Python interpreter importing and running rospy rather than a Python program doing so.

linas · 2016-02-26T15:46:39Z

You realize I'm talking about https://github.com/opencog/atomspace/blob/master/opencog/cython/PythonEval.cc right? It does seem to call PyEval_InitThreads() and PyGILState_Ensure() in all the right places.

I doubt that the bug has anything to directly to do with ROS -- there is no mention of ROS in the stack frames. I suspect that the error can be reproduced with other python code...

inflector · 2016-02-26T15:57:52Z

Hmmm, it doesn't appear that this path (i.e. guile from a linux command prompt loading PythonEval via an SCM module) provides a place for our PythonEval system to get initialized. This use of Python from guile outside of a CogServer appears to be new in the last year. I don't recall that being possible before.

In particular, if you call guile from the unix prompt, rather than inside of the CogServer, for instance, then global_python_initialize() will never get called so when you use a PythonEval object it won't have been properly initialized. Our previous usages of Python via a guile prompt were all done inside the CogServer where global_python_initialize() is being called during initialization.

The comment here in PythonSCM.cc hints at the problem.

So the guile/scheme module initialization code that gets called when doing:

(use-modules (opencog) (opencog python))

probably:

void PythonSCM::init(void)

needs to call global_python_initialize().

While you are fixing that, I'd remove the incorrect comments inside global_python_initialize() in PythonEval.cc, i.e change:

    // Throw an exception if this is called more than once.
    if (already_initialized) {
        return;
        throw opencog::RuntimeException(TRACE_INFO,
            "Python initializer global_python_init() called twice.");
    }

to:

    // If we've already initialized then there is nothing to do...
    if (already_initialized)
        return;

or something like that.

inflector · 2016-02-26T16:07:35Z

I'd fix this myself, but I'm not sure which of the scheme module initialization routines is the right one.

linas · 2016-02-26T16:10:08Z

Its also a heisenbug: if, after the rospy.init call, I call os.system("ls") then the above example does not crash. Something about the additional time delay? Or something else?

linas · 2016-02-26T16:36:06Z

I take that back ... it didn't crash, because I had a syntax error (tabs not spaces) Once I fixed the whitespace, it starts crashing again.

linas · 2016-02-26T16:37:18Z

Anyway, this has nothing to do with scheme. I assume it's reproducible just by called PythonEval::insance() from a C++ main() and moving forward from there.

linas · 2016-02-26T16:53:02Z

Here, the below crashes, with no guile in it, at all: its just pure opencog-python:

#include <string>
#include <opencog/cython/PythonEval.h>
using namespace opencog;

int main(int argc, char* argv[])
{
   PythonEval& pyev = PythonEval::instance();
   std::string rv;

   std::string pystr = "execfile('b.py')";
   rv = pyev.eval(pystr);
   printf ("its %s\n", rv.c_str());
   return 0;
}

It crashes with the same collection of error messages, and the same typical stack traces (except, of course, the guile part is not in the stack trace any more.)

inflector · 2016-02-26T17:06:03Z

You realize I'm talking about PythonEval.cc right?

Yes, I was focusing on what was different about this test case path. The first thing that stood out was rospy.

this has nothing to do with scheme

It was the missing global_python_initialize that I thought was the problem, not guile per se.

However, I just noticed that you added a call to global_python_initialize inside the PythonEval constructor last May. The presence of that call changes things since the creation of the singleton should now also call global_python_initialize during the constructor for the singleton.

linas · 2016-02-26T17:23:30Z

I can't reproduce the error by using something other than rospy, although I did not try very hard, because I don't have any complex python code laying around that I could experiment with.

It is somehow related to the execfile. Without the execfile, there is no problem. For example, the following works (if the cogserver is started):

echo "py" | cat - b.py |netcat localhost 17001

linas · 2016-02-26T17:38:51Z

Replacing std::string pystr = "execfile('b.py')"; by std::string pystr = "import b"; results in the same crash.

inflector · 2016-02-26T19:28:54Z

I'm betting that it's the interaction between execfile and rospy spawning some python threads to do the connections. I've got rospy installed and am reproducing the problem on this end now too, and trying to narrow down the test case.

inflector · 2016-02-26T23:20:46Z

I believe the problem is related to the thread spawning in rospy.impl.init start_node which calls:

    node = rosgraph.xmlrpc.XmlRpcNode(port, handler, on_run_error=_node_run_error)
    node.start()

node.start in turn calls:

    def start(self):
        """
        Initiate a thread to run the XML RPC server. Uses thread.start_new_thread.
        """
        _thread.start_new_thread(self.run, ())

this appears to be something you can't do while you have the GIL lock which you must have in order to invoke the interpreter.

I've simplified this down by going straight to the Python C api calls required to call the rospy code. So this reproduces the bug without any OpenCog-specific code whatsoever, and no external files:

#include <Python.h>

int main(int argc, char* argv[])
{
    Py_InitializeEx(0);
    PyEval_InitThreads();

    static const char *argv0 = "test";
    PySys_SetArgv(1, (char **) &argv0);

    PyEval_ReleaseLock();

    PyGILState_STATE gstate;
    gstate = PyGILState_Ensure();

    PyRun_SimpleString("import rospy");
    printf ("import rospy - successful\n");

    PyRun_SimpleString("rospy.init_node('OpenCog_Eva')");
    PyGILState_Release(gstate);

    printf ("finished\n");
    return 0;
}

The above duplicates the problem.

I believe it comes down to spawning a new thread while already holding the GIL and then not returning to the same Python thread that was active when the gill was being held. I tried putting rospy.spin() just after the init_node call but it didn't make a difference, which leads me to believe that the problem is not that the call is returning before all the threads have finished.

I'm not at all sure why you can call the same code from the python prompt and it works fine.

inflector · 2016-02-26T23:57:57Z

I've posted a question to "answers.ros.org":

http://answers.ros.org/question/227691/crash-when-calling-init_node-while-holding-gil/

Let's see if anyone on the ros team has seen this before.

linas · 2016-02-27T02:18:18Z

If I comment out the PyEval_ReleaseLock(); in the above, then it doesn't crash!

linas · 2016-02-27T02:45:19Z

... the example doesn't crash, but in the 'regular' code, it results in a deadlock...

linas · 2016-02-27T03:03:14Z

Replacing PyEval_ReleaseLock() by PyEval_SaveThread(); allows the demo to finish, but causes the cogserver to crash, during cogserver init, in PySys_SetArgvEx() ...

linas · 2016-02-27T03:08:44Z

surrounding PySys_SetArgvEx() byt GLI state ensure moves the crash down the line, to import_opencog__agent_finder ... feels like forward progress.

linas · 2016-02-27T03:12:12Z

woo-hoo its fixed! Two fixes: one to use PyEval_SaveThread(); instead of PyEval_ReleaseLock() in the atomspace, and a second to surround the opencog module with a GIL

This is part 2 of fix to bug opencog/atomspace#671

Bug fix for bug #671

linas · 2016-02-27T04:07:07Z

Using PyEval_SaveThread(); instead of PyEval_ReleaseLock() uncovered multiple locations where GIL locks were missing. I think I got them all, now. All pull requests have merged. The code that originally provoked this now works! Yayyyyy!

linas · 2016-02-27T04:35:21Z

p.s. thanks inflector, this would have remained unsolved without the shortened version!

inflector · 2016-02-27T12:40:39Z

I posted a note on ROS Answers. There are two additional issues one needs to worry about.

That the rospy threads terminate correctly. I'm not sure how to do this in the context of your use case. You can't exit the main Python thread, i.e. call Py_Finalize() until rospy is shutdown. The simplest way to do this is to call rospy.spin(), but I'm not sure where you'd do this in your use case. See:

http://wiki.ros.org/rospy/Overview/Initialization%20and%20Shutdown#Testing_for_shutdown

We need to restore the thread state saved by PyEval_SaveThread() using PyThreadState_Swap(saved_thread_state) just before the call to Py_Finalize().

I can take care of 2), but doing 1) will require some sort of exit hook function registration system analogous to rospy's on_shutdown() hook. I imagine a list of python functions that gets iterated over prior to opencog shutdown. The code that initializes rospy would have to register one of these new functions.

inflector · 2016-02-27T12:46:19Z

The funny thing is that I had tried PyEval_SaveThread() several times in various permutations but was running into crashes from the other issues and assumed the problem wasn't PyEval_SaveThread().

Your finding it to work for your case gave me the impetus to try it again and work through the other issues with the shortened sample code.

linas · 2016-02-27T16:58:15Z

I did an unclean hack for #2 already, it seems to work fine. For #1 I think its OK to punt. The current behavior node does a while not rospy.is_shutdown() i.e effectively runs an infinite loop in the atomspace. If I exit guile or kill the cogserver, it all comes crashing down, which is maybe not pretty but its acceptable. Clean shutdowns need to be deferred until we gt an actual agent deisgn, instead of the current kludge.

inflector · 2016-02-27T18:13:17Z

Calling while not rospy.is_shutdown() is actually one of the recommended ways to ensure that rospy shuts down before the threads terminate. That's pretty much all that rospy.spin() does.

AraHaan · 2017-06-08T08:13:13Z

I personally have an issue with crashing as well but when I am embedding the interpreter when I try to convert the char *argv[] param in main to wchar_t * using Py_DecodeLocale, and then passing that wchar_t * return data to PySys_SetArgvEx. This is code I basically use (Python 3.5+ only):

int
#ifdef _WIN32
wmain(int argc, wchar_t *argv[])
#else
main(int argc, char *argv[])
#endif
{
  int err;
  wchar_t *program = Py_DecodeLocale(argv[0], NULL);
  if (program == NULL) {
    fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
    exit(1);
  }
  Py_SetProgramName(program);  /* optional but recommended */
  Py_Initialize();
  int initialized = Py_IsInitialized();
  if (initialized != 0) {
#ifdef _WIN32
    /*
      allows use of sys.argv to be possible
      with no tracebacks.

      Note: This is not possible to use in any other
      platform for now until someone helps patch
      this for me (I am not sure what to do to
      make it work like this on them).
    */
    PySys_SetArgvEx(argc, argv, 0);
    // code that reads the script from the PE resource section and runs it.
#else
    //crash here when I try to use it the wchar_t * version of argv on platforms not Windows.
#endif
//snip.

linas · 2017-06-08T21:26:26Z

Hi @AraHaan are you sure you are in the right place? I'm pretty sure that OpenCog does not compile on Windows, and I know that we do not initialize python in main, -- it is a loadable module that gets loaded only if you want to use python. Also, none of our code has been ported to python 3.5 yet -- we need someone knowledgable in python to do this. And finally, we do not use wchar anywhere, we use UTF8 exclusively for all strings.

AraHaan · 2017-06-09T01:09:03Z

Well for python 3.5+ PySys_SetArgvEx requires that argv is of wchar_t * to use and that it says to use Py_DecodeLocale with it. Problem on my end is that if I do it on non windows (and even on Windows before) the code somehow crashes. Also this was the only place where google given me results about this other than stack overflow (which I cant post questions or answer any anymore on) just to get the fixes I need to the crashes. Because of such I have temporarily put in an #error for non Windows systems until it is fixed.

linas · 2017-06-09T01:43:14Z

We don't call PySys_SetArgvEx anywhere; I just grepped the code. We also do not call Py_DecodeLocale, and we don't support python 3.5, and we don't support Windows, and we don't support Mac OSX or Android. So I believe you are mistakenly posting to the wrong place.

linas changed the title ~~Python crash heisenbug~~ Python crash Feb 26, 2016

linas changed the title ~~Python crash~~ Python crash related to thread race Feb 26, 2016

linas added the python label Feb 26, 2016

linas added a commit to linas/atomspace that referenced this issue Feb 27, 2016

Bug fix for bug opencog#671

cb5cf6a

linas added a commit to linas/opencog that referenced this issue Feb 27, 2016

Add GIL lock to pprotect python code.

6c7a0a9

This is part 2 of fix to bug opencog/atomspace#671

linas mentioned this issue Feb 27, 2016

Add GIL lock to protect python code. opencog/opencog#2055

Merged

linas added a commit that referenced this issue Feb 27, 2016

Merge pull request #675 from linas/rl

ffb27f8

Bug fix for bug #671

linas mentioned this issue Feb 27, 2016

Add more GIL locks opencog/opencog#2056

Merged

linas closed this as completed Feb 27, 2016

Python crash related to thread race #671

Python crash related to thread race #671

Comments

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

linas commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

inflector commented Feb 26, 2016

linas commented Feb 26, 2016

linas commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

inflector commented Feb 26, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

linas commented Feb 27, 2016

inflector commented Feb 27, 2016

inflector commented Feb 27, 2016

linas commented Feb 27, 2016

inflector commented Feb 27, 2016

AraHaan commented Jun 8, 2017 • edited Loading

linas commented Jun 8, 2017

AraHaan commented Jun 9, 2017 • edited Loading

linas commented Jun 9, 2017

AraHaan commented Jun 8, 2017 •

edited

Loading

AraHaan commented Jun 9, 2017 •

edited

Loading