Test suite segfault with python 3.11 RC1 #85

Closed
jamesjer opened this issue Sep 8, 2022 · 5 comments

@jamesjer
Contributor

jamesjer commented Sep 8, 2022

I attempted to build pyml 20220905 for Fedora Rawhide, which currently has python 3.11 RC1. The test suite failed with a segfault:

File "dune", line 41, characters 8-18:
41 |   (name pyml_tests)
             ^^^^^^^^^^
Command [11] got signal SEGV:
$ (cd _build/default && ./pyml_tests.exe)
...
Test 'string conversion error' ... Caught failure: Type mismatch: String or Unicode expected. Got: Long (0)
passed
Test 'float conversion error' ... 

GDB says:

Program received signal SIGSEGV, Segmentation fault.
0x00007fc0ba1e5a97 in unicode_fromformat_write_cstr (writer=0x7ffd6bf72b70, str=0x0, width=-1, precision=100) at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/unicodeobject.c:2773
2773	        while (length < precision && str[length]) {
(gdb) bt
#0  0x00007fc0ba1e5a97 in unicode_fromformat_write_cstr (writer=0x7ffd6bf72b70, str=0x0, width=-1, precision=100)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/unicodeobject.c:2773
#1  0x00007fc0ba1bd50c in unicode_fromformat_arg (vargs=0x7ffd6bf72bd0, f=<optimized out>, writer=0x7ffd6bf72b70)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/unicodeobject.c:2983
#2  PyUnicode_FromFormatV (format=<optimized out>, vargs=<optimized out>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/unicodeobject.c:3100
#3  0x00007fc0ba2523ea in _PyErr_FormatV (tstate=0x7fc0ba51ff90 <_PyRuntime+166320>, 
    exception=0x7fc0ba40ab00 <_PyExc_TypeError>, 
    format=0x7fc0ba2e9310 "'%s' not supported between instances of '%.100s' and '%.100s'", 
    vargs=vargs@entry=0x7ffd6bf72c60) at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Python/errors.c:1078
#4  0x00007fc0ba2a6ae3 in _PyErr_Format (tstate=<optimized out>, exception=<optimized out>, format=<optimized out>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Python/errors.c:1104
#5  0x00007fc0ba1e5034 in do_richcompare (op=0, w=0x7fc0ba40ab00 <_PyExc_TypeError>, v=0x55755d4647f0, 
    tstate=0x7fc0ba51ff90 <_PyRuntime+166320>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Include/object.h:133
#6  PyObject_RichCompare (v=0x55755d4647f0, w=0x7fc0ba40ab00 <_PyExc_TypeError>, op=<optimized out>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/object.c:729
#7  0x00007fc0ba1e4e84 in PyObject_RichCompareBool (v=<optimized out>, w=<optimized out>, op=<optimized out>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/object.c:751
#8  0x000055755d050346 in rich_compare_bool_nofail (opid=0, o2=0x7fc0ba40ab00 <_PyExc_TypeError>, o1=0x55755d4647f0)
    at /builddir/build/BUILD/pyml-20220905/_build/default/pyml_stubs.c:222
#9  pycompare (v1=<optimized out>, v2=<optimized out>)
    at /builddir/build/BUILD/pyml-20220905/_build/default/pyml_stubs.c:247
#10 0x000055755d0dd1aa in do_compare_val ()
#11 0x000055755d0dd781 in caml_equal ()
#12 0x000055755d071fff in camlPy__python_exception_2003 () at py.ml:799
#13 0x000055755d073979 in camlPy__to_float_2596 () at py.ml:1210
#14 0x000055755d04d5b6 in camlDune__exe__Pyml_tests__fun_1983 () at pyml_tests.ml:331
#15 0x000055755d067fc3 in camlPyml_tests_common__launch_test_320 () at pyml_tests_common.ml:18
#16 0x000055755d0681d3 in camlPyml_tests_common__launch_tests_530 () at pyml_tests_common.ml:41
#17 0x000055755d0686ca in camlPyml_tests_common__main_613 () at pyml_tests_common.ml:92
#18 0x000055755d050224 in camlDune__exe__Pyml_tests__entry () at pyml_tests.ml:714
#19 0x000055755d049289 in caml_program ()
#20 0x000055755d0f7329 in caml_start_program ()
#21 0x000055755d0f76bc in caml_startup_common ()
#22 0x000055755d0f773f in caml_main ()
#23 0x000055755d048e52 in main ()

A null pointer is being passed to unicode_fromformat_write_cstr. Working up the stack to see where that comes from leads to frame 6, where we see that object v has a bogus type:

(gdb) frame 6
#6  PyObject_RichCompare (v=0x55755d4647f0, w=0x7fc0ba40ab00 <_PyExc_TypeError>, op=<optimized out>)
    at /usr/src/debug/python3.11-3.11.0~rc1-2.fc38.x86_64/Objects/object.c:729
729	    PyObject *res = do_richcompare(tstate, v, w, op);
(gdb) print *v
$1 = {ob_refcnt = 5, ob_type = 0x7fc0baeed960}
(gdb) print *$1.ob_type
$2 = {ob_base = {ob_base = {ob_refcnt = 0, ob_type = 0x0}, ob_size = 0}, tp_name = 0x0, tp_basicsize = 0, 
  tp_itemsize = 0, tp_dealloc = 0x0, tp_vectorcall_offset = 0, tp_getattr = 0x0, tp_setattr = 0x0, tp_as_async = 0x0, 
  tp_repr = 0x0, tp_as_number = 0x0, tp_as_sequence = 0x0, tp_as_mapping = 0x0, tp_hash = 0x0, tp_call = 0x0, 
  tp_str = 0x0, tp_getattro = 0x0, tp_setattro = 0x0, tp_as_buffer = 0x0, tp_flags = 0, tp_doc = 0x0, 
  tp_traverse = 0x0, tp_clear = 0x0, tp_richcompare = 0x0, tp_weaklistoffset = 0, tp_iter = 0x0, tp_iternext = 0x0, 
  tp_methods = 0x0, tp_members = 0x0, tp_getset = 0x0, tp_base = 0x0, tp_dict = 0x0, tp_descr_get = 0x0, 
  tp_descr_set = 0x0, tp_dictoffset = 0, tp_init = 0x0, tp_alloc = 0x0, tp_new = 0x0, tp_free = 0x0, tp_is_gc = 0x0, 
  tp_bases = 0x0, tp_mro = 0x0, tp_cache = 0x0, tp_subclasses = 0x0, tp_weaklist = 0x0, tp_del = 0x0, 
  tp_version_tag = 0, tp_finalize = 0x0, tp_vectorcall = 0x0}

The null tp_name field is what ultimately causes the segfault. (The other object, w, is an instance of TypeError.) The comparison with the bogus object happens on line 799 of py.ml (frame 12):

Lazy.force ocaml_exception_class = ptype

This seems to mean that ptype is the instance of TypeError, and ocaml_exception_class contains a Python object with a bad ob_type field.

Address randomization seems to be involved. Fedora builds PIE objects by default.

  • GDB sees the segfault only if set disable-randomization off is executed prior to starting execution. Otherwise, GDB sees the tests complete successfully.
  • Valgrind runs always complete successfully, with no reports of use-after-free or out-of-bound access errors. I don't know how to make valgrind enable address randomization.
  • Running setarch -R ./pyml_tests.exe consistently succeeds, and omitting the setarch call consistently segfaults.

I'm willing to experiment if anybody has an idea.

@thierry-martinez
Owner

thierry-martinez commented Sep 9, 2022

Thank you very much for your report. Could you help me reproduce it? I tried in the branch https://github.com/thierry-martinez/pyml/tree/github-actions-python311 without success: there are two GitHub Actions, one using Fedora Rawhide and the other using Ubuntu with Python 3.11 RC1 compiled from scratch, and both run the test suite successfully (as can be seen here: https://github.com/thierry-martinez/pyml/actions/runs/3025355753).

The Dockerfiles used by these actions are in https://github.com/thierry-martinez/pyml/tree/github-actions-python311/dockerfiles/fedora-rawhide and https://github.com/thierry-martinez/pyml/tree/github-actions-python311/dockerfiles/python-3.11.0rc1 .

If you could provide a Dockerfile that reproduces the segmentation fault, it would be very helpful!
Otherwise, if you could copy the full log of the tests, including the preamble at the beginning that gives the path and the name of the library, that would help as well. Thank you very much!

@jamesjer
Contributor Author

jamesjer commented Oct 3, 2022

Sorry for the long delay. I got caught up in a mad rush to get other things finalized before Fedora 37 final freeze. I don't have a Dockerfile, sorry. I'm using mock, which is what Fedora uses to build packages as well.

I did a little debugging to see if I could gather more information, and I think I see the problem. Here's what happens.

  1. The Python library is loaded.
  2. We run the "ocaml exception" test at pyml_tests.ml line 146.
  3. The ocaml_exception_class object is created. Its ob_type pointer points to PyType_Type.
  4. We run the "reinitialize" test at pyml_tests.ml line 292. This reloads the library. If I use either address randomization or the python debug library (in the python3-debug package on Fedora), then the library is reloaded at a different address. The ocaml_exception_class object now has an ob_type pointer that points to something other than PyType_Type.
  5. We run the "float conversion error" test. When we try to compare the thrown exception with ocaml_exception_class, we get the segfault reported above.

The bottom line is that I think that the object in ocaml_exception_class (if there is one) needs to be discarded prior to reloading the python library.
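
The steps above can be sketched in plain OCaml, without linking against Python at all. This is only an illustration of the idea, not pyml's actual code: the names resettable, register_finalizer, initialize, finalize and the generation counter are all hypothetical stand-ins. The counter plays the role of "which load of the Python library created this object", and the cached value plays the role of ocaml_exception_class.

```ocaml
(* A cached value that, unlike a plain Lazy.t, can be forgotten and
   recomputed. All names here are hypothetical, not pyml's API. *)
type 'a resettable = {
  compute : unit -> 'a;
  mutable cached : 'a option;
}

let make compute = { compute; cached = None }

(* Return the cached value, computing it on first use. *)
let force r =
  match r.cached with
  | Some v -> v
  | None ->
      let v = r.compute () in
      r.cached <- Some v;
      v

(* Drop the cached value so the next [force] recomputes it. *)
let forget r = r.cached <- None

(* Stand-ins for loading and unloading the Python library. *)
let generation = ref 0
let finalizers : (unit -> unit) list ref = ref []
let register_finalizer f = finalizers := f :: !finalizers
let initialize () = incr generation
let finalize () = List.iter (fun f -> f ()) !finalizers

(* Stand-in for ocaml_exception_class: it remembers which library
   load created it. Registering [forget] as a finalizer ensures the
   object does not outlive the library it came from. *)
let exception_class = make (fun () -> !generation)
let () = register_finalizer (fun () -> forget exception_class)

let () =
  initialize ();
  let first = force exception_class in
  finalize ();
  initialize ();
  let second = force exception_class in
  Printf.printf "first load: %d, after reload: %d\n" first second
```

Without the register_finalizer line, the second force would return the value created under the first load, which corresponds to the stale ob_type pointer in step 4.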

@thierry-martinez
Owner

Thank you for the debugging! I am still not able to reproduce the bug, but it is indeed bad to keep references to Python objects after the library has been unloaded. This should be fixed in #86: could you check whether that is indeed the case?

@jamesjer
Contributor Author

Yes, that fixes the issue. Thank you for the quick response!

thierry-martinez added a commit that referenced this issue Oct 24, 2022
…on_finalize

Fix #85: segmentation fault by forgetting objects on library unloading
@thierry-martinez
Owner

Thank you, @jamesjer! Sorry for the late merge. I will try to make a release soon.
