Fix cython's gc_track and gc_untrack #13896

nbruin · 2013-01-01T18:52:39Z

In a long sage-devel thread we eventually found (in this message) that a GC run during a weakref callback on a Cython class can lead to double deallocation of an instance of that class. In Python's Objects/typeobject.c, line 1024 onwards, there are comments indicating that earlier versions of Python were bitten by this problem too. The solution is to insert the appropriate PyObject_GC_UnTrack and PyObject_GC_Track calls in Cython's deallocation code. This is best fixed in Cython itself.
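The trigger itself can be reproduced in pure Python, where it is harmless because CPython's subtype_dealloc untracks the object before tearing it down. The following is a minimal sketch, assuming CPython's reference counting (so that `del` starts deallocation immediately); `Node` and `on_finalize` are made-up names for illustration and do not come from Sage or Cython.

```python
import gc
import weakref


class Node:
    """Plain Python class; instances are weak-referenceable and GC-tracked."""


events = []


def on_finalize(ref):
    # Runs while the referent is being torn down: the weak reference is
    # already dead, and we deliberately force a full collection here.
    events.append(("referent", ref()))
    events.append(("unreachable found", gc.collect()))


node = Node()
watcher = weakref.ref(node, on_finalize)
del node  # CPython: the refcount hits zero, so deallocation starts right away

print(events)
```

An extension type whose deallocator leaves the object tracked during this window is what the ticket is about: the collection forced inside the callback can then reach the half-destroyed object.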

Install only the new spkg at http://boxen.math.washington.edu/home/jdemeyer/spkg/cython-0.17.4.spkg

Upstream: Completely fixed; Fix reported upstream

CC: @simon-king-jena @jpflori

Component: memleak

Author: Robert Bradshaw

Reviewer: Jeroen Demeyer

Merged: sage-5.6.beta3

Issue created by migration from https://trac.sagemath.org/ticket/13896


nbruin · 2013-01-01T18:53:17Z

Patch to more reliably produce crash

nbruin · 2013-01-01T18:55:35Z

comment:1

Attachment: double-free-crash.patch.gz

With attached patch applied to 5.6.beta2 (and probably also other versions close to it),

sage -t devel/sage/sage/modules/module.pyx

will crash relatively reliably on several machines (including sage.math)

jpflori · 2013-01-02T16:48:59Z

comment:3

I'd like to see this ticket as a blocker, anyone against this idea?

nbruin · 2013-01-02T17:34:39Z

comment:4

Replying to @jpflori:

I'd like to see this ticket as a blocker, anyone against this idea?

Since this is the ultimate "can generate segfaults anywhere", it's a prime candidate for blocker status. However, we're fully at the mercy of cython developers as to when this gets fixed. Also, if we release with this bug unfixed, we might as well leave #715 in too, since this one has a much wider possible impact :-).

jpflori · 2013-01-02T19:22:43Z

comment:5

OK, I've made it a blocker.

For those who want to play while waiting for upstream, I've posted a p0 Cython spkg which does "something" with PyObject_GC_[Un]Track.
Not sure it makes any sense, but it seems to make our bug disappear.
It's at
http://boxen.math.washington.edu/home/jpflori/cython-0.17.3.p0.spkg

nbruin · 2013-01-02T19:42:08Z

comment:6

Apologies. I saw I linked to the wrong file. Include/object.h also has some interesting information, but it looks like it is a bit out-of-date on some bits. In particular, if you look at the actual use of the TRASHCAN macros:

    PyObject_GC_UnTrack(self);
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
    ...
  endlabel:
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

with the explanation a little lower:

       Q. Why the bizarre (net-zero) manipulation of
          _PyTrash_delete_nesting around the trashcan macros?

       A. Some base classes (e.g. list) also use the trashcan mechanism.
          The following scenario used to be possible:

          - suppose the trashcan level is one below the trashcan limit

          - subtype_dealloc() is called

          - the trashcan limit is not yet reached, so the trashcan level
            is incremented and the code between trashcan begin and end is
            executed

          - this destroys much of the object's contents, including its
            slots and __dict__

          - basedealloc() is called; this is really list_dealloc(), or
            some other type which also uses the trashcan macros

          - the trashcan limit is now reached, so the object is put on the
            trashcan's to-be-deleted-later list

          - basedealloc() returns

          - subtype_dealloc() decrefs the object's type

          - subtype_dealloc() returns

          - later, the trashcan code starts deleting the objects from its
            to-be-deleted-later list

          - subtype_dealloc() is called *AGAIN* for the same object

          - at the very least (if the destroyed slots and __dict__ don't
            cause problems) the object's type gets decref'ed a second
            time, which is *BAD*!!!

          The remedy is to make sure that if the code between trashcan
          begin and end in subtype_dealloc() is called, the code between
          trashcan begin and end in basedealloc() will also be called.
          This is done by decrementing the level after passing into the
          trashcan block, and incrementing it just before leaving the
          block.

          But now it's possible that a chain of objects consisting solely
          of objects whose deallocator is subtype_dealloc() will defeat
          the trashcan mechanism completely: the decremented level means
          that the effective level never reaches the limit.  Therefore, we
          *increment* the level *before* entering the trashcan block, and
          matchingly decrement it after leaving.  This means the trashcan
          code will trigger a little early, but that's no big deal.

It's probably better to leave out the trashcan for now. It looks like rather tricky code, and I'm not sure it's part of the official Python C-API (it might be internal, just as they themselves use some macros that they consider unsafe for extension modules).
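For context on what the trashcan protects against: destroying a deeply nested container would otherwise recurse once per nesting level on the C stack. A small sketch, assuming CPython (whose list deallocator uses the trashcan); `build_nested` and `DEPTH` are illustrative names, and the depth is an arbitrary large value.

```python
def build_nested(depth):
    """Return a list nested `depth` levels deep: [[[...[]...]]]."""
    inner = []
    for _ in range(depth):
        inner = [inner]
    return inner


DEPTH = 200_000  # far deeper than a naive recursive teardown could handle

deep = build_nested(DEPTH)
del deep  # each list_dealloc frees its single item, which is the next list
survived = True
print("deallocated", DEPTH, "nested lists without exhausting the C stack")
```

A deallocator that does not take part in this mechanism has no such bound on its recursion depth, which is the separate problem mentioned above.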

jpflori · 2013-01-02T19:54:25Z

comment:7

I saw and read about these additional steps around the macros, but I was not sure they were also needed here.

I agree it is better to leave that out for now; in any case, upstream will decide what is best.

So I've updated the spkg to not include the trashcan parts.

jpflori · 2013-01-02T19:54:51Z

Attachment: cython-0.17.3.p0.diff.gz

nbruin · 2013-01-02T21:39:49Z

comment:8

Replying to @jpflori:

I saw and read about these additional steps around the macros, but I was not sure they were also needed here.

In fact, I think the precautions taken are not enough for general cython classes. With the little

    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
    ...
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

dance they are making sure there is room for one extra trashcan nesting provided that that call doesn't use the same trick. However, a cython class could have a whole inheritance hierarchy going here (that would all use this trick too!), so I'm pretty sure that the exact scenario they describe could still happen. You'd need to know the depth of the inheritance line (for deallocs, multiple inheritance can't happen, right?) and ensure there's enough room for all those.
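The argument above can be checked with a toy model of the nesting counter. This is only a sketch of the reasoning, not CPython or Cython code: `Trashcan`, `plain_dealloc`, `dance_dealloc` and `run` are hypothetical names, and the limit of 50 is assumed to match PyTrash_UNWIND_LEVEL in CPython 2.7.

```python
UNWIND_LEVEL = 50  # assumed value of PyTrash_UNWIND_LEVEL


class Trashcan:
    def __init__(self, nesting):
        self.nesting = nesting   # models _PyTrash_delete_nesting
        self.delete_later = []   # models the to-be-deleted-later list


def plain_dealloc(can, obj, log):
    """A base deallocator that uses the trashcan macros without the dance."""
    if can.nesting < UNWIND_LEVEL:      # Py_TRASHCAN_SAFE_BEGIN
        can.nesting += 1
        log.append("base body")
        can.nesting -= 1                # Py_TRASHCAN_SAFE_END
    else:
        can.delete_later.append(obj)


def dance_dealloc(can, obj, log, levels):
    """`levels` stacked deallocators that all do the net-zero dance."""
    if levels == 0:
        plain_dealloc(can, obj, log)
        return
    can.nesting += 1                    # ++ before BEGIN
    if can.nesting < UNWIND_LEVEL:      # Py_TRASHCAN_SAFE_BEGIN
        can.nesting += 1
        can.nesting -= 1                # -- after BEGIN
        log.append("subtype body %d" % levels)
        dance_dealloc(can, obj, log, levels - 1)  # call the base dealloc
        can.nesting += 1                # ++ before END
        can.nesting -= 1                # Py_TRASHCAN_SAFE_END
    else:
        can.delete_later.append(obj)
    can.nesting -= 1                    # -- after END


def run(levels, start):
    can, log = Trashcan(start), []
    dance_dealloc(can, "obj", log, levels)
    # Hazard: a body already ran, yet the object was queued so that its
    # deallocator will be called again from the top later.
    hazard = bool(log) and bool(can.delete_later)
    return log, hazard, can.nesting


for levels in (1, 2):
    print(levels, run(levels, UNWIND_LEVEL - 2))
```

In this model one level of the dance above a plain trashcan user is safe, while two stacked levels starting two below the limit already run one body and then defer the object, which is the double-deallocation scenario.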

robertwb · 2013-01-02T22:24:25Z

comment:9

cython/cython@9a08ff2

Coming up with a nice clean test was...interesting.

jpflori · 2013-01-02T22:31:51Z

comment:10

Just one potentially naive question:
Shouldn't the object get retracked iff you're going to call another dealloc method?
Or conversely, if the type does not extend a previous type, shouldn't the object stay untracked when you call tp_free?
I'm not sure it would really matter if the object is still tracked in this latter case, but I got this feeling when staring at CPython's code today.

Anyway, it just made me think of what will happen if your extension class is GC tracked, but the base class is not? In this case you're lost because if you track your object before calling the base dealloc, then you will not untrack it there. Is that even possible? And anyway if a class is not gc tracked, or is not a container I guess it cannot be weakrefed...

robertwb · 2013-01-02T22:47:42Z

comment:11

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

jpflori · 2013-01-02T23:03:25Z

comment:12

Thanks for pointing that out.

robertwb · 2013-01-03T05:06:28Z

comment:13

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

nbruin · 2013-01-03T07:32:21Z

comment:14

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

nbruin · 2013-01-03T07:48:38Z

comment:15

Replying to @robertwb:

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

This does look good to me. JP has already confirmed that this fixed the issue (as does your elegant test in the cython suite). Your pre.spkg has some different files in it, but I guess that's why you don't consider it an actual spkg.

jpflori · 2013-01-03T12:39:39Z

comment:16

Replying to @robertwb:

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and you've retracked it in the subclass, then the final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list?
In particular won't a call to gc_list_remove(o) be missing?

robertwb · 2013-01-03T17:19:22Z

comment:17

Replying to @jpflori:

Replying to @robertwb:

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and you've retracked it in the subclass, then the final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list?
In particular won't a call to gc_list_remove(o) be missing?

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.
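The per-type flag discussed here is visible from Python, which makes it easy to check which types take part in GC. A small sketch, assuming CPython, where `type.__flags__` exposes tp_flags and Py_TPFLAGS_HAVE_GC is bit 14; `has_gc_flag` and `Holder` are illustrative names.

```python
import gc

Py_TPFLAGS_HAVE_GC = 1 << 14  # value taken from CPython's Include/object.h


def has_gc_flag(tp):
    """True if the type's tp_flags include Py_TPFLAGS_HAVE_GC."""
    return bool(tp.__flags__ & Py_TPFLAGS_HAVE_GC)


class Holder:
    """Ordinary Python class; instances carry a __dict__, so it is a GC type."""


for tp in (int, str, list, Holder):
    print(tp.__name__, has_gc_flag(tp))

# The flag is per type; gc.is_tracked() reports the state of one object.
print(gc.is_tracked([]), gc.is_tracked(42))
```

The same check applied to an extension type shows whether it was compiled as a GC type.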

robertwb · 2013-01-03T17:22:27Z

comment:18

Replying to @nbruin:

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

Yeah, this is a separate (and more complicated to resolve) issue.

nbruin · 2013-01-03T17:52:16Z

comment:19

Replying to @robertwb:

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.

... so suppose we have a superclass that doesn't do the untrack/track dance (so this must be a non-container superclass of a container class; we're entering rather hypothetical territory here). We'll be entering its dealloc with tracking SET. I guess the actual memory free is done by our class, so I guess the list of GC-tracked objects will be properly amended eventually. Can we prove that no GC or trashcan-shelving of this intermediate object will happen in between? I guess it's unlikely because non-container types should be easy to deallocate ... unless some callous person writes an extension class that does hold references to other objects but is convinced that those will never lead to cycles and hence makes it non-GC-tracked. Some weakref callbacks and a GC could then find a partially torn down object tracked by the GC. Multithreaded stuff could make this even worse, but I guess we're protected by the GIL here.

It should probably be mandated that any container type has to participate in GC. For a non-container type it's hard to see how a dealloc could ever be interrupted or interleaved by a GC. So this note is probably more a request for clarification (addition to documentation somewhere?) why this is not a problem than a diagnosis of a bug.

robertwb · 2013-01-03T19:05:48Z

comment:20

I think it helps to look at the generated code. Suppose one has

cdef class A: ...
cdef class B(A): ...
cdef class C(B): ...
...

In this case one has, roughly,

tp_dealloc_A(self) {
   [optional untrack]
   bodyA
   [optional track]
   Py_TYPE(self)->tp_free(self)
}

tp_dealloc_B(self) {
   [optional untrack]
   bodyB
   [optional track]
   tp_dealloc_A(self)
}

tp_dealloc_C(self) {
   [optional untrack]
   bodyC
   [optional track]
   tp_dealloc_B(self)
}

...

bodyX consists of decrefing Python members, traversing weakrefs, and (if present)

Py_REFCNT(self)++;
X.__dealloc__(self);
Py_REFCNT(self)--;

The track/untrack markers are added exactly when Python/weakref members are present, which is where a garbage collection might happen. (When executing __dealloc__ the refcount is incremented, also preventing garbage collection.)

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.
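The chain sketched above can be mimicked with a toy model to see why a collection triggered inside a body never visits the half-destroyed object. This is not the code Cython generates; `FakeObject`, `fake_collect`, `make_dealloc` and the `tracked` attribute are hypothetical stand-ins for the real GC bookkeeping.

```python
class FakeObject:
    def __init__(self):
        self.tracked = True   # stands in for membership in the GC's list
        self.freed = False


def fake_collect(obj, log):
    # A real collector only traverses tracked objects.
    log.append("gc visits object" if obj.tracked else "gc skips object")


def tp_free(obj, log):
    obj.tracked = False       # the final free untracks a GC type
    obj.freed = True
    log.append("tp_free")


def make_dealloc(name, base, has_python_members):
    def dealloc(obj, log):
        if has_python_members:
            obj.tracked = False        # [optional untrack]
        log.append("body" + name)
        if has_python_members:
            fake_collect(obj, log)     # e.g. triggered by a weakref callback
            obj.tracked = True         # [optional track]
        base(obj, log)
    return dealloc


dealloc_A = make_dealloc("A", tp_free, has_python_members=True)
dealloc_B = make_dealloc("B", dealloc_A, has_python_members=False)
dealloc_C = make_dealloc("C", dealloc_B, has_python_members=True)

obj, log = FakeObject(), []
dealloc_C(obj, log)
print(log)
```

In the model, B has no Python or weakref members, so it neither untracks nor can it trigger a collection; the levels that can trigger one always run their body untracked.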

jpflori · 2013-01-03T19:13:12Z

comment:21

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.

That is exactly what I was thinking about, and IIRC what is looked for in the CPython subtype_dealloc when looking for the base type.

If you say it cannot happen in Cython, I'm very happy with that!

jpflori · 2013-01-03T20:35:22Z

comment:22

Are you sure this is the case, e.g., for category_object and sage_object?
I see a TPFLAGS_HAVE_GC on the former but not on the latter.

nbruin · 2013-01-03T21:09:37Z

Robert's Cython test case (I have twice spent quite some time finding it, so I'm storing it here for future reference)

vbraun · 2013-01-03T21:21:15Z

comment:23

Attachment: double_dealloc_T796.pyx.gz

And Robert just released Cython 0.17.4, see https://groups.google.com/d/topic/cython-users/s3ycj83Yctw/discussion

robertwb · 2013-01-03T21:47:45Z

comment:24

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4.spkg

jdemeyer · 2013-01-04T09:38:34Z

comment:25

Typo in the version number:

=== cython-0.17.3 (Robert Bradshaw, 3 January 2013) ===

should be

=== cython-0.17.4 (Robert Bradshaw, 3 January 2013) ===

jdemeyer · 2013-01-04T09:38:34Z

Author: Robert Bradshaw

jdemeyer · 2013-01-04T12:57:52Z

comment:26

Fixed SPKG.txt.

jdemeyer · 2013-01-04T12:57:52Z

Reviewer: Jeroen Demeyer

jdemeyer · 2013-01-04T13:00:23Z

Changed upstream from Reported upstream. Developers acknowledge bug. to Completely fixed; Fix reported upstream

robertwb · 2013-01-04T18:26:11Z

comment:28

D'oh. Thanks.

jdemeyer · 2013-01-07T20:58:26Z

Merged: sage-5.6.beta3

jdemeyer · 2013-01-10T09:42:07Z

comment:30

I have not seen any more segmentation faults regarding #715, so this might have fixed it.

vbraun · 2013-01-10T09:55:38Z

comment:31

Yay! Congratulations to everybody and a special thanks to Simon for pushing the weak caches!

nbruin added this to the sage-5.6 milestone Jan 1, 2013

nbruin added t: bug labels Jan 1, 2013

nbruin assigned rlmill Jan 1, 2013

jpflori added p: blocker / 1 and removed p: major / 3 labels Jan 2, 2013

robertwb added the s: needs review label Jan 3, 2013

jdemeyer added s: needs work and removed s: needs review labels Jan 4, 2013

jdemeyer added s: positive review and removed s: needs work labels Jan 4, 2013

jdemeyer removed the s: positive review label Jan 7, 2013

jdemeyer closed this as completed Jan 7, 2013

williamstein mentioned this issue Feb 7, 2013

rewrite conway polynomial spkg and code in Sage library to not use ZODB #12205

Closed

jpflori mentioned this issue Feb 9, 2013

Configure Python with pydebug when SAGE_DEBUG is set #13864

Closed

nbruin mentioned this issue Dec 29, 2022

Fix cython's deep C-stacks upon deallocation #13901

Open

This was referenced Feb 28, 2013

Better deletion of items of TripleDict #13904

Closed

Fix inspection of interactive Cython code #13916

Closed
