Garbage collection issues in parallel #1617

wence- · 2020-02-27T13:48:02Z

Introduction

PETSc C objects are managed with a refcounting collector. Destruction is collective over the communicator they were created on.

Some (many) Firedrake objects hold on to PETSc objects. Firedrake objects are managed by the Python garbage collector which has both a refcounting collector (for things with no refcycles) and a generational mark-and-sweep collector (for things with cycles).

Problem

The Python garbage collector has no concept of collective semantics over an MPI communicator. If the "same" object on different processes for some reason gets collected in a different order, we can end up with deadlocks in the PETSc destructors.

If objects are only ever collected by the refcounting part of the collector, things are kind of OK. You could get into a situation where you wrote code that produced a bad problem here (but it's unlikely).

On the other hand, if objects are collected by the generational collector, basically it's "only a matter of time" before things break.

Unfortunately, Firedrake holds cyclic references everywhere. A first step to addressing this problem would be to audit all the code and remove all the refcycles. This is quite hard work, but kind of mostly just tedious. Lots of the cycles actually live in PyOP2.

Possible solutions

Push explicit lifetime management of Firedrake objects `XXX.destroy()` into the user API.

This is "simple" but pollutes user code significantly.

Your option here.

The text was updated successfully, but these errors were encountered:

pefarrell · 2020-02-27T13:57:39Z

I've seen this happen with both firedrake and FEniCS.

The way I deal with this is to do import gc; gc.disable() at the start of my scripts, and manually call gc.collect() at various points (typically after solves). I don't understand the Python garbage collection semantics well enough to guarantee that this fixes the problem, but it goes some way towards addressing it.

danshapero · 2020-03-03T00:24:48Z

If tediously removing reference cycles will fix the problem then I'm happy to help, as long as it can be done piecemeal. Can you list some of the affected classes? I imagine the reference cycles were there in the first place for a reason -- will removing them break userspace?

salazardetroya · 2020-03-03T19:17:04Z

I'd be willing to help as well if it's just tedious work. I am interested in learning more about the pyop2-firedrake toolchain. What would be a good place to start? Which tools do you guys use for detecting memory leaks in python?

jrmaddison · 2020-03-04T10:08:02Z

The following construct might be useful for regression testing

import gc
import weakref


class Test:
    pass


refcount_only = (Test,)
refcount_only_refs = []
refcount_error = [False]


def gc_callback(phase, info):
    if phase == "start":
        for obj in gc.get_objects():
            if isinstance(obj, refcount_only):
                refcount_only_refs.append(weakref.ref(obj))
    else:
        # assert phase == "stop"
        for ref in refcount_only_refs:
            # If ref() is None then the object was collected by the garbage
            # collector, which is forbidden
            if ref() is None:
                refcount_error[0] = True
        refcount_only_refs.clear()


gc.callbacks.append(gc_callback)

wence- · 2020-04-07T14:24:47Z

Some more thoughts on this. The first question is what type of refcycles we need to care about.

Some non-exhaustive experimenting and reading suggests the following statement is true.

If I have an object x which contains a reference cycle, and refer to it via an object y, then x will be collected by the generational collector, but y will be collected by the refcounting collector (which is what we want).

This means that what we need is that any object which holds on to (or is referred to) by a PETSc object cannot be part of a reference cycle, but it is fine for it to hold on to objects that are themselves part of reference cycles as long as those objects are safe to collect "out of order".

The first place to start is in PyOP2, because all of Firedrake builds on top of that. There are a number of places here where we create reference cycles because we had originally envisaged that PyOP2 would be programmed "by a human". So, for example, a DataSet is "object-cached" on a Set and so forth.

PyOP2 itself is a bit of a mess, I think a good first step would be to split the base.py and petsc_base.py files into a data_structures/ subdirectory, with one file for each object exposed in the API. We can then start fixing things slowly there with better overview of the code.

We can then think about refactoring the places in Firedrake that build these PyOP2 objects to maintain the caches "one-way" in Firedrake itself, rather than via ref-cycles. Plausibly we don't need this, but I am not sure.

The biggest thing we need to handle is that we use DM objects to pass information back and forth between Firedrake and solvers (inside petsc). We're reasonably good at cleaning those up after the fact, but there are places where I think we probably DTWT.

jrmaddison · 2021-04-01T10:50:05Z

Further circular references are generated within pyadjoint, in pyadjoint.overloaded_type.OverloadedType.create_block_variable. (EDIT: Identify correct method)

JDBetteridge · 2022-12-02T12:55:07Z

This is now fixed by #2599

krober10nd mentioned this issue Feb 28, 2021

memory leak on repeat call to leapfrog NDF-Poli-USP/spyro#2

Open

connorjward mentioned this issue May 7, 2021

Modernising PyOP2 OP2/PyOP2#615

Closed

JDBetteridge closed this as completed Dec 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Garbage collection issues in parallel #1617

Garbage collection issues in parallel #1617

wence- commented Feb 27, 2020

pefarrell commented Feb 27, 2020

danshapero commented Mar 3, 2020

salazardetroya commented Mar 3, 2020

jrmaddison commented Mar 4, 2020

wence- commented Apr 7, 2020

jrmaddison commented Apr 1, 2021 •

edited

Loading

JDBetteridge commented Dec 2, 2022

Garbage collection issues in parallel #1617

Garbage collection issues in parallel #1617

Comments

wence- commented Feb 27, 2020

Introduction

Problem

Possible solutions

Push explicit lifetime management of Firedrake objects XXX.destroy() into the user API.

Your option here.

pefarrell commented Feb 27, 2020

danshapero commented Mar 3, 2020

salazardetroya commented Mar 3, 2020

jrmaddison commented Mar 4, 2020

wence- commented Apr 7, 2020

jrmaddison commented Apr 1, 2021 • edited Loading

JDBetteridge commented Dec 2, 2022

Push explicit lifetime management of Firedrake objects `XXX.destroy()` into the user API.

jrmaddison commented Apr 1, 2021 •

edited

Loading