Garbage collection issues in parallel #1617
Comments
I've seen this happen with both firedrake and FEniCS. The way I deal with this is to do […]
If tediously removing reference cycles will fix the problem then I'm happy to help, as long as it can be done piecemeal. Can you list some of the affected classes? I imagine the reference cycles were there in the first place for a reason -- will removing them break userspace?
I'd be willing to help as well if it's just tedious work. I am interested in learning more about the pyop2-firedrake toolchain. What would be a good place to start? Which tools do you guys use for detecting memory leaks in Python?
The following construct might be useful for regression testing: […]
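The original snippet was not captured. As a hedged sketch of what such a regression-test construct could look like, one can disable the cyclic collector and check that an object is reclaimed by reference counting alone (`dies_by_refcounting` is a hypothetical helper, not from this thread):

```python
import gc
import weakref

def dies_by_refcounting(factory):
    """Return True if the object made by factory() is freed by pure
    refcounting, i.e. without help from the cyclic collector."""
    gc.disable()          # only refcounting can reclaim objects now
    try:
        ref = weakref.ref(factory())
        # the temporary has no strong references left; if it is gone,
        # refcounting alone was enough (no cycle kept it alive)
        return ref() is None
    finally:
        gc.enable()

class Clean:
    pass

class Cyclic:
    def __init__(self):
        self.me = self    # deliberate reference cycle

assert dies_by_refcounting(Clean)
assert not dies_by_refcounting(Cyclic)
```

A test like this could guard each class that holds a PETSc object, failing as soon as someone reintroduces a cycle.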
Some more thoughts on this. The first question is what type of refcycles we need to care about. Some non-exhaustive experimenting and reading suggests the following statement is true: if I have an object […]

This means that what we need is that any object which holds on to (or is referred to by) a PETSc object cannot be part of a reference cycle, but it is fine for it to hold on to objects that are themselves part of reference cycles, as long as those objects are safe to collect "out of order".

The first place to start is in PyOP2, because all of Firedrake builds on top of that. There are a number of places here where we create reference cycles because we had originally envisaged that PyOP2 would be programmed "by a human". So, for example, a DataSet is "object-cached" on a Set, and so forth.

PyOP2 itself is a bit of a mess; I think a good first step would be to split the […]

We can then think about refactoring the places in Firedrake that build these PyOP2 objects to maintain the caches "one-way" in Firedrake itself, rather than via ref-cycles. Plausibly we don't need this, but I am not sure.

The biggest thing we need to handle is that we use DM objects to pass information back and forth between Firedrake and solvers (inside PETSc). We're reasonably good at cleaning those up after the fact, but there are places where I think we probably DTWT.
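As an illustration of the kind of change involved, a back-reference can be made weak so that a parent/child pair no longer forms a cycle. The class names below are illustrative stand-ins, not PyOP2's actual API:

```python
import gc
import weakref

class Set:                 # illustrative stand-in, not the PyOP2 class
    pass

class DataSet:
    def __init__(self, set_):
        self._set = weakref.ref(set_)   # weak back-reference: no cycle

    @property
    def set(self):
        s = self._set()
        if s is None:
            raise ReferenceError("underlying Set was collected")
        return s

gc.disable()               # prove refcounting alone suffices
s = Set()
d = DataSet(s)
assert d.set is s
del s                      # the DataSet does not keep the Set alive...
del d                      # ...and neither object needed the cyclic collector
gc.enable()
```

Whether a weak reference is appropriate depends on the desired ownership direction; the alternative mentioned above is to keep the strong child-to-parent reference and move the cache out of the parent into a one-way structure.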
Further circular references are generated within pyadjoint, in […]
This is now fixed by #2599.
Introduction
PETSc C objects are managed with a refcounting collector. Destruction is collective over the communicator they were created on.
Some (many) Firedrake objects hold on to PETSc objects. Firedrake objects are managed by the Python garbage collector which has both a refcounting collector (for things with no refcycles) and a generational mark-and-sweep collector (for things with cycles).
Problem
The Python garbage collector has no concept of collective semantics over an MPI communicator. If the "same" object on different processes for some reason gets collected in a different order, we can end up with deadlocks in the PETSc destructors.
If objects are only ever collected by the refcounting part of the collector, things are mostly OK: refcount-driven destruction happens in program order, which is ordinarily the same on every process. You could still write code that collects objects in different orders on different ranks, but it's unlikely.
On the other hand, if objects are collected by the generational collector, it is basically "only a matter of time" before things break: when that collector runs (and what it collects) depends on allocation pressure, which can differ between ranks.
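The difference between the two collection paths can be seen directly. In Firedrake, the `__del__` below would be a collective PETSc destroy; the `Holder` class is purely illustrative:

```python
import gc

destroyed = []

class Holder:
    def __del__(self):
        # in Firedrake this would call a *collective* PETSc destructor
        destroyed.append("gone")

gc.disable()   # make collection timing deterministic for this demo

# acyclic object: reclaimed immediately by refcounting, in program order,
# so every MPI rank would reach this point at the same time
h = Holder()
del h
assert destroyed == ["gone"]

# cyclic object: only the generational collector can reclaim it, and when
# that runs in a real program depends on allocation pressure
c = Holder()
c.me = c
del c
assert destroyed == ["gone"]          # not collected yet
gc.collect()
assert destroyed == ["gone", "gone"]  # collected only when the gc runs

gc.enable()
```

If two ranks reach `gc.collect()` (implicitly, via the automatic collector) at different times, or collect objects in different orders, the collective destructors deadlock.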
Unfortunately, Firedrake holds cyclic references everywhere. A first step to addressing this problem would be to audit all the code and remove all the refcycles. This is quite hard work, but kind of mostly just tedious. Lots of the cycles actually live in PyOP2.
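The standard library's `gc` module can help with the audit: with `gc.DEBUG_SAVEALL` set, everything the cyclic collector would have freed is kept in `gc.garbage` instead, so one can see which types participate in cycles. This is a sketch, not an existing Firedrake tool:

```python
import gc

class Cyclic:
    def __init__(self):
        self.me = self     # deliberate cycle for demonstration

gc.collect()               # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)

Cyclic()                   # immediately unreachable, but cyclic
n = gc.collect()           # returns the number of unreachable objects found
assert n > 0

cyclic_types = {type(o).__name__ for o in gc.garbage}
assert "Cyclic" in cyclic_types

gc.set_debug(0)            # restore normal collector behaviour
gc.garbage.clear()
```

Running something like this after constructing each kind of Firedrake object would show which classes still need their cycles removed.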
Possible solutions
Push explicit lifetime management of Firedrake objects (`XXX.destroy()`) into the user API. This is "simple" but pollutes user code significantly.
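If explicit lifetime management were adopted, a context manager could at least make it less error-prone. Here `Handle`, `destroy`, and `managed` are hypothetical sketches, not Firedrake API; `destroy` stands in for a PETSc-style collective destructor:

```python
from contextlib import contextmanager

class Handle:
    """Stand-in for a Firedrake object wrapping a PETSc object."""
    def __init__(self):
        self.alive = True

    def destroy(self):
        # in real code this would be a collective PETSc destroy, so it
        # must be reached at the same point on every MPI rank
        self.alive = False

@contextmanager
def managed(handle):
    try:
        yield handle
    finally:
        handle.destroy()   # deterministic destruction, in program order

with managed(Handle()) as h:
    assert h.alive
assert not h.alive         # destroyed exactly at the end of the block
```

The `with` block guarantees destruction happens at the same program point on every rank, at the cost of making object lifetimes part of the user's code.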
Your option here.