Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pickling AtomGroups #293

Closed
mnmelo opened this issue Jun 1, 2015 · 14 comments · Fixed by #321
Closed

Pickling AtomGroups #293

mnmelo opened this issue Jun 1, 2015 · 14 comments · Fixed by #321

Comments

@mnmelo
Copy link
Member

mnmelo commented Jun 1, 2015

Universe pickling might be somewhat tricky (#173), and unpickling might be a relatively heavy operation if it involves reloading trajectories.
I've run into the need to pickle Universes mostly as a side effect of actually wanting to pickle AtomGroups, in my case as a return from a multiprocessing.pool.map.

My idea was to provide lightweight pickling and unpickling functions that'd do the following:

  • Pickling: only atom indices are stored, together with some sort of heuristic for Universe identification (total number of atoms and trajectory filename, for instance).
  • Unpickling: run through globals() looking for a Universe that matches the heuristic. Recreate AtomGroup from the indices. (Raise an error if no suitable Universe is found).

Search could be sped up by saving with the pickle the Universe names that have the same id as the AtomGroup's, and then looking at those names first when unpickling:

universe_names = []
key, val = None, None #Extremely unpythonic, but these can't be created on the for loop otherwise the globals() dict gets changed.  
for key,val in globals().iteritems():
    if val is atom_group.universe:
        universe_names.append(key)

The whole thing feels quite unpythonic, and granted can be easily worked around by just passing back and forth atom index lists instead of AtomGroups. But it'd make MDAnalysis work better and more natively with other code that requires pickling.

Opinions?

@mnmelo
Copy link
Member Author

mnmelo commented Jun 1, 2015

Correction: I guess the Universe should be looked up in locals() instead.

@richardjgowers
Copy link
Member

Can't an AtomGroup access self.universe.filename and self.universe.trajectory.filename to rebuild the Universe it started from? Maybe it also needs self.universe.trajectory.frame too, to make sure that it's at the same frame?

@richardjgowers
Copy link
Member

And I think the heuristic for Universe could even be some sort of __hash__ for the Universe, something like hash(u.filename) + hash(u.trajectory.filename)

@mnmelo
Copy link
Member Author

mnmelo commented Jun 1, 2015

Yes, that's what I was suggesting for the heuristic, but my point is that rebuilding the Universe is probably slow and even unnecessary if one already is loaded in the unpickling scope.

@richardjgowers
Copy link
Member

Hmm yeah, searching locals for the Universe makes most sense... but then if many AtomGroups refer to the same Universe, and they want different frames.. won't this break things? Ie.. don't AtomGroups want a "lock" on the Universe frame?

@mnmelo
Copy link
Member Author

mnmelo commented Jun 1, 2015

Hmm... Should they carry such locks? My view of AtomGroups is that they're independent of the frame. Would users expect the AtomGroup.positions to point to those of the originating frame?
I think the current interface makes it quite clear that it's up to the user to know at which frame the Universe.trajectory is at, and that AtomGroups reflect the positions of the current frame.

@richardjgowers
Copy link
Member

Sorry yeah, I'm imagining you want to parallelise by doing AtomGroups individually.. But if you then wanted to iterate the trajectory for each AG, won't this go wrong?

for ag in ags_to_do:  # if you pool here..
    for ts in u.trajectory:  # then this will be shared amongst all AGs
        # process ag

So I guess you just need to make sure that nobody changes the frame, so you need to order your loops like this:

for ts in u.trajectory:  # iterate the trajectory outside the parallelism
    for ag in ags_to_do:  # split work here
        # process ag

I think you're right so long as the loops are ordered correctly.

@mnmelo
Copy link
Member Author

mnmelo commented Jun 3, 2015

Just tried out some code and it turns out that what I'm proposing is not going to fly because even globals() will only allow access to module-level globals.

This means that the following code won't work:

syst = MDAnalysis.Universe("conf.gro")
ag = syst.atoms[:10]
pkl = pickle.dumps(ag) #pickling can generate a hash from filenames and number of atoms
                       #but can't look through the main module's variables
                       #to get the names of the AtomGroup's Universe

ag2 = pickle.loads(pkl) #unpickling takes place in the AtomGroup class
                        #and only variables at that module level are visible through globals();
                        #'syst' will never be found.

Introspection, via the inspect package, might work, but is extremely hackish, not to mention probably unportable.

But I just came up with an idea that might work:
We could keep a list in the MDAnalysis module of weakrefsto every Universe that is created. Unpickling just has to go through that list. No name guessing would be involved.

@richardjgowers
Copy link
Member

So as Universes are created, they're stored in the MDAnalysis module that has been imported? So as AtomGroups are unpickled they search through the local MDAnalysis scope looking for Universes? Sounds sane enough, but I'm not sure how this is all "meant" to be done.

@mnmelo
Copy link
Member Author

mnmelo commented Jun 3, 2015

Exactly. I just came across the weakref module and it seems to do all we need.

@mnmelo
Copy link
Member Author

mnmelo commented Jun 4, 2015

It does work (yay!), with an issue:
While the weakrefs can deal with Universes being deleted or going out of scope, they will still point to a valid object until garbage is collected.
The problem with this is that it may happen that an unpickled AtomGroup gets bound to a stale Universe that matches the control heuristic. This might happen if a user rebuilds the same Universe repeatedly (is this a realistic use case?).

A solution is to call gc.collect() manually when unpickling. I'm not sure if this is desirable. I've pushed a branch feature-pickleAtomGroup (no unit testing yet) that works fine with this approach. Please comment.

An alternative would be to register a __del__ function with the Universe class that removes the weak reference from the built Universe list when the Universe gets deleted. This turns out to be a major pain, as @orbeckst dutifully warns in AtomGroup.py:

    # NOTE: DO NOT ADD A __del__() method: it somehow keeps the Universe
    #       alive during unit tests and the unit tests run out of memory!
    #### def __del__(self): <------ do not add this! [orbeckst]

For reference, the problem here seems to be that when doing this we'll have Universe and Reader instances mutually referencing each other while also having __del__ destructors. Python is then unable to decide on a cleanup order since it does not know which attributes are needed by each __del__ function.
This is probably matter for another issue. Maybe it's simple to prevent this reference cycle, but I couldn't find where the Reader instance references the Universe.

@richardjgowers
Copy link
Member

Re: Universe.__del__, I've always meant to try and reimplement that, it should be possible!

Your feature branch looks good, but you need to merge develop into it so that it has all the travis (automatic testing) settings. Then if you open a pull request, it'll do lots of cool stuff automagically, eg:

#294

@mnmelo
Copy link
Member Author

mnmelo commented Jun 16, 2015

I think I can imagine an application that breaks the auto assignment of a universe when unpickling AtomGroups:

Let's say you want to do an all-to-all trajectory calculation. You load two universes with the same topology and filename. If you then pickle/unpickle an AtomGroup you're in trouble, irrespective of the __del__ issues mentioned earlier, because there's no way to automatically know which Universe you want it on. You also can't specify it when unpickling, so you're stuck with checking AtomGroup.universe is desired_universe afterwards.

My solution to this would be the following:
Add an optional flag is_atomgroup_anchor=True to the initialization of the Universe. Only Universes that are anchors will be added to that module-level weakref list.
It is then up to the user to set is_atomgroup_anchor=False on Universes they know might conflict with the desired one. Still not failsafe (what if you want some AtomGroups in universe1 and some in universe2?) but I think we can live with those fringe cases.

In addition, Universe methods set_atomgroup_anchor and unset_atomgroup_anchor can be added for when a user wants to add/remove the Universe to/from the anchor list, for finer control after Universe creation. This also puts the burden on the user to unset the anchor on Universes that are going out of scope but might become problematic later if an unpickling occurs before the garbage collection.
None of this should break anything, and even the simpler (and probably more common) cases of pickling/unpickling AtomGroups would happen without the user even having to explicitly use any of the anchoring features.
So, shall I go ahead and implement this?

@orbeckst
Copy link
Member

Maybe I am missing key points after skimming the thread but it seems that what you want is a unique ID for each Universe so that if you re-instantiate a universe with the same ID you can attach the same AtomGroups again.

I might not get the use case that you have in mind but it somewhat reminds me of issues that @dotsdl is solving with his MDSynthesis persistence framework in a different manner.

An implementation might be worthwhile, just to show how it works. As @richardjgowers said, make sure you rebase against the latest develop and then generate a pull request. It will get unit tested by travis and provides a convenient frame work for code discussion.

mnmelo added a commit that referenced this issue Jun 21, 2015
…verses

AtomGroups can now be pickled/unpickled (closes #293)
A weakref system was implemented so that Universes can have a __del__
method without memleaking (closes #297)
ChainReaders no longer leak (and also no longer have a __del__  method)
(closes #312)
Tests now specifically look for uncollectable objects at the end. Breaks
parallelization, though.
orbeckst pushed a commit that referenced this issue Aug 21, 2020
… (PR #2893)

* AtomGroup now are pickled/unpickled without looking for its anchored Universe
* removed core.universe._ANCHOR_UNIVERSES (originally introduced with #293) and related machinery to keep global state. No global state is kept any more.
* update docs for atomgroup pickle
* update CHANGELOG
* update tests
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this issue Mar 30, 2021
… (PR MDAnalysis#2893)

* AtomGroup now are pickled/unpickled without looking for its anchored Universe
* removed core.universe._ANCHOR_UNIVERSES (originally introduced with MDAnalysis#293) and related machinery to keep global state. No global state is kept any more.
* update docs for atomgroup pickle
* update CHANGELOG
* update tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants