Pickling AtomGroups #293

mnmelo · 2015-06-01T14:55:20Z

Universe pickling might be somewhat tricky (#173), and unpickling might be a relatively heavy operation if it involves reloading trajectories.
I've run into the need to pickle Universes mostly as a side effect of actually wanting to pickle AtomGroups, in my case as a return from a multiprocessing.pool.map.

My idea was to provide lightweight pickling and unpickling functions that'd do the following:

Pickling: only atom indices are stored, together with some sort of heuristic for Universe identification (total number of atoms and trajectory filename, for instance).
Unpickling: run through globals() looking for a Universe that matches the heuristic. Recreate AtomGroup from the indices. (Raise an error if no suitable Universe is found).

Search could be sped up by saving with the pickle the Universe names that have the same id as the AtomGroup's, and then looking at those names first when unpickling:

universe_names = []
key, val = None, None #Extremely unpythonic, but these can't be created on the for loop otherwise the globals() dict gets changed.  
for key,val in globals().iteritems():
    if val is atom_group.universe:
        universe_names.append(key)

The whole thing feels quite unpythonic, and granted can be easily worked around by just passing back and forth atom index lists instead of AtomGroups. But it'd make MDAnalysis work better and more natively with other code that requires pickling.

Opinions?

The text was updated successfully, but these errors were encountered:

mnmelo · 2015-06-01T15:14:31Z

Correction: I guess the Universe should be looked up in locals() instead.

richardjgowers · 2015-06-01T15:22:26Z

Can't an AtomGroup access self.universe.filename and self.universe.trajectory.filename to rebuild the Universe it started from? Maybe it also needs self.universe.trajectory.frame too, to make sure that it's at the same frame?

richardjgowers · 2015-06-01T15:31:37Z

And I think the heuristic for Universe could even be some sort of __hash__ for the Universe, something like hash(u.filename) + hash(u.trajectory.filename)

mnmelo · 2015-06-01T15:36:22Z

Yes, that's what I was suggesting for the heuristic, but my point is that rebuilding the Universe is probably slow and even unnecessary if one already is loaded in the unpickling scope.

richardjgowers · 2015-06-01T15:54:23Z

Hmm yeah, searching locals for the Universe makes most sense... but then if many AtomGroups refer to the same Universe, and they want different frames.. won't this break things? Ie.. don't AtomGroups want a "lock" on the Universe frame?

mnmelo · 2015-06-01T16:00:31Z

Hmm... Should they carry such locks? My view of AtomGroups is that they're independent of the frame. Would users expect the AtomGroup.positions to point to those of the originating frame?
I think the current interface makes it quite clear that it's up to the user to know at which frame the Universe.trajectory is at, and that AtomGroups reflect the positions of the current frame.

richardjgowers · 2015-06-01T16:19:19Z

Sorry yeah, I'm imagining you want to parallelise by doing AtomGroups individually.. But if you then wanted to iterate the trajectory for each AG, won't this go wrong?

for ag in ags_to_do:  # if you pool here..
    for ts in u.trajectory:  # then this will be shared amongst all AGs
        # process ag

So I guess you just need to make sure that nobody changes the frame, so you need to order your loops like this:

for ts in u.trajectory:  # iterate the trajectory outside the parallelism
    for ag in ags_to_do:  # split work here
        # process ag

I think you're right so long as the loops are ordered correctly.

mnmelo · 2015-06-03T19:31:36Z

Just tried out some code and it turns out that what I'm proposing is not going to fly because even globals() will only allow access to module-level globals.

This means that the following code won't work:

syst = MDAnalysis.Universe("conf.gro")
ag = syst.atoms[:10]
pkl = pickle.dumps(ag) #pickling can generate a hash from filenames and number of atoms
                       #but can't look through the main module's variables
                       #to get the names of the AtomGroup's Universe

ag2 = pickle.loads(pkl) #unpickling takes place in the AtomGroup class
                        #and only variables at that module level are visible through globals();
                        #'syst' will never be found.

Introspection, via the inspect package, might work, but is extremely hackish, not to mention probably unportable.

But I just came up with an idea that might work:
We could keep a list in the MDAnalysis module of weakrefsto every Universe that is created. Unpickling just has to go through that list. No name guessing would be involved.

richardjgowers · 2015-06-03T20:03:42Z

So as Universes are created, they're stored in the MDAnalysis module that has been imported? So as AtomGroups are unpickled they search through the local MDAnalysis scope looking for Universes? Sounds sane enough, but I'm not sure how this is all "meant" to be done.

mnmelo · 2015-06-03T20:09:16Z

Exactly. I just came across the weakref module and it seems to do all we need.

mnmelo · 2015-06-04T01:04:41Z

It does work (yay!), with an issue:
While the weakrefs can deal with Universes being deleted or going out of scope, they will still point to a valid object until garbage is collected.
The problem with this is that it may happen that an unpickled AtomGroup gets bound to a stale Universe that matches the control heuristic. This might happen if a user rebuilds the same Universe repeatedly (is this a realistic use case?).

A solution is to call gc.collect() manually when unpickling. I'm not sure if this is desirable. I've pushed a branch feature-pickleAtomGroup (no unit testing yet) that works fine with this approach. Please comment.

An alternative would be to register a __del__ function with the Universe class that removes the weak reference from the built Universe list when the Universe gets deleted. This turns out to be a major pain, as @orbeckst dutifully warns in AtomGroup.py:

    # NOTE: DO NOT ADD A __del__() method: it somehow keeps the Universe
    #       alive during unit tests and the unit tests run out of memory!
    #### def __del__(self): <------ do not add this! [orbeckst]

For reference, the problem here seems to be that when doing this we'll have Universe and Reader instances mutually referencing each other while also having __del__ destructors. Python is then unable to decide on a cleanup order since it does not know which attributes are needed by each __del__ function.
This is probably matter for another issue. Maybe it's simple to prevent this reference cycle, but I couldn't find where the Reader instance references the Universe.

richardjgowers · 2015-06-04T07:10:50Z

Re: Universe.__del__, I've always meant to try and reimplement that, it should be possible!

Your feature branch looks good, but you need to merge develop into it so that it has all the travis (automatic testing) settings. Then if you open a pull request, it'll do lots of cool stuff automagically, eg:

#294

mnmelo · 2015-06-16T10:39:16Z

I think I can imagine an application that breaks the auto assignment of a universe when unpickling AtomGroups:

Let's say you want to do an all-to-all trajectory calculation. You load two universes with the same topology and filename. If you then pickle/unpickle an AtomGroup you're in trouble, irrespective of the __del__ issues mentioned earlier, because there's no way to automatically know which Universe you want it on. You also can't specify it when unpickling, so you're stuck with checking AtomGroup.universe is desired_universe afterwards.

My solution to this would be the following:
Add an optional flag is_atomgroup_anchor=True to the initialization of the Universe. Only Universes that are anchors will be added to that module-level weakref list.
It is then up to the user to set is_atomgroup_anchor=False on Universes they know might conflict with the desired one. Still not failsafe (what if you want some AtomGroups in universe1 and some in universe2?) but I think we can live with those fringe cases.

In addition, Universe methods set_atomgroup_anchor and unset_atomgroup_anchor can be added for when a user wants to add/remove the Universe to/from the anchor list, for finer control after Universe creation. This also puts the burden on the user to unset the anchor on Universes that are going out of scope but might become problematic later if an unpickling occurs before the garbage collection.
None of this should break anything, and even the simpler (and probably more common) cases of pickling/unpickling AtomGroups would happen without the user even having to explicitly use any of the anchoring features.
So, shall I go ahead and implement this?

orbeckst · 2015-06-16T22:37:19Z

Maybe I am missing key points after skimming the thread but it seems that what you want is a unique ID for each Universe so that if you re-instantiate a universe with the same ID you can attach the same AtomGroups again.

I might not get the use case that you have in mind but it somewhat reminds me of issues that @dotsdl is solving with his MDSynthesis persistence framework in a different manner.

An implementation might be worthwhile, just to show how it works. As @richardjgowers said, make sure you rebase against the latest develop and then generate a pull request. It will get unit tested by travis and provides a convenient frame work for code discussion.

…verses AtomGroups can now be pickled/unpickled (closes #293) A weakref system was implemented so that Universes can have a __del__ method without memleaking (closes #297) ChainReaders no longer leak (and also no longer have a __del__ method) (closes #312) Tests now specifically look for uncollectable objects at the end. Breaks parallelization, though.

… (PR #2893) * AtomGroup now are pickled/unpickled without looking for its anchored Universe * removed core.universe._ANCHOR_UNIVERSES (originally introduced with #293) and related machinery to keep global state. No global state is kept any more. * update docs for atomgroup pickle * update CHANGELOG * update tests

… (PR MDAnalysis#2893) * AtomGroup now are pickled/unpickled without looking for its anchored Universe * removed core.universe._ANCHOR_UNIVERSES (originally introduced with MDAnalysis#293) and related machinery to keep global state. No global state is kept any more. * update docs for atomgroup pickle * update CHANGELOG * update tests

mnmelo added enhancement Priority-Low labels Jun 1, 2015

orbeckst added the persistence label Jun 2, 2015

mnmelo mentioned this issue Jun 4, 2015

__del__ destructor for Universes? #297

Closed

orbeckst assigned mnmelo Jun 16, 2015

mnmelo mentioned this issue Jun 20, 2015

Travis failing intermittently #312

Closed

mnmelo mentioned this issue Jun 21, 2015

Lightweight serialization of AtomGroups by reutilization of built Universes #321

Merged

richardjgowers closed this as completed in #321 Jun 22, 2015

jdetle mentioned this issue May 23, 2016

Diffusion Map Implementation in MDAnalysis #857

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pickling AtomGroups #293

Pickling AtomGroups #293

mnmelo commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 3, 2015

richardjgowers commented Jun 3, 2015

mnmelo commented Jun 3, 2015

mnmelo commented Jun 4, 2015

richardjgowers commented Jun 4, 2015

mnmelo commented Jun 16, 2015

orbeckst commented Jun 16, 2015

Pickling AtomGroups #293

Pickling AtomGroups #293

Comments

mnmelo commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 1, 2015

richardjgowers commented Jun 1, 2015

mnmelo commented Jun 3, 2015

richardjgowers commented Jun 3, 2015

mnmelo commented Jun 3, 2015

mnmelo commented Jun 4, 2015

richardjgowers commented Jun 4, 2015

mnmelo commented Jun 16, 2015

orbeckst commented Jun 16, 2015