Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization of AtomGroup #2893

Merged
merged 13 commits into from
Aug 21, 2020

Conversation

yuxuanzhuang
Copy link
Contributor

@yuxuanzhuang yuxuanzhuang commented Aug 9, 2020

Fixes #

Changes made in this Pull Request:

  • AtomGroup now is pickled without finding its anchored Universe

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@yuxuanzhuang
Copy link
Contributor Author

yuxuanzhuang commented Aug 9, 2020

I may miss the point in #321 about lightweight serialization. Don't we always have to serialize (or recreate) a Universe in the new process? In that regard, there's no literal difference.

And due to the nature of pickle, one Universe instance will only be pickled once during serialization. So in the new process, all AtomGroup will still be bound to the same Universe (not the old one though, but I don't think it is a problem).

One advantage of it is: for reference AtomGroup (that is independent from the main Universe) in e.g. RMSD, we don't need to worry about it not being able to pickled because of the missing of its bound Universe.

Besides, another advantage to ditch anchoring is: we no longer need to worry about the order of Universe and AtomGroup during serialization:

>>> def print_id(ag, u):
>>>     print('id u:',id(u))
>>>     print('id ag._u:',id(ag._u))
>>>     print('')


>>> ag = u.atoms[2:5]
>>> print('before pickle:')
>>> print('id u:',id(u))
>>> print('id ag_u:',id(ag._u))
>>> print('')
>>> print('after pickle:')


>>> p = multiprocessing.Pool(2)
>>> i = 0
>>> res = np.array([p.apply(print_id, args=(ag, u))  #  this won't work in the old version with AtomGroup before Universe.
>>>             for ts in u.trajectory[:3]])
>>> p.close()

before pickle:
id u: 139955715000496
id ag_u: 139955715000496

after pickle:
id u: 139956140148912
id ag._u: 139956140148912

id u: 139956140148912
id ag._u: 139956140148912

id u: 139955676441232
id ag._u: 139955676441232

old version:

before pickle:
id u: 139990197591440
id ag_u: 139990197591440

after pickle:
id u: 139990190787024
id ag._u: 139990197591440

id u: 139990190788224
id ag._u: 139990197591440

id u: 139990202526352
id ag._u: 139990190787024

@pep8speaks
Copy link

pep8speaks commented Aug 12, 2020

Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-19 08:56:48 UTC

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@codecov
Copy link

codecov bot commented Aug 12, 2020

Codecov Report

Merging #2893 into develop will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2893      +/-   ##
===========================================
+ Coverage    92.91%   92.93%   +0.02%     
===========================================
  Files          187      187              
  Lines        24591    24548      -43     
  Branches      3185     3182       -3     
===========================================
- Hits         22848    22814      -34     
+ Misses        1697     1688       -9     
  Partials        46       46              
Impacted Files Coverage Δ
package/MDAnalysis/__init__.py 91.89% <ø> (-0.61%) ⬇️
package/MDAnalysis/core/groups.py 98.58% <100.00%> (-0.01%) ⬇️
package/MDAnalysis/core/universe.py 97.45% <100.00%> (+0.25%) ⬆️
__init__.py 91.89% <0.00%> (-0.61%) ⬇️
package/MDAnalysis/coordinates/base.py 95.12% <0.00%> (+0.48%) ⬆️
package/MDAnalysis/due.py 75.00% <0.00%> (+8.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23074c6...c82c487. Read the comment docs.

@orbeckst orbeckst added the GSoC GSoC project label Aug 12, 2020
@orbeckst orbeckst self-requested a review August 12, 2020 16:35
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditching the anchors makes sense, now that we can serialize universes. Your new approach is more streamlined and transparent than keeping state in _ANCHOR_UNIVERSES.

But I don't understand why we still need

package/MDAnalysis/__init__.py:_ANCHOR_UNIVERSES = weakref.WeakValueDictionary()

and any of the other occurrences of _ANCHOR_UNIVERSES: can't we get rid of all of them now?

EDIT: Or do we keep anchoring to avoid having duplicated universes?

EDIT 2: ... yes.

@yuxuanzhuang
Copy link
Contributor Author

I am not sure how it helps to prevent duplicated universes, could you elaborate a bit more?

Also I am not certain whether it's fine to ditch the anchor completely--all the tests still pass without that.

With that said, in #297, it also mentioned that it helps the mem leaks issue. (is that tested anywhere?)

@orbeckst
Copy link
Member

orbeckst commented Aug 14, 2020

We used to be able to test for memleaks when we were using nose for testing but didn't manage to port it to pytest. I don't remember why. So we are not really testing for memleaks – this would be a useful thing to look at.

I don't think that the ANCHOR_UNIVERSES were related to #297, though, that was more about weakrefs inside Universe. The ANCHOR_UNIVERSES came from PR #321.

I am not sure how it helps to prevent duplicated universes, could you elaborate a bit more?

You're right, it doesn't. I had thought that serializing a universe and unpacking it in the following way

import pickle
import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.PSF, data.DCD)
u_copy = pickle.loads(pickle.dumps(u))

would make u and u_copy magically the same

u is u_copy    # nope

but that's False.

When doing it with a group g = u.atoms then the new group g_copy = loads(dumps(g)) correctly attaches to the old universe: (u is g_copy.universe) is True but that also works with your new approach, right?

@orbeckst
Copy link
Member

@yuxuanzhuang , with this PR, what does the following code do?

import pickle
import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.PSF, data.DCD)
g = u.atoms
h = u.atoms[:10]
g_copy = pickle.loads(pickle.dumps(g))
h_copy = pickle.loads(pickle.dumps(h))

print("g_copy.universe:", u is g_copy.universe)
print("h_copy:", u is h_copy.universe)

I.e., do groups attach to their own universe if it is already present?

I vaguely remember that you had already checked that previously but just wanted to confirm.

@orbeckst
Copy link
Member

Assuming all looking good then try removing any remnants of ANCHOR_UNIVERSE.

@yuxuanzhuang
Copy link
Contributor Author

yuxuanzhuang commented Aug 14, 2020

No it is not after ditching anchoring.

import pickle
import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.PSF, data.DCD)
g = u.atoms
h = u.atoms[:10]
g_copy = pickle.loads(pickle.dumps(g))
h_copy = pickle.loads(pickle.dumps(h))

print("g_copy.universe is u:", u is g_copy.universe)
print("h_copy.universe is u:", u is h_copy.universe)
print("g_copy.universe is h_copy.universe", g_copy.universe == h_copy.universe)

g_copy.universe is u: False
h_copy.universe is u: False
g_copy.universe is h_copy.universe False

g_copy, h_copy, u_copy = pickle.loads(pickle.dumps([g, h, u]))

print("g_copy.universe is u_copy:", u_copy is g_copy.universe)
print("h_copy.universe is u_copy:", u_copy is h_copy.universe)
print("g_copy.universe is h_copy.universe", g_copy.universe == h_copy.universe)

g_copy.universe is u_copy: True
h_copy.universe is u_copy: True
g_copy.universe is h_copy.universe True

They have to be pickled together. Is there any use cases for the former condition?

@yuxuanzhuang
Copy link
Contributor Author

No performance difference is observed when using pmda:

u = mda.Universe('serialize_mda/benchmarking/YiiP_system.pdb','serialize_mda/benchmarking/YiiP_system_90ns_center.xtc')
ref = mda.Universe('serialize_mda/benchmarking/YiiP_system.pdb')
rmsd_ana = pmda.rms.RMSD(u.atoms,ref.atoms)
rmsd_ana.run(n_jobs=12, n_blocks=12)
print(rmsd_ana.timing._total)

Before: 26.77
Afrer: 26.20

@orbeckst
Copy link
Member

Thanks for the illuminating examples.

I've been thinking more about the question if unpickling an AtomGroup should attach to an existing Universe or create a new one.

Is there any use cases for the former condition?

I don't really envisage a use where I would keep serializing separate AtomGroups but users come up with all kinds of creative solutions so I can see people getting confused if they find that two AtomGroups are now diverging in their history (and don't update when iterating over the trajectory, for instance).

On the other hand, I would say from a Python point of view, serialize-deserialize should behave like a deepcopy and with reattaching to existing universes we break this expectation. Furthermore, carrying global state around in ANCHOR_UNIVERSES is awkward.

With all of this I am now strongly in favor of removing all ANCHOR_UNIVERSES stuff and be very explicit about what happens when we serialize:

  • You always get an independent Universe.
  • If you have multiple related objects then you must serialize them in one go (e.g. as a tuple).

If the docs don't say this already explicitly then that should be added.

@MDAnalysis/coredevs any thoughts?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of removing _ANCHOR_UNIVERSES.

Please add documentation that makes clear how ags are to be pickled.

@orbeckst orbeckst self-assigned this Aug 21, 2020
@orbeckst orbeckst merged commit 2f0381e into MDAnalysis:develop Aug 21, 2020
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
… (PR MDAnalysis#2893)

* AtomGroup now are pickled/unpickled without looking for its anchored Universe
* removed core.universe._ANCHOR_UNIVERSES (originally introduced with MDAnalysis#293) and related machinery to keep global state. No global state is kept any more.
* update docs for atomgroup pickle
* update CHANGELOG
* update tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants