Serialization of AtomGroup #2893
Conversation
I may be missing the point in #321 about lightweight serialization. Don't we always have to serialize (or recreate) a Universe …? And due to the nature of …, one advantage of it is: for reference … Besides, another advantage of ditching anchoring is that we no longer need to worry about the order in which the AtomGroup and the Universe are pickled:

```python
>>> import multiprocessing
>>> import numpy as np
>>> # u is assumed to be an MDAnalysis Universe created beforehand
>>> def print_id(ag, u):
...     print('id u:', id(u))
...     print('id ag._u:', id(ag._u))
...     print('')

>>> ag = u.atoms[2:5]
>>> print('before pickle:')
>>> print('id u:', id(u))
>>> print('id ag_u:', id(ag._u))
>>> print('')

>>> print('after pickle:')
>>> p = multiprocessing.Pool(2)
>>> res = np.array([p.apply(print_id, args=(ag, u))  # this won't work in the old version with AtomGroup before Universe.
...                 for ts in u.trajectory[:3]])
>>> p.close()
```

New version (this PR):

```
before pickle:
id u: 139955715000496
id ag_u: 139955715000496

after pickle:
id u: 139956140148912
id ag._u: 139956140148912
id u: 139956140148912
id ag._u: 139956140148912
id u: 139955676441232
id ag._u: 139955676441232
```

Old version:

```
before pickle:
id u: 139990197591440
id ag_u: 139990197591440

after pickle:
id u: 139990190787024
id ag._u: 139990197591440
id u: 139990190788224
id ag._u: 139990197591440
id u: 139990202526352
id ag._u: 139990190787024
```
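For context, here is a minimal sketch of the idea behind this change (toy classes, not the actual MDAnalysis implementation): if an AtomGroup serializes its parent Universe together with its indices, unpickling always yields a self-contained copy and no anchor lookup is needed.

```python
import pickle

class Universe:
    """Toy stand-in for MDAnalysis.Universe (illustration only)."""
    def __init__(self, n_atoms):
        self.n_atoms = n_atoms

class AtomGroup:
    """Toy AtomGroup that pickles its indices together with its Universe."""
    def __init__(self, indices, universe):
        self.indices = indices
        self._u = universe

    def __reduce__(self):
        # The Universe is serialized alongside the indices, so unpickling
        # rebuilds a new, independent Universe instead of looking for an
        # anchored one in global state.
        return (self.__class__, (self.indices, self._u))

u = Universe(100)
ag = AtomGroup([2, 3, 4], u)
ag_copy = pickle.loads(pickle.dumps(ag))
print(ag_copy._u is u)  # False: the copy carries its own Universe
```

The cost is a heavier pickle payload; the benefit is that an AtomGroup can be shipped to a worker process on its own, in any order relative to its Universe.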
Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found that there are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 (Comment last updated at 2020-08-19 08:56:48 UTC)
LGTM
Codecov Report

```
@@            Coverage Diff             @@
##           develop    #2893      +/-   ##
===========================================
+ Coverage    92.91%   92.93%    +0.02%
===========================================
  Files          187      187
  Lines        24591    24548       -43
  Branches      3185     3182        -3
===========================================
- Hits         22848    22814       -34
+ Misses        1697     1688        -9
  Partials        46       46
```

Continue to review the full report at Codecov.
Ditching the anchors makes sense, now that we can serialize universes. Your new approach is more streamlined and transparent than keeping state in _ANCHOR_UNIVERSES.

But I don't understand why we still need

package/MDAnalysis/__init__.py:
```python
_ANCHOR_UNIVERSES = weakref.WeakValueDictionary()
```

and any of the other occurrences of _ANCHOR_UNIVERSES: can't we get rid of all of them now?

EDIT: Or do we keep anchoring to avoid having duplicated universes?

EDIT 2: ... yes.
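For readers who haven't seen the old mechanism: roughly, anchoring kept a module-level weak-value registry of live universes and pickled only the group's indices plus an anchor name. The following is a simplified, hypothetical sketch of that pattern (not the real _ANCHOR_UNIVERSES code):

```python
import pickle
import weakref

# Module-level registry of live Universes. A WeakValueDictionary drops an
# entry automatically once the Universe is garbage collected, so the
# registry itself never keeps universes alive.
_ANCHOR_REGISTRY = weakref.WeakValueDictionary()

class Universe:
    """Toy Universe that registers itself under an anchor name."""
    def __init__(self, anchor_name):
        self.anchor_name = anchor_name
        _ANCHOR_REGISTRY[anchor_name] = self

class AtomGroup:
    """Toy group that re-attaches to an anchored Universe when unpickled."""
    def __init__(self, indices, anchor_name):
        self.indices = indices
        self._u = _ANCHOR_REGISTRY[anchor_name]  # KeyError if no live anchor

    def __reduce__(self):
        # Lightweight payload: only the indices and the anchor name. The
        # receiving side must already hold a matching, live Universe.
        return (self.__class__, (self.indices, self._u.anchor_name))

u = Universe("traj0")
ag = AtomGroup([2, 3, 4], "traj0")
ag_copy = pickle.loads(pickle.dumps(ag))
print(ag_copy._u is u)  # True: the copy re-attached to the existing Universe
```

The pickle stays small and duplicated universes are avoided, but the price is exactly the global state (and the ordering constraints) discussed in this thread.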
I am not sure how it helps to prevent duplicated universes; could you elaborate a bit more? Also, I am not certain whether it is fine to ditch the anchor completely, although all the tests still pass without it. That said, #297 also mentioned that it helps with the memory-leak issue (is that tested anywhere?)
We used to be able to test for memleaks when we were using nose for testing but didn't manage to port it to pytest. I don't remember why. So we are not really testing for memleaks – this would be a useful thing to look at. I don't think that the ANCHOR_UNIVERSES were related to #297, though, that was more about weakrefs inside Universe. The ANCHOR_UNIVERSES came from PR #321.
You're right, it doesn't. I had thought that serializing a universe and unpacking it in the following way

```python
import pickle
import MDAnalysis as mda
from MDAnalysis.tests import datafiles as data

u = mda.Universe(data.PSF, data.DCD)
u_copy = pickle.loads(pickle.dumps(u))

u is u_copy  # nope
```

would make `u is u_copy` hold, but that's not the case. When doing it with a group …
@yuxuanzhuang, with this PR, what does the following code do?

```python
import pickle
import MDAnalysis as mda
from MDAnalysis.tests import datafiles as data

u = mda.Universe(data.PSF, data.DCD)
g = u.atoms
h = u.atoms[:10]
g_copy = pickle.loads(pickle.dumps(g))
h_copy = pickle.loads(pickle.dumps(h))
print("g_copy.universe:", u is g_copy.universe)
print("h_copy:", u is h_copy.universe)
```

I.e., do groups attach to their own universe if it is already present? I vaguely remember that you had already checked that previously but just wanted to confirm.
Assuming all looks good, then try removing any remnants of ANCHOR_UNIVERSE.
No, it is not, after ditching anchoring:

```python
import pickle
import MDAnalysis as mda
from MDAnalysis.tests import datafiles as data

u = mda.Universe(data.PSF, data.DCD)
g = u.atoms
h = u.atoms[:10]
g_copy = pickle.loads(pickle.dumps(g))
h_copy = pickle.loads(pickle.dumps(h))
print("g_copy.universe is u:", u is g_copy.universe)
print("h_copy.universe is u:", u is h_copy.universe)
print("g_copy.universe is h_copy.universe", g_copy.universe == h_copy.universe)
```

```
g_copy.universe is u: False
h_copy.universe is u: False
g_copy.universe is h_copy.universe False
```

```python
g_copy, h_copy, u_copy = pickle.loads(pickle.dumps([g, h, u]))
print("g_copy.universe is u_copy:", u_copy is g_copy.universe)
print("h_copy.universe is u_copy:", u_copy is h_copy.universe)
print("g_copy.universe is h_copy.universe", g_copy.universe == h_copy.universe)
```

```
g_copy.universe is u_copy: True
h_copy.universe is u_copy: True
g_copy.universe is h_copy.universe True
```

They have to be pickled together. Are there any use cases for the former condition?
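The reason pickling everything together preserves the shared Universe is pickle's memo: objects referenced more than once within a single dumps() call are serialized once and restored as one object. A plain-Python illustration (no MDAnalysis involved):

```python
import pickle

shared = {"coords": [1.0, 2.0, 3.0]}
a = {"name": "g", "parent": shared}
b = {"name": "h", "parent": shared}

# Pickled separately: each round trip rebuilds its own copy of 'shared'.
a2 = pickle.loads(pickle.dumps(a))
b2 = pickle.loads(pickle.dumps(b))
print(a2["parent"] is b2["parent"])  # False

# Pickled together: the memo stores 'shared' once, so identity is preserved.
a3, b3 = pickle.loads(pickle.dumps([a, b]))
print(a3["parent"] is b3["parent"])  # True
```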
No performance difference is observed when using pmda:

```python
u = mda.Universe('serialize_mda/benchmarking/YiiP_system.pdb',
                 'serialize_mda/benchmarking/YiiP_system_90ns_center.xtc')
ref = mda.Universe('serialize_mda/benchmarking/YiiP_system.pdb')
rmsd_ana = pmda.rms.RMSD(u.atoms, ref.atoms)
rmsd_ana.run(n_jobs=12, n_blocks=12)
print(rmsd_ana.timing._total)
```

Before: 26.77
Thanks for the illuminating examples. I've been thinking more about the question of whether unpickling an AtomGroup should attach to an existing Universe or create a new one.

I don't really envisage a use case where I would keep serializing separate AtomGroups, but users come up with all kinds of creative solutions, so I can see people getting confused if they find that two AtomGroups are now diverging in their history (and don't update when iterating over the trajectory, for instance). On the other hand, I would say that from a Python point of view, serialize-deserialize should behave like a deepcopy, and by reattaching to existing universes we break this expectation. Furthermore, carrying global state around in ANCHOR_UNIVERSES is awkward. With all of this I am now strongly in favor of removing all the ANCHOR_UNIVERSES machinery and being very explicit about what happens when we serialize: …

If the docs don't say this explicitly already, then that should be added. @MDAnalysis/coredevs, any thoughts?
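As a plain-Python reference point for the deepcopy argument (generic objects, nothing MDAnalysis-specific): a pickle round trip, like copy.deepcopy, always yields a new, detached object that no longer tracks the original.

```python
import copy
import pickle

state = {"frame": 0, "indices": [2, 3, 4]}

via_pickle = pickle.loads(pickle.dumps(state))
via_deepcopy = copy.deepcopy(state)

# Equal in content, but brand-new objects.
print(via_pickle == state, via_pickle is state)      # True False
print(via_deepcopy == state, via_deepcopy is state)  # True False

# Changes to the original are not seen by either copy.
state["frame"] = 99
print(via_pickle["frame"], via_deepcopy["frame"])    # 0 0
```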
I am in favor of removing _ANCHOR_UNIVERSES. Please add documentation that makes clear how AtomGroups are to be pickled.
… (PR MDAnalysis#2893)
* AtomGroups are now pickled/unpickled without looking for their anchored Universe
* removed core.universe._ANCHOR_UNIVERSES (originally introduced with MDAnalysis#293) and the related machinery for keeping global state; no global state is kept any more
* update docs for AtomGroup pickling
* update CHANGELOG
* update tests
Fixes #
Changes made in this Pull Request:
AtomGroup is now pickled without finding its anchored Universe
PR Checklist