Center of mass/gravity for all residues of a ResidueGroup #1053
Comments
@richardjgowers / @dotsdl how are we handling |
So currently |
There was some discussion in #411 on how "aggregate" functions should work at the SegmentGroup and ResidueGroup level. I think we concluded that singular method such as I don't think we introduced "plural" methods in #363 such as np.array([member.center_of_mass() for member in g]) Admittedly, having the plural s inside the method name as I wrote here is pretty confusing although grammatically correct. There was also the idea #385 (comment) to use something like In all cases, we would need to think if there are additional performance enhancements that we could do (in addition to the syntactic sugar). @mimischi , what's your primary concern: performance or ease of writing the operation? |
@orbeckst I was wondering about the performance. Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task. |
At the moment I can't think of a nice way to calculate the center of mass for a bunch of residues without a loop. residue_coms = np.array([r.atoms.center_of_mass() for r in u.select_atoms("protein").residues]) (and this would probably be the under-the-hood implementation of the requested feature). Btw, you might be able to speed up what you showed by using a list comprehension instead of an explicit |
I think the problem is a fair bit more tractable when all residues have the same shape (number of atoms). If we simplify for the case of the center of geometry (centroid) of each residue for an atomgroup with 12 residues, each of which has 10 atoms, then I suspect we'd want a data structure with this shape: (12, 10, 3). Then, I believe it should be possible to use a standard NumPy vectorized ufunc on the appropriate axis to determine the means (centroids). For the center of mass, I suspect we'd just add (one more?) vectorized operation that (multiplies by?) includes the weights for each atom in an array of the same shape. OK, so what about real data with heterogeneously sized residues (amino acids, lipids, etc.)? I suspect that we could simply The only thing I don't like about this approach is that we need to know the maximum number of atoms for any residue in the given atomgroup so that we can |
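To make the shape argument above concrete, here is a minimal sketch with plain NumPy on synthetic coordinates (no MDAnalysis objects involved). The helper names `centroids_uniform` and `centroids_padded` are hypothetical, and padding with NaN plus `np.nanmean` is only one assumed way to handle residues of different sizes; it also assumes the atoms of each residue are stored contiguously:

```python
import numpy as np

def centroids_uniform(positions, n_residues, n_atoms):
    """Centroid per residue when every residue has exactly n_atoms atoms."""
    # (n_residues * n_atoms, 3) -> (n_residues, n_atoms, 3), then average over atoms
    return positions.reshape(n_residues, n_atoms, 3).mean(axis=1)

def centroids_padded(positions, sizes):
    """Centroid per residue for residues of different sizes, via NaN padding."""
    sizes = np.asarray(sizes)
    max_size = sizes.max()  # the maximum residue size must be known up front
    padded = np.full((len(sizes), max_size, 3), np.nan)
    # True for the real atom slots of each residue, False for padding
    mask = np.arange(max_size) < sizes[:, np.newaxis]
    padded[mask] = positions
    # nanmean ignores the padding slots
    return np.nanmean(padded, axis=1)

rng = np.random.default_rng(0)

# 12 residues with 10 atoms each, as in the example above
uniform_pos = rng.random((12 * 10, 3))
uniform_centroids = centroids_uniform(uniform_pos, 12, 10)

# three residues with 4, 3 and 3 atoms
ragged_sizes = [4, 3, 3]
ragged_pos = rng.random((sum(ragged_sizes), 3))
ragged_centroids = centroids_padded(ragged_pos, ragged_sizes)
```

The padded version allocates `len(sizes) * max_size * 3` floats, so a single very large residue inflates the memory use for everything else.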
Ragged arrays aren't too impossible with a few extra arrays:
# The contents of each Group/Residue
g1 = [10, 12, 13, 14]
g2 = [51, 61, 71]
g3 = [8, 9, 10]
# The size of each group
sizes = [4, 3, 3]
# The contents of each group concatenated
identities = [10, 12, 13, 14, 51, 61, 71, 8, 9, 10]
offset = 0
for s in sizes:  # loop over groups
    for i in range(s):  # loop over atoms in this group
        atom = identities[offset + i]
    offset += s  # move on to the start of the next group
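The same sizes/offsets bookkeeping can also drive a vectorized reduction instead of a nested Python loop. The following is a rough sketch, not what MDAnalysis does internally: it uses `np.add.reduceat` on synthetic masses and coordinates, and it assumes every group is non-empty and stored contiguously (`reduceat` does not handle empty segments the way one might expect):

```python
import numpy as np

sizes = np.array([4, 3, 3])  # atoms per group, as above

# synthetic per-atom data in concatenated order (10 atoms in total)
masses = np.array([12.0, 1.0, 1.0, 1.0, 16.0, 1.0, 1.0, 14.0, 1.0, 1.0])
positions = np.arange(30, dtype=float).reshape(10, 3)

# index of the first atom of each group in the concatenated arrays
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))

# per-group sum of mass-weighted positions and per-group total mass
weighted_sums = np.add.reduceat(positions * masses[:, np.newaxis], starts, axis=0)
total_masses = np.add.reduceat(masses, starts)

coms = weighted_sums / total_masses[:, np.newaxis]  # shape (n_groups, 3)
```

Whether this beats the list comprehension in practice would need to be timed on a real system.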
Heterogeneous arrays aren't impossible at all; I'm just targeting raw performance. That's a nested for loop in pure Python, so I'm not sure it would fare much better than the original list comprehension approach. Could be cythonized though, I suppose. Even with the |
The NaN filling sounds complicated and a bit brittle – aren't glycans single residues? Lipids are, and you can have big lipids. I would try a hybrid approach for residues: You could group all residues with the same number of atoms into arrays, remember the residue indices, work on these blocks, and then assemble the results in the correct sequence. For all waters this should give a good speed-up, but even when cutting down a protein with 500 residues into 20 blocks with 500/20 = 25 residues each, you will probably see improvements. For segments, where we typically only have O(1) - O(10), we can probably just do the list comprehension. |
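A minimal sketch of this hybrid approach, again with plain NumPy on synthetic data. The function name `centers_of_mass_by_blocks` is hypothetical, and the code assumes that the atoms of each residue are contiguous in the input arrays:

```python
import numpy as np

def centers_of_mass_by_blocks(positions, masses, sizes):
    """Per-residue center of mass, processing residues of equal size together."""
    sizes = np.asarray(sizes)
    # index of the first atom of each residue
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    out = np.empty((len(sizes), 3))
    for n in np.unique(sizes):  # one vectorized pass per distinct residue size
        res_idx = np.flatnonzero(sizes == n)  # remember which residues these are
        # atom indices of the block, shape (n_block_residues, n)
        atom_idx = starts[res_idx][:, np.newaxis] + np.arange(n)
        m = masses[atom_idx]      # (n_block_residues, n)
        x = positions[atom_idx]   # (n_block_residues, n, 3)
        # write the block results back in the original residue order
        out[res_idx] = (x * m[..., np.newaxis]).sum(axis=1) / m.sum(axis=1)[:, np.newaxis]
    return out

rng = np.random.default_rng(1)
sizes = [3, 5, 3, 4, 5, 3]  # residues of three different sizes, interleaved
positions = rng.random((sum(sizes), 3))
masses = rng.uniform(1.0, 16.0, sum(sizes))
coms = centers_of_mass_by_blocks(positions, masses, sizes)
```

The Python-level loop runs once per distinct residue size rather than once per residue, which is where the hoped-for speed-up would come from.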
We can be fancy and use |
Glycolipids are usually treated as a single residue in CG. I think that makes more sense than having the molecule split in two or more pieces topologically for various reasons (they are often parametrized as a custom-made unit, and we are often interested in their properties as a unit, etc.). I think for glycoproteins the glycans can be separate residues though, depending on the FF maybe.
Maybe. Hard to say without actually trying and comparing different approaches. I suspect there are tradeoffs in terms of performance gained and the assumptions you can introduce. A more brittle solution might be faster because it can make more assumptions or preallocate larger arrays for vectorization. |
https://gist.github.com/richardjgowers/0a63f12fa207f26de201e586ee22f4d7 @mimischi I put together this to see what the fastest we can do is. It takes ~3.5 ms, and just constructing the AtomGroup from each residue is 3 ms, so it looks like the bottleneck is there now. |
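How to do the barycenter when you have a block of identical residues, e.g. all TIP4P waters with 4 atoms each:
import numpy as np
import MDAnalysis as mda
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)  # assumed: the universe is built from the test files
waters = u.select_atoms("resname SOL")
natoms = 4  # atoms per water residue
# per residue: sum of mass-weighted positions divided by the total mass
barycenters = ((waters.positions * waters.masses[:, np.newaxis]).reshape(-1, natoms, 3).sum(axis=1)
               / waters.masses.reshape(-1, natoms).sum(axis=1)[:, np.newaxis])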
Expected behaviour
Calling the methods center_of_mass or center_of_gravity on a ResidueGroup, I would expect to retrieve the CoM/CoG of each residue in an array:
Actual behaviour
These methods return the CoM/CoG of the whole selection, just like calling protein.center_of_mass() or protein.atoms.center_of_mass():
Current solution
This works, but seems somehow slow when called for each frame in a bigger system.
Current version of MDAnalysis: 0.15.0