Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Center of mass/gravity for all residues of a ResidueGroup #1053

Closed
mimischi opened this issue Oct 26, 2016 · 13 comments
Closed

Center of mass/gravity for all residues of a ResidueGroup #1053

mimischi opened this issue Oct 26, 2016 · 13 comments

Comments

@mimischi
Copy link
Contributor

Expected behaviour

Calling the methods center_of_mass or center_of_gravity on a ResidueGroup I would expect to retrieve the CoM/CoG of each residue in an array:

In [1]: import MDAnalysis as mda

In [2]: from MDAnalysis.tests.datafiles import PSF, DCD

In [3]: u = mda.Universe(PSF, DCD)

In [4]: protein = u.select_atoms('protein')

In [5]: protein.residues.center_of_mass()
Out [5]: [ 10.52567673   9.49548312  -8.15335145]
[ 13.90618512   4.59222315  -7.35309049]
[ 8.12902093  6.03077052 -3.66130656]
....
[  7.76611535  16.02359343   0.44614907]
[  7.75536069  14.75541562  -4.00511608]
[  6.74699649  18.21034743  -6.73924321]

Actual behaviour

These methods return the CoM/CoG of the whole selection, just like calling protein.center_of_mass() or protein.atoms.center_of_mass():

In [1]: import MDAnalysis as mda

In [2]: from MDAnalysis.tests.datafiles import PSF, DCD

In [3]: u = mda.Universe(PSF, DCD)

In [4]: protein = u.select_atoms('protein')

In [5]: protein.residues.center_of_mass()
Out [5]: array([-0.01094035,  0.05727601, -0.12885778])

Current solution

This works, but seems somehow slow when called for each frame in a bigger system.

for res in protein.residues:
    print res.center_of_mass()

Currently version of MDAnalysis:

0.15.0

@orbeckst
Copy link
Member

@richardjgowers / @dotsdl how are we handling ResidueGroup.center_of_mass() in #363?

@richardjgowers
Copy link
Member

So currently *Group.center_of_mass == *Group.atoms.center_of_mass() like in develop. It could be changed however..

@orbeckst
Copy link
Member

There was some discussion in #411 on how "aggregate" functions should work at the SegmentGroup and ResidueGroup level. I think we concluded that singular method such as g.center_of_mass() should always be equivalent to g.atoms.center_of_mass().

I don't think we introduced "plural" methods in #363 such as g.centers_of_mass() that would do

np.array([member.center_of_mass() for member in g])

Admittedly, having the plural s inside the method name as I wrote here is pretty confusing although grammatically correct. There was also the idea #385 (comment) to use something like ResidueGroup.byres("center_of_mass") to force a calculation on a per-residue basis.

In all cases, we would need to think if there are additional performance enhancements that we could do (in addition to the syntactic sugar).

@mimischi , what's your primary concern: performance or ease of writing the operation?

@mimischi
Copy link
Contributor Author

@orbeckst I was wondering about the performance.

Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task.

@orbeckst
Copy link
Member

Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task.

At the moment I can't think of a nice way to calculate the C.O.M. for a bunch of residues without a loop.

residue_coms =  np.array([r.atoms.center_of_mass() for r in u.select_atoms("protein").residues])

(and this would probably be the under-the-hood implementation of the requested feature). Btw, you might be able to speed-up what you showed by using a list comprehension instead of an explicit for loop.

@tylerjereddy
Copy link
Member

I think the problem is a fair bit more tractable when all residues have the same shape (number of atoms). If we simplify for the case of the center of geometry (centroid) of each residue for an atomgroup with 12 residues, each of which has 10 atoms, then I suspect we'd want a data structure with this shape: (12, 10, 3). Then, I believe it should be possible to use a standard numpy vectorized ufunc on the appropriate axis to determine the means (centroids). For the center of mass, I suspect we'd just add (one more?) vectorized operation that (multiplies by?) includes the weights for each atom in an array of the same shape.

Ok, so what about for real data with heterogenously-sized residues (amino acids, lipids, etc.)? I suspect that we could simply NaN-fill to produce a homogenous array. Generally, the fast vectorized numpy operations will not work on jagged arrays (which are typed as object). Once the NaN filling is done, we can repeat the above steps for the homogenous array, as numpy tends to gracefully handle / ignore NaN for many array operations & so combining the weights and masses with the appropriate axes / slicing shouldn't be too bad.

The only thing I don't like about this approach is that we need to know the maximum number of atoms for any residue in the given atomgroup so that we can NaN-fill appropriately. In a biomolecular context, I suspect we might get away with a user-adjustable default value that exceeds any reasonable atom count for a conventional amino acid, lipid, or nucleotide / glycan. I wonder if there's a fast way that MDAnalysis could parse this out during i.e., universe / topology initialization -- the other core devs would likely know more about this. That would probably allow automated NaN-filling in a less-awkward / more robust & topology-aware manner.

@richardjgowers
Copy link
Member

Ragged arrays aren't too impossible with a few extra arrays

# The contents of each Group/Residue
g1 = [10, 12, 13, 14]
g2 = [51, 61, 71]
g3 = [8, 9, 10]

# The size of each group
sizes = [4, 3, 3]
# The contents of each group concatenated
identities = [10, 12, 13, 14, 51, 61, 71, 8, 9, 10] 


offset = 0
for s in sizes:  # loop over groups
    for i in range(s):  # loop over atoms in this group
        atom = identities[offset + i]
    offset += s

@tylerjereddy
Copy link
Member

Heterogenous arrays aren't impossible at all--I'm just targeting raw performance. That's a nested for loop in pure python, so I'm not sure it would fare much better than the original list comprehension approach. Could be cythonized though, I suppose.

Even with the Nan-fill approach, I'm still thinking that a bit of creativity would be needed to avoid any looping. I guess you'd preallocate an empty array of the appropriate shape for n_residues and then somehow assign / add in the heterogenous coordinates in such a way that the residue atoms don't overlap. But the original and target arrays have a different shape, so that may be trickier than I thought.

@orbeckst
Copy link
Member

The NaN filling sounds complicated and a bit brittle – aren't glycans single residues? Lipids are, and you can have big lipids.

I would try a hybrid approach for residues: You could group all residues with same number of atoms into arrays, remember the residue indices, work on these blocks, and then assemble results in the correct sequence.

For all waters this should give a good speed-up but even cutting down a protein with 500 residues into 20 blocks with 500/20 = 25 residues each you will probably see improvements.

For segments, where we typically only have O(1) - O(10) we can probably just do the list comprehension.

@orbeckst
Copy link
Member

Admittedly, having the plural s inside the method name as I wrote here – ag.centers_of_mass() — is pretty confusing.

We can be fancy and use ag.barycenters() (W:barycenter) :-).

@tylerjereddy
Copy link
Member

aren't glycans single residues?

Glycolipids are usually treated as a single residue in CG. I think that makes more sense than having the molecule split in two or more pieces topologically for various reasons (they are often parametrized as a custom-made unit, and we are often interested in their properties as a unit, etc.). I think for glycoproteins the glycans can be separate residues though, depending on the FF maybe.

The NaN filling sounds complicated and a bit brittle

Maybe. Hard to say without actually trying and comparing different approaches. I suspect there are tradeoffs in terms of performance gained and the assumptions you can introduce. A more brittle solution might be faster because it can make more assumptions or preallocate larger arrays for vectorization.

@richardjgowers
Copy link
Member

https://gist.github.com/richardjgowers/0a63f12fa207f26de201e586ee22f4d7

@mimischi I put together this to see what the fastest we can do is. It takes ~3.5 ms, and just constructing the AtomGroup from each residue is 3 ms, so it looks like the bottleneck is there now.

@orbeckst
Copy link
Member

orbeckst commented Jun 14, 2017

How to do the barycenter when you have a block of identical residues, eg all TIP4P waters with 4 atoms each:

from MDAnalysisTests.datafiles import TPR, XTC
waters = u.select_atoms("resname SOL")

natoms = 4
barycenters = (waters.positions * waters.masses[:, np.newaxis]).reshape(-1, natoms, 3).mean(axis=1) 
   / waters.masses.reshape(-1, natoms).sum(axis=1)[:, np.newaxis]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants