nuclinfo.wc_pair extremely slow #3310
Did you measure the time it takes to analyze one base pair (e.g., using [...])? Looking at the code in mdanalysis/package/MDAnalysis/analysis/nuclinfo.py, lines 144 to 150 in 6b4eea0, [...]
In short, this module is in need of a rewrite (using our "modern" AnalysisBase approach). If you (or anyone else) are interested in working on it then please do: code contributions are very welcome. MDAnalysis is open source and driven by the users, in particular by users volunteering their time. The best person to write such code is a scientist who needs the code :-). The User Guide tells you how to get started contributing to MDAnalysis. |
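The timing question above can be answered with the standard library alone. The following is a minimal sketch, not a measurement of MDAnalysis itself: `analyze_one_pair` is a hypothetical stand-in for a single `wc_pair` call, and `seconds_per_call` is an illustrative helper name.

```python
import timeit


def analyze_one_pair():
    # Hypothetical stand-in for one wc_pair call; replace the body with
    # the real call on your own universe.
    return sum(x * x for x in range(1000))


def seconds_per_call(func, repeats=5, number=20):
    # Best-of-N timing: the minimum is least affected by background load.
    timings = timeit.repeat(func, repeat=repeats, number=number)
    return min(timings) / number


per_call = seconds_per_call(analyze_one_pair)
print(f"{per_call:.2e} s per call")
```

Multiplying the per-call time by the number of calls in the full loop gives a first estimate of the total run time.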
Given that you're saying it gets slower / hangs as it progresses, I'm wondering if you're maybe running out of memory? Do you have any insights on how the memory usage is increasing (i.e. from monitoring task manager)? |
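Besides watching the task manager, memory growth can be checked from inside Python with the standard-library `tracemalloc` module. This is only a sketch under assumptions: `leaky_step` is a hypothetical workload standing in for one iteration of the analysis loop, and `tracemalloc` only reports allocations that go through Python's memory allocators.

```python
import tracemalloc


def memory_growth(step, n_steps):
    """Return the traced memory in bytes after each of n_steps calls of step(i)."""
    tracemalloc.start()
    sizes = []
    try:
        for i in range(n_steps):
            step(i)
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
    finally:
        tracemalloc.stop()
    return sizes


# Hypothetical workload that keeps every result, so memory grows per step.
kept = []


def leaky_step(i):
    kept.append([0.0] * 10_000)


sizes = memory_growth(leaky_step, 5)
print(sizes)
```

A steadily increasing sequence points to results (or references) being retained across iterations; a flat sequence suggests the slowdown is not memory related.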
Thanks for the timely feedback! Sorry I should have thought to include more detailed profiling in my original post. I've done several tests and it looks like
I suspect now what I thought was the code 'hanging' before was just it taking a long time to run 40x40x10000 times, sorry for the confusion on that. Since this is such a neat/useful function I'll also have a look at the source and maybe see what I can do to speed it up. |
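To put the 40x40x10000 figure in perspective, here is a small worked estimate. It reads the numbers as bases x bases x frames, matching the loop in the original post; the per-call cost is an assumed placeholder, not a measured value.

```python
n_bases = 40
n_frames = 10_000

# One wc_pair call per (frame, base i, base j) combination.
total_calls = n_bases * n_bases * n_frames

# Assumption for illustration only: 10 ms per call.
assumed_seconds_per_call = 1e-2
estimated_hours = total_calls * assumed_seconds_per_call / 3600

print(total_calls, round(estimated_hours, 1))  # 16000000 44.4
```

Even a modest per-call cost adds up to many hours at this call count, which is consistent with the run appearing to hang.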
One possibility for improving and modernizing this code would be to write a modern Analysis class (based on analysis.base.AnalysisBase) that makes the required selections once during the We did something similar for the Dihedral analysis and got a speed up of ~100x compared to the previous naive approach. |
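The idea of making the selections once instead of in every frame can be illustrated without MDAnalysis. The sketch below is pure Python and all names in it (`CountingSelector`, `naive`, `select_once`) are hypothetical; it only counts how often an expensive "selection" runs under each strategy.

```python
class CountingSelector:
    """Hypothetical stand-in for an expensive selection call."""

    def __init__(self):
        self.calls = 0

    def select(self, i, j):
        self.calls += 1
        return (i, j)  # pretend this is a pair of atom indices


def naive(selector, pairs, n_frames):
    # Re-selects every pair in every frame.
    out = []
    for _ in range(n_frames):
        out.append([selector.select(i, j) for i, j in pairs])
    return out


def select_once(selector, pairs, n_frames):
    # Selects once up front, then reuses the result in every frame.
    selected = [selector.select(i, j) for i, j in pairs]
    return [list(selected) for _ in range(n_frames)]


pairs = [(i, j) for i in range(4) for j in range(4)]
s_naive, s_once = CountingSelector(), CountingSelector()
same = naive(s_naive, pairs, 100) == select_once(s_once, pairs, 100)
print(s_naive.calls, s_once.calls, same)  # 1600 16 True
```

Both strategies produce the same per-frame data, but the number of selection calls no longer scales with the number of frames.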
Is this something that there is still interest in improving? I've seen how much faster the Dihedral analysis became, and I don't think it would be too hard to replicate that here to improve performance. edit: adding more details |
@orbeckst I took your advice in May of last year and wrote it up as the class below in my OpenDNA (now E2EDNA2) code. Indeed, it runs extremely fast now. I didn't think it was general enough as written to port over to MDA. I had always meant to clean it up for this purpose but never got around to it. See here:

import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.lib.distances import calc_bonds


class atomDistances(AnalysisBase):
    def __init__(self, atomgroups, **kwargs):
        """
        :param atomgroups: a list of atomgroups for which the interatom bond lengths are calculated
        """
        super(atomDistances, self).__init__(atomgroups[0].universe.trajectory, **kwargs)
        self.atomgroups = atomgroups
        self.ag1 = atomgroups[0]
        self.ag2 = atomgroups[1]

    def _prepare(self):
        # one entry per frame is appended in _single_frame
        self.results = []

    def _single_frame(self):
        # pairwise distances ag1[k] to ag2[k], with periodic boundary conditions
        distance = calc_bonds(self.ag1.positions, self.ag2.positions, box=self.ag1.dimensions)
        self.results.append(distance)

    def _conclude(self):
        # stack the per-frame results into an (n_frames, n_pairs) array
        self.results = np.asarray(self.results)
@InfluenceFunctional thanks for sharing. Even though I agree that the atomDistances() class is simple, it is useful, and importantly, the fact that it is only a few lines of code that nevertheless correctly carry out the task at hand is great. Many people calculate distances, and often it's done incorrectly or slower than necessary. As such, I actually see value in adding something like your class to the analysis.distances module. I would open an issue for this. At the moment, we have many GSOC and Outreachy applicants here who are looking for manageable and well-defined issues and this would make for a great starter issue. May I include your code snippet as initial inspiration? Specifically, would you allow the code snippet in #3310 (comment) to be included in MDAnalysis or used as the basis for code with such functionality? |
@orbeckst please by all means go ahead and use it in whatever way it could be useful!

import MDAnalysis as mda
import numpy as np


def getWCDistTraj(u):
    """
    use the atomDistances class to calculate the WC base pairing distances between all bases on a sequence
    :param u: MDAnalysis Universe containing the sequence
    :return: array of distances with shape (n_frames, n_bases, n_bases)
    """
    n_bases = u.segments[0].residues.n_residues
    # integer arrays, because the entries are used as atom indices below
    atomIndices1 = np.zeros((n_bases, n_bases), dtype=int)
    atomIndices2 = np.zeros_like(atomIndices1)
    # identify relevant atoms for WC distance calculation
    for i in range(1, n_bases + 1):
        # the residue name only depends on i, so look it up once per i
        resname = u.select_atoms("resid {0!s}".format(i)).resnames[0]
        if resname in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
            a1, a2 = "N3", "N1"
        elif resname in ["DG", "DA", "A", "G", "ADE", "GUA"]:
            a1, a2 = "N1", "N3"
        for j in range(1, n_bases + 1):
            atoms = u.select_atoms("(resid {0!s} and name {1!s}) or (resid {2!s} and name {3!s})".format(i, a1, j, a2))
            # .ix is the 0-based index that mda.AtomGroup expects (.id is the id from the topology file)
            atomIndices1[i - 1, j - 1] = atoms[0].ix  # bond-mate 1
            if i == j:
                atomIndices2[i - 1, j - 1] = atoms[0].ix  # if it's the same base, we want the resulting distance to always be zero
            else:
                atomIndices2[i - 1, j - 1] = atoms[1].ix  # bond-mate 2
    # make a flat list of every combination
    bonds = [mda.AtomGroup(atomIndices1.flatten(), u), mda.AtomGroup(atomIndices2.flatten(), u)]
    na = atomDistances(bonds).run()
    traj = na.results.reshape(u.trajectory.n_frames, n_bases, n_bases)
    return traj
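The flatten-then-reshape step in getWCDistTraj relies on NumPy's default row-major ordering: the 0-based pair (i, j) lands at flat position i * n_bases + j, and the final reshape undoes exactly that mapping for each frame. A minimal pure-Python sketch of the index arithmetic (the helper names `flat_index` and `pair_from_flat` are illustrative, not part of the posted code):

```python
def flat_index(i, j, n_bases):
    """Flat position of the 0-based pair (i, j) in a row-major n_bases x n_bases grid."""
    return i * n_bases + j


def pair_from_flat(k, n_bases):
    """Inverse of flat_index: recover (i, j) from the flat position k."""
    return divmod(k, n_bases)


n_bases = 40
k = flat_index(2, 5, n_bases)
print(k, pair_from_flat(k, n_bases))  # 85 (2, 5)
```

With this layout, the distance time series for a single pair is the slice `traj[:, i, j]` of the returned array.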
Thanks, @InfluenceFunctional! Btw, if you want your software to be featured on https://www.mdanalysis.org/pages/used-by/, open a PR in https://github.com/MDAnalysis/MDAnalysis.github.io/ by editing https://github.com/MDAnalysis/MDAnalysis.github.io/blob/master/pages/used-by.md |
I'm using analysis.nuclinfo.wc_pair to profile base pair distances over long trajectories + long DNA sequences. It seems to be rather fast over ~1-100 executions, but when evaluated over a loop e.g.,

for t in all time_steps:
    for i in all bases:
        for j in all bases:
            distance[t,i,j]=nuclinfo.wc_pair(etc.)

When the total number of wc_pair evaluations reaches ~1e3 (rough guess), evaluation gets extremely slow, or indeed hangs and will not finish at all on my platform (MDAnalysis 1.0.0 on Python 3.8.5 on Windows).
Any guidance at all on why this might be the case would be very much appreciated. It evaluates rather quickly for a few iterations, but seems to hang when queried repeatedly over thousands of base pairs and time steps. This is an extremely useful function for my work in principle, but not if it keeps hanging for long trajectories and/or large DNA sequences.