Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nuclinfo.wc_pair extremely slow #3310

Closed
InfluenceFunctional opened this issue May 20, 2021 · 10 comments · Fixed by #3611
Closed

nuclinfo.wc_pair extremely slow #3310

InfluenceFunctional opened this issue May 20, 2021 · 10 comments · Fixed by #3611

Comments

@InfluenceFunctional
Copy link

I'm using analysis.nuclinfo.wc_pair to profile base pair distances over long trajectories + long DNA sequences. It seems to be rather fast over ~1-100 executions, but when evaluated over a loop e.g.,
for t in all time_steps:
for i in all bases:
for j in all bases:
distance[t,i,j]=nuclinfo.wc_pair(etc.)

When the total number of wc_pair evaluations reaches ~1e3 (rough guess), evaluation gets extremely slow, or indeed hangs and will not finish at all on my platform. (MDAnalysis 1.0.0 on python 3.8.5 on Windows).

Any guidance at all on why this might be the case would be very much appreciated. It evaluates rather quickly for few-iterations, but seems to hang when queried repeatedly over thousands of base pairs and time steps. This is an extremely useful function for my work in-principle, but not if it keeps hanging for long trajectories and/or large DNA sequences.

@orbeckst
Copy link
Member

Did you measure the time it takes to analyze one base pair (e.g., using %timeit) and then measured the time for a N basepairs? Is the time for N basepairs N times the time for one? Similarly, if you measure the time for one trajectory frame and M frames, does the time scale linearly with M? If it grows faster than linearly, then there's a problem somewhere. If it grows linearly, it's just slow.

Looking at the code

if universe.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
a1, a2 = "N3", "N1"
if universe.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]:
a1, a2 = "N1", "N3"
wc_dist = universe.select_atoms("(segid {0!s} and resid {1!s} and name {2!s}) "
"or (segid {3!s} and resid {4!s} and name {5!s}) "
.format(seg1, i, a1, seg2, bp, a2))
I see it doing a number of atom selections for every step and every base pair, which is admittedly not optimized for trajectory analysis, so it's likely slow. (This is one of the older modules in MDAnalysis...). A solution would be to rewrite the analysis with atom selections that are determined before iterating over frames. Furthermore, the distance calculations for all basepairs could be done in one calculation (and also take PBC into account, which they don't do at the moment), further speeding it up.

In short, this module is in need of a rewrite (using our "modern" AnalysisBase approach). If you (or anyone else) interested in working on it then please do: code contributions are very welcome — MDAnalysis is open source and driven by the users, in particular by users volunteering their time. The best person to write such code is a scientist who needs the code :-).

The User Guide tells you how to get started contributing to MDAnalysis.

@IAlibay
Copy link
Member

IAlibay commented May 20, 2021

@InfluenceFunctional

Given that you're saying it gets slower / hangs as it progresses, I'm wondering if you're maybe running out of memory? Do you have any insights on how the memory usage is increasing (i.e. from monitoring task manager)?

@InfluenceFunctional
Copy link
Author

@IAlibay @orbeckst

Thanks for the timely feedback! Sorry I should have thought to include more detailed profiling in my original post. I've done several tests and it looks like

  1. Memory usage is constant.
  2. Execution time scales linearly with the length of the trajectory, which makes sense.
  3. Execution time scales ~quadratically with the length of the DNA sequence, which also makes sense, since I'm computing all NxN base pair distances.

I suspect now what I thought was the code 'hanging' before was just it taking a long time to run 40x40x10000 times, sorry for the confusion on that.

Since this is such a neat/useful function I'll also have a look at the source and maybe see what I can do to speed it up.

@orbeckst
Copy link
Member

One possibility for improving and modernizing this code would be to write a modern Analysis class (based on analysis.base.AnalysisBase) that makes the required selections once during the _prepare() step and then uses fast distance calculations (namely lib.distances.calc_bonds()) to calculate all distances per time step. Using calc_bonds will also ensure that PBC are correctly taken into account.

We did something similar for the Dihedral analysis and got a speed up of ~100x compared to the previous naive approach.

@ALescoulie
Copy link
Contributor

ALescoulie commented Apr 3, 2022

is this something that there is still interest in improving? I've seen how dihedral improves speed and I don't think it would be too hard to replicate that here to improve performance.

edit: adding more details

@orbeckst
Copy link
Member

orbeckst commented Apr 3, 2022

I think we would really like this whole module to be transitioned to AnalysisBase style (I opened #3600) before working on it in earnest. If you want to tackle #3600 then please go for it and then you could always come back here.

@InfluenceFunctional
Copy link
Author

InfluenceFunctional commented Apr 3, 2022

@orbeckst I took your advice in May of last year and wrote it up as the below in my OpenDNA (now E2EDNA2) code. Indeed, it runs extremely fast now.

I didn't think it was general enough as-written to port over to MDA. I had always meant to clean it up for this purpose but never got around to it.

See here:
https://github.com/InfluenceFunctional/E2EDNA2/blob/main/analysisTools.py

class atomDistances(AnalysisBase):

    def __init__(self, atomgroups, **kwargs):
        """
        :param atomgroups: a list of atomgroups for which the interatom bond lengths are calculated
        """
        super(atomDistances, self).__init__(atomgroups[0].universe.trajectory, **kwargs)
        self.atomgroups = atomgroups
        self.ag1 = atomgroups[0]
        self.ag2 = atomgroups[1]

    def _prepare(self):
        self.results = []

    def _single_frame(self):
        distance = calc_bonds(self.ag1.positions, self.ag2.positions, box=self.ag1.dimensions)
        self.results.append(distance)

    def _conclude(self):
        self.results = np.asarray(self.results)

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

@InfluenceFunctional thanks for sharing. Even though I agree that the atomDistances() class is simple, it is useful, and importantly, the fact that it is only a few lines of codes that nevertheless correctly carry out the task at hand is great. Many people calculate distances, and often it's done incorrectly or slower than necessary. As such, I actually see value in adding something like your class to the analysis.distances module.

I would open an issue for this. At the moment, we have many GSOC and Outreachy applicants here who are looking for manageable and well-defined issues and this would make for a great starter issue. May I include your code snippet as initial inspiration? Specifically, would you allow the code snippet in #3310 (comment) to be included in MDAnalysis or used as the basis for code with such functionality?

@InfluenceFunctional
Copy link
Author

InfluenceFunctional commented Apr 7, 2022

@orbeckst please by all means go ahead and use it in whatever way it could be useful!
I just realized I had sent over only the distances part, rather than the DNA-specific part --- here is the context in which we use it in our code to track Watson-Crick bond lengths for all possible combinations of bases in a single DNA sequence, over the full trajectory.

def getWCDistTraj(u):
    """
    use the atomDistances class to calculate the WC base pairing distances between all bases on a sequence
    :param u:
    :return:
    """
    n_bases = u.segments[0].residues.n_residues
    atomIndices1 = np.zeros((n_bases, n_bases))
    atomIndices2 = np.zeros_like(atomIndices1)
    # identify relevant atoms for WC distance calculation
    for i in range(1, n_bases + 1):
        for j in range(1, n_bases + 1):
            if u.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
                a1, a2 = "N3", "N1"
            if u.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]:
                a1, a2 = "N1", "N3"
            atoms = u.select_atoms("(resid {0!s} and name {1!s}) or (resid {2!s} and name {3!s}) ".format(i, a1, j, a2))
            atomIndices1[i - 1, j - 1] = atoms[0].id  # bond-mate 1
            if i == j:
                atomIndices2[i - 1, j - 1] = atoms[0].id  # if it's the same base, we want the resulting distance to always be zero
            else:
                atomIndices2[i - 1, j - 1] = atoms[1].id  # bond-mate 2

    # make a flat list of every combination
    bonds = [mda.AtomGroup(atomIndices1.flatten(), u), mda.AtomGroup(atomIndices2.flatten(), u)]
    na = atomDistances(bonds).run()
    traj = na.results.reshape(u.trajectory.n_frames, n_bases, n_bases)

    return traj

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants