nuclinfo.wc_pair extremely slow #3310

InfluenceFunctional · 2021-05-20T03:12:38Z

I'm using analysis.nuclinfo.wc_pair to profile base pair distances over long trajectories + long DNA sequences. It seems to be rather fast over ~1-100 executions, but when evaluated over a loop e.g.,
for t in all time_steps:
for i in all bases:
for j in all bases:
distance[t,i,j]=nuclinfo.wc_pair(etc.)

When the total number of wc_pair evaluations reaches ~1e3 (rough guess), evaluation gets extremely slow, or indeed hangs and will not finish at all on my platform. (MDAnalysis 1.0.0 on python 3.8.5 on Windows).

Any guidance at all on why this might be the case would be very much appreciated. It evaluates rather quickly for few-iterations, but seems to hang when queried repeatedly over thousands of base pairs and time steps. This is an extremely useful function for my work in-principle, but not if it keeps hanging for long trajectories and/or large DNA sequences.

orbeckst · 2021-05-20T09:02:26Z

Did you measure the time it takes to analyze one base pair (e.g., using %timeit) and then measured the time for a N basepairs? Is the time for N basepairs N times the time for one? Similarly, if you measure the time for one trajectory frame and M frames, does the time scale linearly with M? If it grows faster than linearly, then there's a problem somewhere. If it grows linearly, it's just slow.

Looking at the code

mdanalysis/package/MDAnalysis/analysis/nuclinfo.py

Lines 144 to 150 in 6b4eea0

    
           if universe.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]: 
        
               a1, a2 = "N3", "N1" 
        
           if universe.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]: 
        
               a1, a2 = "N1", "N3" 
        
           wc_dist = universe.select_atoms("(segid {0!s} and resid {1!s} and name {2!s}) " 
        
                                           "or (segid {3!s} and resid {4!s} and name {5!s}) " 
        
                                           .format(seg1, i, a1, seg2, bp, a2))

I see it doing a number of atom selections for every step and every base pair, which is admittedly not optimized for trajectory analysis, so it's likely slow. (This is one of the older modules in MDAnalysis...). A solution would be to rewrite the analysis with atom selections that are determined before iterating over frames. Furthermore, the distance calculations for all basepairs could be done in one calculation (and also take PBC into account, which they don't do at the moment), further speeding it up.

In short, this module is in need of a rewrite (using our "modern" AnalysisBase approach). If you (or anyone else) interested in working on it then please do: code contributions are very welcome — MDAnalysis is open source and driven by the users, in particular by users volunteering their time. The best person to write such code is a scientist who needs the code :-).

The User Guide tells you how to get started contributing to MDAnalysis.

IAlibay · 2021-05-20T10:09:26Z

@InfluenceFunctional

Given that you're saying it gets slower / hangs as it progresses, I'm wondering if you're maybe running out of memory? Do you have any insights on how the memory usage is increasing (i.e. from monitoring task manager)?

InfluenceFunctional · 2021-05-21T15:17:46Z

@IAlibay @orbeckst

Thanks for the timely feedback! Sorry I should have thought to include more detailed profiling in my original post. I've done several tests and it looks like

Memory usage is constant.
Execution time scales linearly with the length of the trajectory, which makes sense.
Execution time scales ~quadratically with the length of the DNA sequence, which also makes sense, since I'm computing all NxN base pair distances.

I suspect now what I thought was the code 'hanging' before was just it taking a long time to run 40x40x10000 times, sorry for the confusion on that.

Since this is such a neat/useful function I'll also have a look at the source and maybe see what I can do to speed it up.

orbeckst · 2021-05-26T17:08:51Z

One possibility for improving and modernizing this code would be to write a modern Analysis class (based on analysis.base.AnalysisBase) that makes the required selections once during the _prepare() step and then uses fast distance calculations (namely lib.distances.calc_bonds()) to calculate all distances per time step. Using calc_bonds will also ensure that PBC are correctly taken into account.

We did something similar for the Dihedral analysis and got a speed up of ~100x compared to the previous naive approach.

ALescoulie · 2022-04-03T21:04:42Z

is this something that there is still interest in improving? I've seen how dihedral improves speed and I don't think it would be too hard to replicate that here to improve performance.

edit: adding more details

orbeckst · 2022-04-03T22:34:52Z

I think we would really like this whole module to be transitioned to AnalysisBase style (I opened #3600) before working on it in earnest. If you want to tackle #3600 then please go for it and then you could always come back here.

InfluenceFunctional · 2022-04-03T22:44:15Z

@orbeckst I took your advice in May of last year and wrote it up as the below in my OpenDNA (now E2EDNA2) code. Indeed, it runs extremely fast now.

I didn't think it was general enough as-written to port over to MDA. I had always meant to clean it up for this purpose but never got around to it.

See here:
https://github.com/InfluenceFunctional/E2EDNA2/blob/main/analysisTools.py

class atomDistances(AnalysisBase):

    def __init__(self, atomgroups, **kwargs):
        """
        :param atomgroups: a list of atomgroups for which the interatom bond lengths are calculated
        """
        super(atomDistances, self).__init__(atomgroups[0].universe.trajectory, **kwargs)
        self.atomgroups = atomgroups
        self.ag1 = atomgroups[0]
        self.ag2 = atomgroups[1]

    def _prepare(self):
        self.results = []

    def _single_frame(self):
        distance = calc_bonds(self.ag1.positions, self.ag2.positions, box=self.ag1.dimensions)
        self.results.append(distance)

    def _conclude(self):
        self.results = np.asarray(self.results)

orbeckst · 2022-04-07T18:04:02Z

@InfluenceFunctional thanks for sharing. Even though I agree that the atomDistances() class is simple, it is useful, and importantly, the fact that it is only a few lines of codes that nevertheless correctly carry out the task at hand is great. Many people calculate distances, and often it's done incorrectly or slower than necessary. As such, I actually see value in adding something like your class to the analysis.distances module.

I would open an issue for this. At the moment, we have many GSOC and Outreachy applicants here who are looking for manageable and well-defined issues and this would make for a great starter issue. May I include your code snippet as initial inspiration? Specifically, would you allow the code snippet in #3310 (comment) to be included in MDAnalysis or used as the basis for code with such functionality?

InfluenceFunctional · 2022-04-07T19:38:25Z

@orbeckst please by all means go ahead and use it in whatever way it could be useful!
I just realized I had sent over only the distances part, rather than the DNA-specific part --- here is the context in which we use it in our code to track Watson-Crick bond lengths for all possible combinations of bases in a single DNA sequence, over the full trajectory.

def getWCDistTraj(u):
    """
    use the atomDistances class to calculate the WC base pairing distances between all bases on a sequence
    :param u:
    :return:
    """
    n_bases = u.segments[0].residues.n_residues
    atomIndices1 = np.zeros((n_bases, n_bases))
    atomIndices2 = np.zeros_like(atomIndices1)
    # identify relevant atoms for WC distance calculation
    for i in range(1, n_bases + 1):
        for j in range(1, n_bases + 1):
            if u.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
                a1, a2 = "N3", "N1"
            if u.select_atoms(" resid {0!s} ".format(i)).resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]:
                a1, a2 = "N1", "N3"
            atoms = u.select_atoms("(resid {0!s} and name {1!s}) or (resid {2!s} and name {3!s}) ".format(i, a1, j, a2))
            atomIndices1[i - 1, j - 1] = atoms[0].id  # bond-mate 1
            if i == j:
                atomIndices2[i - 1, j - 1] = atoms[0].id  # if it's the same base, we want the resulting distance to always be zero
            else:
                atomIndices2[i - 1, j - 1] = atoms[1].id  # bond-mate 2

    # make a flat list of every combination
    bonds = [mda.AtomGroup(atomIndices1.flatten(), u), mda.AtomGroup(atomIndices2.flatten(), u)]
    na = atomDistances(bonds).run()
    traj = na.results.reshape(u.trajectory.n_frames, n_bases, n_bases)

    return traj

orbeckst · 2022-04-07T19:45:28Z

Thanks, @InfluenceFunctional!

Btw, if you want your software be featured on https://www.mdanalysis.org/pages/used-by/ open a PR in https://github.com/MDAnalysis/MDAnalysis.github.io/ by editing https://github.com/MDAnalysis/MDAnalysis.github.io/blob/master/pages/used-by.md

orbeckst added Component-Analysis performance labels May 20, 2021

ALescoulie mentioned this issue Apr 7, 2022

Higher Performance nuclinfo #3611

Merged

5 tasks

orbeckst linked a pull request Apr 7, 2022 that will close this issue

Higher Performance nuclinfo #3611

Merged

5 tasks

richardjgowers closed this as completed in #3611 Apr 19, 2022

orbeckst mentioned this issue Apr 22, 2022

add simple atomic distance analysis to analysis.distances #3654

Closed

xhgchen mentioned this issue Mar 30, 2023

Add simple atomic distance analysis (#3654) #4105

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nuclinfo.wc_pair extremely slow #3310

nuclinfo.wc_pair extremely slow #3310

InfluenceFunctional commented May 20, 2021

orbeckst commented May 20, 2021

IAlibay commented May 20, 2021

InfluenceFunctional commented May 21, 2021

orbeckst commented May 26, 2021

ALescoulie commented Apr 3, 2022 •

edited

Loading

orbeckst commented Apr 3, 2022

InfluenceFunctional commented Apr 3, 2022 •

edited by orbeckst

Loading

orbeckst commented Apr 7, 2022

InfluenceFunctional commented Apr 7, 2022 •

edited by orbeckst

Loading

orbeckst commented Apr 7, 2022 •

edited

Loading

nuclinfo.wc_pair extremely slow #3310

nuclinfo.wc_pair extremely slow #3310

Comments

InfluenceFunctional commented May 20, 2021

orbeckst commented May 20, 2021

IAlibay commented May 20, 2021

InfluenceFunctional commented May 21, 2021

orbeckst commented May 26, 2021

ALescoulie commented Apr 3, 2022 • edited Loading

orbeckst commented Apr 3, 2022

InfluenceFunctional commented Apr 3, 2022 • edited by orbeckst Loading

orbeckst commented Apr 7, 2022

InfluenceFunctional commented Apr 7, 2022 • edited by orbeckst Loading

orbeckst commented Apr 7, 2022 • edited Loading

ALescoulie commented Apr 3, 2022 •

edited

Loading

InfluenceFunctional commented Apr 3, 2022 •

edited by orbeckst

Loading

InfluenceFunctional commented Apr 7, 2022 •

edited by orbeckst

Loading

orbeckst commented Apr 7, 2022 •

edited

Loading