potential performance improvements for H-bond analysis 'between' #4130

Open
orbeckst opened this issue Apr 27, 2023 · 8 comments
Comments

orbeckst commented Apr 27, 2023

Is your feature request related to a problem?

In the review of PR #4092, @richardjgowers observed that there is an opportunity to improve the performance of the "between" keyword of HydrogenBondAnalysis, namely to filter the AtomGroups for "between" before capped_distance() is called to calculate distances.

Describe the solution you'd like

Follow up on the suggestion, implement the improvement, and benchmark the old code against the updated one.

Describe alternatives you've considered

Do nothing — that's fine, the update is not mission-critical.

Additional context

Discussion around PR #4092:

  • @richardjgowers : [...] would it make more sense to filter before the call to capped_distances?

    d_a_indices, d_a_distances = capped_distance(

  • @p-j-smith : good point - that would definitely be better, but it's probably not a straightforward change and I imagine there are lots of ways to do it. One way would be to iterate over the pairs of atom selections passed to between and calculate donor-acceptor distances for each pair. So e.g. if between=[["protein", "SOL"], ["protein", "protein"]], we would calculate:

    • protein donor to SOL acceptor distances
    • protein acceptor to SOL donor distances
    • protein donor to protein acceptor distances

    and concatenate them.
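The per-pair scheme described above can be sketched in pure NumPy. Note the assumptions: capped_distance_bruteforce is a toy stand-in for MDAnalysis.lib.distances.capped_distance, and the coordinates are random placeholders for the donor/acceptor positions of each selection.

```python
import numpy as np

def capped_distance_bruteforce(ref, conf, max_cutoff):
    """Toy stand-in for MDAnalysis.lib.distances.capped_distance:
    return (pairs, distances) for all pairs within max_cutoff."""
    diff = ref[:, None, :] - conf[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.nonzero(dist <= max_cutoff)
    return np.column_stack([i, j]), dist[i, j]

# Hypothetical per-selection coordinates (placeholders for real AtomGroup positions).
rng = np.random.default_rng(42)
coords = {"protein": rng.random((50, 3)) * 10, "SOL": rng.random((80, 3)) * 10}

between = [["protein", "SOL"], ["protein", "protein"]]
cutoff = 3.0

all_pairs, all_dists = [], []
for sel1, sel2 in between:
    # donor(sel1) -> acceptor(sel2), plus the reverse direction if the
    # two selections differ (protein-protein needs only one direction)
    directions = [(sel1, sel2)] if sel1 == sel2 else [(sel1, sel2), (sel2, sel1)]
    for d_sel, a_sel in directions:
        pairs, dists = capped_distance_bruteforce(coords[d_sel], coords[a_sel], cutoff)
        all_pairs.append(pairs)
        all_dists.append(dists)

# concatenate the per-pair results, as suggested above
d_a_distances = np.concatenate(all_dists)
```

For between=[["protein", "SOL"], ["protein", "protein"]] this results in exactly the three distance calculations listed above.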


RMCrean commented May 11, 2023

Hello,

I'd be happy to try to work on a PR for this, if that's okay? I'm not a GSoC person, though; I'm just keen to start contributing to some open source software.

Assuming yes, my understanding of the inefficiency is:

If a user has provided a selection via the between argument, the filtering occurs inside the method _single_frame(). However, before this filtering, distances are already calculated for atoms that will later be filtered out:

d_a_indices, d_a_distances = capped_distance(

The solution that I think you would like to see is:

  • Move the filtering with self.between_ags to an earlier step, before capped_distance() is called. I think it could be moved to the method _prepare(), but I'm not sure yet. Care is needed because (as written above) the between parameter can be a nested list.
  • Perform benchmarking to compare the speed. I was thinking of timing the code before and after the change using the MDAnalysis test files. For both versions I would test with all atoms (to confirm the code does not get any slower) and with a limited selection (to confirm it actually becomes faster).
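A timing harness for the benchmarking step could be as simple as the sketch below. The run_hbonds_old/run_hbonds_new functions are hypothetical placeholders standing in for "run the analysis with the current code" and "run it with the filtered code"; here they just do dummy work so the harness is self-contained.

```python
import timeit

def run_hbonds_old():
    sum(range(100_000))  # placeholder workload for the current implementation

def run_hbonds_new():
    sum(range(50_000))   # placeholder workload for the filtered implementation

def benchmark(func, repeat=5, number=1):
    # best-of-N is the conventional, least noise-sensitive summary
    return min(timeit.repeat(func, repeat=repeat, number=number))

t_old = benchmark(run_hbonds_old)
t_new = benchmark(run_hbonds_new)
print(f"old: {t_old:.4f} s  new: {t_new:.4f} s  speedup: {t_old / t_new:.2f}x")
```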

I have had a look at your documentation on how to contribute, and it's very clear, so thank you for that.

In summary: is it okay for me to start working on this, and have I understood the issue correctly? If so, I'll start working on a solution and a PR.

Thanks!
Rory


orbeckst commented May 12, 2023

@RMCrean you're more than welcome to work on the issue, outside of GSoC. The tag just means that we think it's suitable for potential GSoC candidates.

I think your understanding of the problem is generally correct but @p-j-smith is the expert here.

I don't think that code in _prepare() will always solve the problem, because that method is only called once, at the start. However, the contents of AtomGroups can change between steps (when using an UpdatingAtomGroup), so one may have to re-evaluate the filter at every step.

Please put up a PR! (EDIT: and please ensure that the PR mentions this issue so that the PR gets auto-linked with the issue)


RMCrean commented May 15, 2023

Great, thanks. I'll start working on this later today then!

Thanks for the point about _prepare() and UpdatingAtomGroup, I'll keep it in mind when working on the problem.


RMCrean commented May 17, 2023

Hi again. I've started working on this and have made some progress; I'm now confident I understand how the calculation works, but I'm not ready for a PR yet. I thought I could give a brief update, though.

To test whether the new implementation works correctly and how it compares timing-wise to the current implementation, I've made a copy of the hbond_analysis.py file (which won't be edited or committed).

I've then set up a small test trajectory that has both protein and water (the first 10 frames from MDAnalysisData.datasets.fetch_ifabp_water()).

Then I run through a number of different kinds of "between" selections using either the old implementation or the new implementation (once made) of hbond_analysis.py and measure how long the calculation takes on average. I will also check that the old and new forms of the calculation give the same results, by comparing the number of hydrogen bonds found as well as checking that results.hbonds is identical for the same between selection.
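The identity check described above could look like this sketch. The arrays are hypothetical stand-ins; in practice the rows would come from hbonds.results.hbonds of the two runs.

```python
import numpy as np

# hypothetical results arrays: each row stands in for one detected hydrogen
# bond, e.g. [frame, donor_idx, hydrogen_idx, acceptor_idx, distance, angle]
old = np.array([[0, 12, 13, 40, 2.8, 155.0],
                [1, 12, 13, 41, 2.9, 160.0]])
new = old.copy()

# same number of hydrogen bonds found
assert len(old) == len(new)

# sort rows before comparing, in case the two implementations
# enumerate donor-acceptor pairs in a different order
order_old = np.lexsort(old.T[::-1])
order_new = np.lexsort(new.T[::-1])
assert np.allclose(old[order_old], new[order_new])
```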

These are results for the original version of hbond_analysis.py:
(screenshot of timing results)

It seems at least promising that everything roughly takes the same amount of time (~11 seconds) regardless of the type of between selection used.

Does this sound like a reasonable way to do a comparison?

(Also happy to share code used etc.. to do this part even if of course it won't be part of the PR)

@orbeckst

Sounds like a decent start. I suggest you just put up a PR with what you have. You can mark it as draft if you like but it's generally a lot easier to discuss when code is visible.

For doing performance comparison, I would create two virtual environments (conda create -n mda_stable python=3.11 mdanalysis mdanalysisdata and conda create -n mda_dev python=3.11 ...). In the first one, you install the latest release so you have the standard version. In the second env, you install all pre-requisites and then install your development version of MDAnalysis with pip install -e package/ (i.e., a developer installation). You can then run your benchmarks in the two different environments mda_stable and mda_dev.

You can paste the code that you use for comparison into the comments of the PR.

We also have ASV benchmark code in benchmarks/benchmarks that is run nightly. These benchmarks are published at https://www.mdanalysis.org/benchmarks/ and it would be really great to have some for hydrogen bond analysis.
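An ASV benchmark for this analysis might look like the sketch below, following the airspeed velocity convention that setup() runs before timing and every method named time_* is timed. The class name, parameter values, and selection strings are assumptions, not existing benchmark code; the MDAnalysis imports are kept inside setup() so ASV can discover the class without importing the package at collection time.

```python
# sketch of a file under benchmarks/benchmarks/ (names are hypothetical)

class HydrogenBondAnalysisBench:
    # benchmark both the unfiltered case and a 'between' selection
    params = [None, [["protein", "resname SOL"]]]
    param_names = ["between"]

    def setup(self, between):
        import MDAnalysis as mda
        from MDAnalysis.tests.datafiles import PSF, DCD
        from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
        self.u = mda.Universe(PSF, DCD)
        self.HBA = HydrogenBondAnalysis
        self.between = between

    def time_run(self, between):
        # the timed region: construct and run the full analysis
        self.HBA(universe=self.u, between=between).run()
```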

@Sumit112192

I want to work on this if no one is working on it right now. @orbeckst


Sumit112192 commented Jan 5, 2024

So, here is what I have found.
I created a file check.py with the following code:

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis as HBA


u = mda.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.run()

I ran cProfile to check which part was relatively taking the most cumulative time. Here is the result
(screenshot of cProfile output)

The _get_dh_pairs function takes the most time in _single_frame. Also, within _get_dh_pairs, line 637,

donors = sum(h.bonded_atoms[0] for h in hydrogens) if hydrogens \
    else AtomGroup([], self.u)

takes the most time, enumerating all possible donors in the found hydrogens.
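A likely reason that sum() is slow is that each "+" presumably builds a brand-new group, making the whole reduction roughly quadratic. The toy stand-in below (deliberately not MDAnalysis code; Group mimics only the concatenation behavior) contrasts that pattern with collecting the indices once:

```python
import timeit

class Group:
    """Toy stand-in for an AtomGroup: '+' builds a brand-new group each time."""
    def __init__(self, indices):
        self.indices = list(indices)
    def __add__(self, other):
        return Group(self.indices + other.indices)  # copies both operands
    def __radd__(self, other):
        return self if other == 0 else NotImplemented  # lets sum() start from 0

groups = [Group([i]) for i in range(2000)]

def concat_with_sum():
    return sum(groups)          # ~O(n^2): every '+' recopies the running total

def concat_once():
    out = []
    for g in groups:            # ~O(n): gather indices, build one group at the end
        out.extend(g.indices)
    return Group(out)

t_sum = timeit.timeit(concat_with_sum, number=3)
t_once = timeit.timeit(concat_once, number=3)
```

Both variants produce the same group, but the single-pass version avoids the repeated copying.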

To leverage the between keyword, we could set self.donors_sel to the value passed in between, which would avoid the enumeration and might make the program run faster.

Please let me know if my understanding is correct so I can work on it further. @orbeckst


orbeckst commented Jan 5, 2024

Without directly looking at the code I don't know if this will solve the problem. Try it out and open a PR. Check that your changes don't break the tests. Benchmark.
Once we can look at code, it's a lot easier to follow your logic.
