potential performance improvements for H-bond analysis 'between' #4130

Open
orbeckst opened this issue Apr 27, 2023 · 8 comments
Comments

orbeckst commented Apr 27, 2023

Is your feature request related to a problem?

In the review of PR #4092, @richardjgowers observed that there is an opportunity to improve the performance of the "between" keyword of HydrogenBondAnalysis, namely to filter the AtomGroups for "between" before capped_distance() is called to calculate distances.

Describe the solution you'd like

Follow up on the suggestion, implement the improvement, and benchmark the old code against the updated one.

Describe alternatives you've considered

Do nothing — that's fine, the update is not mission-critical.

Additional context

Discussion around PR #4092:

  • @richardjgowers : [...] would it make more sense to filter before the call to capped_distances?

    d_a_indices, d_a_distances = capped_distance(

  • @p-j-smith : good point - that would definitely be better, but it's probably not a straightforward change and I imagine there are lots of ways to do it. One way would be to iterate over the pairs of atom selections passed to between and calculate donor-acceptor distances for each pair. So e.g. if between=[["protein", "SOL"], ["protein", "protein"]], we would calculate:

    • protein donor to SOL acceptor distances
    • protein acceptor to SOL donor distances
    • protein donor to protein acceptor distances

    and concatenate them.
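The per-pair scheme described above can be sketched in pure NumPy. Note the assumptions: capped_distance_bruteforce is a toy stand-in for MDAnalysis.lib.distances.capped_distance, and the coordinates are random placeholders for the donor/acceptor positions of each selection.

```python
import numpy as np

def capped_distance_bruteforce(ref, conf, max_cutoff):
    """Toy stand-in for MDAnalysis.lib.distances.capped_distance:
    return (pairs, distances) for all pairs within max_cutoff."""
    diff = ref[:, None, :] - conf[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.nonzero(dist <= max_cutoff)
    return np.column_stack([i, j]), dist[i, j]

# Hypothetical per-selection coordinates (placeholders for real AtomGroup positions).
rng = np.random.default_rng(42)
coords = {"protein": rng.random((50, 3)) * 10, "SOL": rng.random((80, 3)) * 10}

between = [["protein", "SOL"], ["protein", "protein"]]
cutoff = 3.0

all_pairs, all_dists = [], []
for sel1, sel2 in between:
    # donor(sel1) -> acceptor(sel2), plus the reverse direction if the
    # two selections differ (protein-protein needs only one direction)
    directions = [(sel1, sel2)] if sel1 == sel2 else [(sel1, sel2), (sel2, sel1)]
    for d_sel, a_sel in directions:
        pairs, dists = capped_distance_bruteforce(coords[d_sel], coords[a_sel], cutoff)
        all_pairs.append(pairs)
        all_dists.append(dists)

# concatenate the per-pair results, as suggested above
d_a_distances = np.concatenate(all_dists)
```

For between=[["protein", "SOL"], ["protein", "protein"]] this results in exactly the three distance calculations listed above.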


RMCrean commented May 11, 2023

Hello,

I'd be happy to try to work on a PR for this, if that's okay? I'm not a GSoC person, though; I'm just keen to start contributing to some open source software.

Assuming yes, my understanding of the inefficiency is:

If a user has provided a selection via the between argument, the filtering occurs inside the method _single_frame(). However, before this filtering, distances are already calculated for atoms that will later be filtered out:

d_a_indices, d_a_distances = capped_distance(

The solution that I think you would like to see is:

  • Move the filtering with self.between_ags to an earlier step, before capped_distance() is called. I think it could be moved to the method _prepare(), but I'm not sure yet. Care is needed because (as written above) the between parameter can be a nested list.
  • Perform benchmarking to compare the speed. I was thinking of timing the code before and after the change using the MDAnalysis test files. For both versions I would test with all atoms (to confirm the code does not get any slower) and with a limited selection (to confirm it actually becomes faster).
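A timing harness for the benchmarking step could be as simple as the sketch below. The run_hbonds_old/run_hbonds_new functions are hypothetical placeholders standing in for "run the analysis with the current code" and "run it with the filtered code"; here they just do dummy work so the harness is self-contained.

```python
import timeit

def run_hbonds_old():
    sum(range(100_000))  # placeholder workload for the current implementation

def run_hbonds_new():
    sum(range(50_000))   # placeholder workload for the filtered implementation

def benchmark(func, repeat=5, number=1):
    # best-of-N is the conventional, least noise-sensitive summary
    return min(timeit.repeat(func, repeat=repeat, number=number))

t_old = benchmark(run_hbonds_old)
t_new = benchmark(run_hbonds_new)
print(f"old: {t_old:.4f} s  new: {t_new:.4f} s  speedup: {t_old / t_new:.2f}x")
```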

I have had a look at your documentation on how to contribute, and it's very clear, so thank you for that.

In summary: is it okay for me to start working on this, and have I understood the issue correctly? If so, I'll start working on a solution and a PR.

Thanks!
Rory


orbeckst commented May 12, 2023

@RMCrean you're more than welcome to work on the issue, outside of GSoC. The tag just means that we think it's suitable for potential GSoC candidates.

I think your understanding of the problem is generally correct but @p-j-smith is the expert here.

I don't think that code in _prepare() will always solve the problem, because that method is only called once, at the start. However, the contents of AtomGroups can change between steps (when using an UpdatingAtomGroup), so one may have to re-evaluate the filter at every step.

Please put up a PR! (EDIT: and please ensure that the PR mentions this issue so that the PR gets auto-linked with the issue)


RMCrean commented May 15, 2023

Great, thanks. I'll start working on this later today then!

Thanks for the point about _prepare() and UpdatingAtomGroup, I'll keep it in mind when working on the problem.


RMCrean commented May 17, 2023

Hi again. I've started working on this and have made some progress; I'm now confident I understand how the calculation works, but I'm not ready for a PR yet. I thought I could give a brief update, though.

To test whether the new implementation works correctly and how it compares timing-wise to the current implementation, I've made a copy of the hbond_analysis.py file (which won't be edited or committed).

I've then set up a small test trajectory that has both protein and water (the first 10 frames from MDAnalysisData.datasets.fetch_ifabp_water()).

Then I run through a number of different kinds of "between" selections using either the old implementation or the new implementation (once made) of hbond_analysis.py and measure how long the calculation takes on average. I will also check that the old and new forms of the calculation give the same results, by comparing the number of hydrogen bonds found as well as checking that results.hbonds is identical for the same between selection.
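The identity check described above could look like this sketch. The arrays are hypothetical stand-ins; in practice the rows would come from hbonds.results.hbonds of the two runs.

```python
import numpy as np

# hypothetical results arrays: each row stands in for one detected hydrogen
# bond, e.g. [frame, donor_idx, hydrogen_idx, acceptor_idx, distance, angle]
old = np.array([[0, 12, 13, 40, 2.8, 155.0],
                [1, 12, 13, 41, 2.9, 160.0]])
new = old.copy()

# same number of hydrogen bonds found
assert len(old) == len(new)

# sort rows before comparing, in case the two implementations
# enumerate donor-acceptor pairs in a different order
order_old = np.lexsort(old.T[::-1])
order_new = np.lexsort(new.T[::-1])
assert np.allclose(old[order_old], new[order_new])
```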

These are results for the original version of hbond_analysis.py:
(screenshot of timing results)

It seems at least promising that everything roughly takes the same amount of time (~11 seconds) regardless of the type of between selection used.

Does this sound like a reasonable way to do a comparison?

(Also happy to share code used etc.. to do this part even if of course it won't be part of the PR)

@orbeckst

Sounds like a decent start. I suggest you just put up a PR with what you have. You can mark it as draft if you like but it's generally a lot easier to discuss when code is visible.

For doing performance comparison, I would create two virtual environments (conda create -n mda_stable python=3.11 mdanalysis mdanalysisdata and conda create -n mda_dev python=3.11 ...). In the first one, you install the latest release so you have the standard version. In the second env, you install all pre-requisites and then install your development version of MDAnalysis with pip install -e package/ (i.e., a developer installation). You can then run your benchmarks in the two different environments mda_stable and mda_dev.

You can paste the code that you use for comparison into the comments of the PR.

We also have ASV benchmark code in benchmarks/benchmarks that is run nightly. These benchmarks are published at https://www.mdanalysis.org/benchmarks/ and it would be really great to have some for hydrogen bond analysis.
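An ASV benchmark for this analysis might look like the sketch below, following the airspeed velocity convention that setup() runs before timing and every method named time_* is timed. The class name, parameter values, and selection strings are assumptions, not existing benchmark code; the MDAnalysis imports are kept inside setup() so ASV can discover the class without importing the package at collection time.

```python
# sketch of a file under benchmarks/benchmarks/ (names are hypothetical)

class HydrogenBondAnalysisBench:
    # benchmark both the unfiltered case and a 'between' selection
    params = [None, [["protein", "resname SOL"]]]
    param_names = ["between"]

    def setup(self, between):
        import MDAnalysis as mda
        from MDAnalysis.tests.datafiles import PSF, DCD
        from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
        self.u = mda.Universe(PSF, DCD)
        self.HBA = HydrogenBondAnalysis
        self.between = between

    def time_run(self, between):
        # the timed region: construct and run the full analysis
        self.HBA(universe=self.u, between=between).run()
```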

@Sumit112192

I want to work on this if no one is working on it right now. @orbeckst


Sumit112192 commented Jan 5, 2024

So, here is what I have found.
I created a file check.py with the following code:

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis as HBA


u = mda.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.run()

I ran cProfile to check which part was relatively taking the most cumulative time. Here is the result
(screenshot of cProfile output)

The _get_dh_pairs function takes the most time in _single_frame. Also, within _get_dh_pairs, line 637,

donors = sum(h.bonded_atoms[0] for h in hydrogens) if hydrogens \
    else AtomGroup([], self.u)

takes the most time, enumerating all possible donors in the found hydrogens.
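A likely reason that sum() is slow is that each "+" presumably builds a brand-new group, making the whole reduction roughly quadratic. The toy stand-in below (deliberately not MDAnalysis code; Group mimics only the concatenation behavior) contrasts that pattern with collecting the indices once:

```python
import timeit

class Group:
    """Toy stand-in for an AtomGroup: '+' builds a brand-new group each time."""
    def __init__(self, indices):
        self.indices = list(indices)
    def __add__(self, other):
        return Group(self.indices + other.indices)  # copies both operands
    def __radd__(self, other):
        return self if other == 0 else NotImplemented  # lets sum() start from 0

groups = [Group([i]) for i in range(2000)]

def concat_with_sum():
    return sum(groups)          # ~O(n^2): every '+' recopies the running total

def concat_once():
    out = []
    for g in groups:            # ~O(n): gather indices, build one group at the end
        out.extend(g.indices)
    return Group(out)

t_sum = timeit.timeit(concat_with_sum, number=3)
t_once = timeit.timeit(concat_once, number=3)
```

Both variants produce the same group, but the single-pass version avoids the repeated copying.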

To leverage the between keyword, we could set self.donors_sel to the value passed in between, which would avoid the enumeration and might make the program run faster.

Please let me know if my understanding is correct so I can work on it further. @orbeckst


orbeckst commented Jan 5, 2024

Without directly looking at the code I don't know if this will solve the problem. Try it out and open a PR. Check that your changes don't break the tests. Benchmark.
Once we can look at code, it's a lot easier to follow your logic.
