
SurvivalProbability - Readability, performance, and algorithm changes #1995

Merged
merged 64 commits on Aug 14, 2018

Conversation

bieniekmateusz
Member

@bieniekmateusz bieniekmateusz commented Jul 18, 2018

Small changes:

  • Performance (16 minutes down to 60 seconds):
    -- removed unnecessary function-call overhead,
    -- the dataset is now iterated over only once (memory optimisation)
  • More thorough documentation of the parameters and additional sanity checks in __init__
  • Temporarily removed the progress meter, which only reported on loading the simulation dataset and could confuse the reader
  • Allows the user to access the distribution for each tau through sp.sp_timeseries
  • Returns the taus along with the timeseries, opening the way for dt; tau = 0 is no longer returned (see the usage sketch after this list)
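As a rough illustration of the interface changes described above (a sketch only: the file names and selection string are placeholders, and the exact constructor/return signature may differ from what was finally merged):

```python
import MDAnalysis as mda
from MDAnalysis.analysis.waterdynamics import SurvivalProbability

u = mda.Universe("topology.psf", "trajectory.dcd")   # placeholder files
sp = SurvivalProbability(u, "byres name OH2 and around 4 protein")

# Per the bullets above, run() now returns the taus together with the
# survival-probability timeseries, starting at tau = 1 (no tau = 0 point).
taus, sp_values = sp.run(tau_max=20)

# The distribution of SP samples gathered for each tau stays accessible
# through sp.sp_timeseries for users who want to dig deeper.
print(taus)
print(sp.sp_timeseries)
```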

Validity Test Cases Changes:

  • Replaced the arbitrary test cases with defined datasets and predictable taus
  • Added a test case where no atom IDs are found. In this case we return NaN for that tau, because a value of 0 would mean that the initial molecules leave after the given tau; if there are no initial molecules in the first place, that would be the wrong conclusion (see the sketch below).

Algorithmic Changes made in this Pull Request - Request for Contribution from the Original Author:

  • We modified the algorithm so that it is applicable to other atom groups (e.g. ions) besides water. In particular, we changed how the algorithm behaves when no molecules are found in the reference frame. Previously, one could expect the reference frame to always contain some water molecules; if it did not, each tau was diluted, i.e. divided by (Nt + 1). Because this was unlikely to happen with any reasonable water selection, the case did not matter. However, consider using SP for the survival of ions in a given area, with frames where no ions are found in the first place (the reference frame). For such frames, diluting the taus implies that the ions leave earlier, whereas in fact these ions might simply diffuse into the selected area only rarely. Rare diffusion into the selected area should not affect how long the ions stay once there (i.e. their SurvivalProbability). A minimal sketch of the new behaviour follows. Please let us know if we missed anything.
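The following is a minimal, self-contained sketch of the idea, not the MDAnalysis implementation (toy data structures; the intersection over the whole window assumes the continuous-presence definition of survival):

```python
import numpy as np

def sp_for_tau(ids_per_frame, tau):
    """Toy survival probability for a single tau, averaged over reference frames."""
    values = []
    for t0 in range(len(ids_per_frame) - tau):
        n_t0 = len(ids_per_frame[t0])
        if n_t0 == 0:
            # New behaviour: an empty reference frame measures nothing, so it
            # is skipped instead of dragging the average towards zero (the old
            # code effectively diluted each tau in this case).
            continue
        # Molecules present continuously from t0 to t0 + tau.
        survivors = set.intersection(*ids_per_frame[t0:t0 + tau + 1])
        values.append(len(survivors) / n_t0)
    # If no reference frame contained any molecules at all, report NaN, not 0.
    return np.mean(values) if values else np.nan

# Ions enter the region only at frames 2-4; earlier empty frames are ignored.
frames = [set(), set(), {1, 2}, {1, 2}, {1}, set()]
print(sp_for_tau(frames, tau=1))            # 0.5
print(sp_for_tau([set(), set()], tau=1))    # nan: nothing to measure
```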

Questions:

  • Since this algorithm is applicable to any mobile atoms in a system, we suggest that this analysis be moved out of waterdynamics into a more general module.

Future considerations:

  • Adding a progress bar
  • Adding dt to allow sampling the dataset with a larger shift between reference frames, which is faster (and possibly discussing data correlations for larger taus)

Edit: Acknowledgement - This work was done together with @p-j-smith, but we submitted most of it on my account.

PR Checklist

  • [x] Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

bieniekmateusz and others added 22 commits June 21, 2018 16:49
Atom selections at each frame are now stored as sets rather than lists,
so determining the intersection of selections is faster.
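As an aside, a tiny self-contained illustration of why this helps (toy id lists, not the project's code):

```python
# Overlap of two id collections: list membership scans are O(n*m),
# while set intersection is roughly O(n + m).
ids_t0 = list(range(0, 10000, 2))   # ids selected at the reference frame
ids_t1 = list(range(0, 10000, 3))   # ids selected tau frames later

slow = [i for i in ids_t0 if i in ids_t1]   # repeated linear scans
fast = set(ids_t0) & set(ids_t1)            # hash-based intersection
assert set(slow) == fast
```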

Survival probability is now calculated by a single function.
The survival probability is a measure of how likely a selected molecule
is to remain within a specified region over a period of time. If at t0
the selected molecule is not in the specified region, no survival
probability can be calculated.

For example, consider that we would like to calculate the survival
probability of an ion at an interface. If the ion never approaches the
interface, the previous test would return a survival probability of 0,
which suggests that the ion comes to the interface but leaves
immediately. However, as the ion was never at the interface, no
survival probability can be calculated from this data.
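For reference, the quantity being described can be written as follows (a standard form; whether N(t, t+τ) requires continuous presence over the whole window or only presence at t and t+τ depends on the chosen definition):

```latex
SP(\tau) \;=\; \left\langle \frac{N(t,\, t+\tau)}{N(t)} \right\rangle_{t\,:\,N(t) > 0}
```

where N(t) is the number of selected molecules in the region at frame t, N(t, t+τ) is the number of those still in the region τ frames later, and the average runs only over reference frames with N(t) > 0.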
-return the list rather than requiring the user to access the internal list
-use the tau_max variable name, which is consistent with the equation, in contrast to dtmax
… and more times over tau. This leads to faster code (local memory/cache optimisation). Additionally, it opens the door to combining multiple taus together.
…Or should the first data point reflect the first non-zero tau, so that the first change is quantified? I believe it is the latter. This breaks the current tests, which, for example, when asking for 4 data points, always get a first data point of 1 at tau = 0.
code and no performance drawbacks.

Tests: the user should not be forced to rely on the internal data
structure .timeseries. The results are now returned by the run() function.
numpy arrays.

Changed the name of tau_timeseries to sp_timeseries as it is a more
accurate descriptor.

Removed some of the +/- 1 in indices to improve readability.
Saving the extracted data for the user to be able to dig deeper, if
necessary.
an atom group, to reduce the memory load.

Updated the docstring.
Waterdynamics: removed an index bug where tau went 1 too high.
@richardjgowers
Member

@bieniekmateusz thanks, this looks like it will make some good changes.

WRT "faster shift for dt", are you talking about doing every 2nd frame rather than every frame?

@bieniekmateusz
Member Author

@richardjgowers Thanks. Also, I still need to add your test case for the t0 case, which I hope to include in the next version along with the progress meter.

Your question: Yes, every nth frame. We are considering letting the user specify the jump over the trajectory. However, we want to first check how, for example, tau = 20 is affected by being calculated for overlapping datasets (t=0-20, t=1-21, ...). We are not sure how (and if) that affects the SP yet. Any thoughts on it are welcome.
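A hedged sketch of what sampling every nth reference frame could look like (the stride/dt parameter is only a proposal at this point; toy data and illustrative names):

```python
# Toy data: which ids are inside the region at each frame.
ids_per_frame = [set(range(i % 7)) for i in range(100)]
tau, stride = 5, 3

samples = []
for t0 in range(0, len(ids_per_frame) - tau, stride):   # shift t0 by `stride`
    n0 = len(ids_per_frame[t0])
    if n0 == 0:
        continue            # empty reference frame: nothing to measure
    kept = ids_per_frame[t0] & ids_per_frame[t0 + tau]
    samples.append(len(kept) / n0)

# A larger stride reduces the overlap (and hence the correlation) between
# consecutive windows, at the cost of fewer samples per tau.
print(sum(samples) / len(samples))
```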

@richardjgowers
Member

For the case where no molecules are found, it makes sense to disregard this run as you can't measure anything. It's not that the molecule had 0 survival probability.

So for hydrogen bond autocorrelation I let people define a total duration (tau_max here) and a number of frames within that duration to consider. I'm not completely happy with that solution either, so it would be good to define a clear & comprehensive way/language to define this sort of stuff.

I need to read through this module fully but..

Reducing the resolution of frames should be fine in some cases. We can actually check after the calculation whether the calculated tau makes sense with respect to how we sampled it. Then issue some sort of warning if the data is probably junk (and suggest better settings).

Overlapping frames should be fine. The concentration in an area will remain constant, so starting a new sample in the same window will identify different "starting molecules". It will be slightly correlated with the old sample, though (so it isn't 100% new data).

@codecov

codecov bot commented Jul 18, 2018

Codecov Report

Merging #1995 into develop will decrease coverage by 0.08%.
The diff coverage is 57.77%.


@@             Coverage Diff             @@
##           develop    #1995      +/-   ##
===========================================
- Coverage    88.93%   88.84%   -0.09%     
===========================================
  Files          144      144              
  Lines        17490    17497       +7     
  Branches      2693     2702       +9     
===========================================
- Hits         15554    15545       -9     
- Misses        1323     1332       +9     
- Partials       613      620       +7
Impacted Files                                  Coverage Δ
package/MDAnalysis/analysis/waterdynamics.py    80.94% <57.77%> (-3.51%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bieniekmateusz
Member Author

I've updated the last test case (t0) to avoid arbitrary datasets and to test tf in the same test case.

@richardjgowers richardjgowers self-assigned this Jul 19, 2018
@bieniekmateusz
Member Author

I did the rebase, which was not too much trouble in the end. Cheers

@bieniekmateusz
Member Author

Error in testsuite/MDAnalysisTests/topology/test_pqr.py:93:

Could it be that the rebase introduced this issue, since I have not touched the topology files? Do you have any suggestions on what I should do to correct the tests? Thanks

@bieniekmateusz
Member Author

I've updated the example to match the data being stored in the object's fields.

Member

@richardjgowers richardjgowers left a comment


LGTM, thanks @bieniekmateusz

@bieniekmateusz
Member Author

and @p-j-smith!

We're happy to hear that. And thanks to you too!
