
SurvivalProbability - Readability, performance, and algorithm changes #1995

Merged
merged 64 commits on Aug 14, 2018

Conversation

bieniekmateusz
Member

@bieniekmateusz bieniekmateusz commented Jul 18, 2018

Small changes:

  • Performance (16 minutes down to 60 seconds):
    -- removed unnecessary function-call overhead,
    -- the dataset is now iterated over only once (memory optimisation)
  • More thorough documentation of the parameters and additional sanity checks in __init__
  • Temporarily removed the progress meter, which only reported on loading the simulation dataset and could confuse the reader
  • Allows the user to access the distribution for each tau through sp.sp_timeseries
  • Returns the taus along with the timeseries, opening the way for dt; tau = 0 is no longer returned (see the usage sketch after this list)
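As a rough illustration of the interface changes described above (a sketch only: the file names and selection string are placeholders, and the exact constructor/return signature may differ from what was finally merged):

```python
import MDAnalysis as mda
from MDAnalysis.analysis.waterdynamics import SurvivalProbability

u = mda.Universe("topology.psf", "trajectory.dcd")   # placeholder files
sp = SurvivalProbability(u, "byres name OH2 and around 4 protein")

# Per the bullets above, run() now returns the taus together with the
# survival-probability timeseries, starting at tau = 1 (no tau = 0 point).
taus, sp_values = sp.run(tau_max=20)

# The distribution of SP samples gathered for each tau stays accessible
# through sp.sp_timeseries for users who want to dig deeper.
print(taus)
print(sp.sp_timeseries)
```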

Validity Test Cases Changes:

  • Replaced the arbitrary test cases with defined datasets and predictable taus
  • Added a test case where no atom IDs are found. In this case we return NaN for that tau, because a value of 0 would mean that the initial molecules leave after the given tau; if there are no initial molecules in the first place, that would be the wrong conclusion (see the sketch below).

Algorithmic Changes made in this Pull Request - Request for Contribution from the Original Author:

  • We modified the algorithm so that it is applicable to other atom groups (e.g. ions) besides water. In particular, we changed how the algorithm behaves when no molecules are found in the reference frame. Previously, one could expect the reference frame to always contain some water molecules; if it did not, each tau was diluted, i.e. divided by (Nt + 1). Because this was unlikely to happen with any reasonable water selection, the case did not matter. However, consider using SP for the survival of ions in a given area, with frames where no ions are found in the first place (the reference frame). For such frames, diluting the taus implies that the ions leave earlier, whereas in fact these ions might simply diffuse into the selected area only rarely. Rare diffusion into the selected area should not affect how long the ions stay once there (i.e. their SurvivalProbability). A minimal sketch of the new behaviour follows. Please let us know if we missed anything.
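The following is a minimal, self-contained sketch of the idea, not the MDAnalysis implementation (toy data structures; the intersection over the whole window assumes the continuous-presence definition of survival):

```python
import numpy as np

def sp_for_tau(ids_per_frame, tau):
    """Toy survival probability for a single tau, averaged over reference frames."""
    values = []
    for t0 in range(len(ids_per_frame) - tau):
        n_t0 = len(ids_per_frame[t0])
        if n_t0 == 0:
            # New behaviour: an empty reference frame measures nothing, so it
            # is skipped instead of dragging the average towards zero (the old
            # code effectively diluted each tau in this case).
            continue
        # Molecules present continuously from t0 to t0 + tau.
        survivors = set.intersection(*ids_per_frame[t0:t0 + tau + 1])
        values.append(len(survivors) / n_t0)
    # If no reference frame contained any molecules at all, report NaN, not 0.
    return np.mean(values) if values else np.nan

# Ions enter the region only at frames 2-4; earlier empty frames are ignored.
frames = [set(), set(), {1, 2}, {1, 2}, {1}, set()]
print(sp_for_tau(frames, tau=1))            # 0.5
print(sp_for_tau([set(), set()], tau=1))    # nan: nothing to measure
```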

Questions:

  • Since this algorithm is applicable to any mobile atoms in a system, we suggest that this analysis be moved out of waterdynamics into a more general module.

Future considerations:

  • Adding a progress bar
  • Adding dt to allow sampling the dataset with a larger shift between reference frames, which is faster (and possibly discussing data correlations for larger taus)

Edit: Acknowledgement - This work was done together with @p-j-smith, but we submitted most of it on my account.

PR Checklist

  • [x] Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

bieniekmateusz and others added 22 commits June 21, 2018 16:49
Atom selections at each frame are now stored as sets rather than lists,
so determining the intersection of selections is faster.
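As an aside, a tiny self-contained illustration of why this helps (toy id lists, not the project's code):

```python
# Overlap of two id collections: list membership scans are O(n*m),
# while set intersection is roughly O(n + m).
ids_t0 = list(range(0, 10000, 2))   # ids selected at the reference frame
ids_t1 = list(range(0, 10000, 3))   # ids selected tau frames later

slow = [i for i in ids_t0 if i in ids_t1]   # repeated linear scans
fast = set(ids_t0) & set(ids_t1)            # hash-based intersection
assert set(slow) == fast
```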

Survival probability is now calculated by a single function.
The survival probability is a measure of how likely a selected molecule
is to remain within a specified region over a period of time. If at t0
the selected molecule is not in the specified region, no survival
probability can be calculated.

For example, consider that we would like to calculate the survival
probability of an ion at an interface. If the ion never approaches the
interface, the previous test would return a survival probability of 0,
which suggests that the ion comes to the interface but leaves
immediately. However, as the ion was never at the interface, no
survival probability can be calculated from this data.
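For reference, the quantity being described can be written as follows (a standard form; whether N(t, t+τ) requires continuous presence over the whole window or only presence at t and t+τ depends on the chosen definition):

```latex
SP(\tau) \;=\; \left\langle \frac{N(t,\, t+\tau)}{N(t)} \right\rangle_{t\,:\,N(t) > 0}
```

where N(t) is the number of selected molecules in the region at frame t, N(t, t+τ) is the number of those still in the region τ frames later, and the average runs only over reference frames with N(t) > 0.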
-return the list rather than requiring the user to access the internal list
-use the tau_max variable name, which is consistent with the equation, in contrast to dtmax
… and more times over tau. This leads to faster code (local memory/cache optimisation). Additionally, it opens the door to combining multiple taus together.
…Or should the first data point reflect the first non-zero tau, so that the first change is quantified? I believe it is the latter. This breaks the current tests, which, for example, when asking for 4 data points, always get a first data point of 1 at tau = 0.
code and no performance drawbacks.

Tests: the user should not be forced to rely on the internal data
structure .timeseries. The results are now returned by the run() function.
numpy arrays.

Changed the name of tau_timeseries to sp_timeseries as it is a more
accurate descriptor.

Removed some of the +/- 1 in indices to improve readability.
Saving the extracted data for the user to be able to dig deeper, if
necessary.
an atom group, to reduce the memory load.

Updated the docstring.
Waterdynamics: removed an index bug where tau went 1 too high.
@richardjgowers
Member

@bieniekmateusz thanks, this looks like it will make some good changes.

WRT "faster shift for dt", are you talking about doing every 2nd frame rather than every frame?

@bieniekmateusz
Member Author

@richardjgowers Thanks. Also, I still need to add your test case for the t0 case, which I hope to include in the next version along with the progress meter.

Your question: Yes, every nth frame. We are considering letting the user specify the jump over the trajectory. However, we want to first check how, for example, tau = 20 is affected by being calculated for overlapping datasets (t=0-20, t=1-21, ...). We are not sure how (and if) that affects the SP yet. Any thoughts on it are welcome.
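A hedged sketch of what sampling every nth reference frame could look like (the stride/dt parameter is only a proposal at this point; toy data and illustrative names):

```python
# Toy data: which ids are inside the region at each frame.
ids_per_frame = [set(range(i % 7)) for i in range(100)]
tau, stride = 5, 3

samples = []
for t0 in range(0, len(ids_per_frame) - tau, stride):   # shift t0 by `stride`
    n0 = len(ids_per_frame[t0])
    if n0 == 0:
        continue            # empty reference frame: nothing to measure
    kept = ids_per_frame[t0] & ids_per_frame[t0 + tau]
    samples.append(len(kept) / n0)

# A larger stride reduces the overlap (and hence the correlation) between
# consecutive windows, at the cost of fewer samples per tau.
print(sum(samples) / len(samples))
```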

@richardjgowers
Member

For the case where no molecules are found, it makes sense to disregard this run as you can't measure anything. It's not that the molecule had 0 survival probability.

So for hydrogen bond autocorrelation I let people define a total duration (tau_max here) and a number of frames within that duration to consider. I'm not completely happy with that solution either, so it would be good to define a clear & comprehensive way/language to define this sort of stuff.

I need to read through this module fully but..

Reducing the resolution of frames should be fine in some cases. We can actually check after the calculation whether the calculated tau makes sense with respect to how we sampled it. Then issue some sort of warning if the data is probably junk (and suggest better settings).

Overlapping frames should be fine. The concentration in an area will remain constant, so starting a new sample in the same window will identify different "starting molecules". It will be slightly correlated with the old sample, though (so it isn't 100% new data).

@codecov

codecov bot commented Jul 18, 2018

Codecov Report

Merging #1995 into develop will decrease coverage by 0.08%.
The diff coverage is 57.77%.


@@             Coverage Diff             @@
##           develop    #1995      +/-   ##
===========================================
- Coverage    88.93%   88.84%   -0.09%     
===========================================
  Files          144      144              
  Lines        17490    17497       +7     
  Branches      2693     2702       +9     
===========================================
- Hits         15554    15545       -9     
- Misses        1323     1332       +9     
- Partials       613      620       +7
Impacted Files                                  Coverage Δ
package/MDAnalysis/analysis/waterdynamics.py    80.94% <57.77%> (-3.51%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bieniekmateusz
Member Author

I've updated the last test case (t0) to avoid arbitrary datasets and to test tf in the same test case.

@richardjgowers richardjgowers self-assigned this Jul 19, 2018
@bieniekmateusz
Member Author

I did the rebase, which was not too much trouble in the end. Cheers

@bieniekmateusz
Member Author

Error in testsuite/MDAnalysisTests/topology/test_pqr.py:93:

Could it be that the rebase introduced this issue, since I have not touched the topology files? Do you have any suggestions on what I should do to correct the tests? Thanks

@bieniekmateusz
Member Author

I've updated the example to match the data being stored in the object's fields.

Member

@richardjgowers richardjgowers left a comment


LGTM, thanks @bieniekmateusz

@bieniekmateusz
Member Author

and @p-j-smith!

We're happy to hear that. And thanks to you too!
