-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
West tools kinetics rewrite #35
Conversation
averaging together multiple weighted ensemble simulations.
… tool built by Lewis Baker.
… (for the grouping function).
… from the west.cfg file.
Relies on new WESTTool type, 'WESTMultiTool', as a base which handles the file loading and argument parsing necessary.
Bringing this up to date.
Forgot to pull in up to date master.
w_postanalysis_reweight.py
…or some preliminary tests.
… This is needed to work with w_multi_reweight.py.
…obably broken right now; I will update it again soon.
…well as ability to automatically convert from state indices to bin indices (for denoting bins subject to recycling).
…ns every timepoint or every iteration. Also commented out check for target states in w_postanalysis_matrix, as I think this tool produces valid results for simulations with target states; this check could perhaps be moved to w_postanalysis_reweight in the future.
Fluxanl is fixed, multitool stuff has been purged, all references to w_ipython are gone, /lib/west_tools/westpa/_rc.py has been reverted to the original version (changes committed to another branch, now). We're getting close, here. The bootstrap is now optional on all dataset evolutions (flag --disable-bootstrap or -db), which significantly speeds up calculation of the mean if that's all you're interested in. This is explicitly supported in mclib, which might seem odd except that it's already set up to run estimators and return datasets using the block options (and the datasets are properly shaped, at that). @mczwier, I know one of your original concerns was whether the mclib library retained the functionality to 'just work' on an arbitrary dataset. If you have any pre-existing code, could you check whether it works as expected, currently? TO DO, STILL:
|
…onger have bootstrapping option flags.
FIXED/FINISHED:
TODO: |
COMPLETED:
CONSISTENT COMMANDS: It didn't make much sense to name the subcommands after their old tool names (this is, after all, why we're leaving in placeholders to the old commands). Therefore, each tool now has the following commands
INIT: KINETICS: PROBS: AVERAGES: ALL: In addition, "kintool" has been moved and renamed. It's now in the westtools folder. It's not a real 'general purpose' subclass of the various WESTTools classes, but it certainly seems to belong there. At this time, it's probably appropriate to do a full code review. |
… from old toolkit model to new one.
… current; changed references in w_direct and westpa/__init__ accordingly.
…both w_direct and w_reweight.
Ploterr has been updated to work with both tools, and to also plot steady-state populations (or 'color populations' or ensemble populations, or whatever). The subcommands for ploterr are similar to w_direct and w_reweight: {d/rw}.probs for state and ensemble probabilities, and {d/rw}.kinetics for plotting fluxes and rates. The help text on w_direct has been fixed to accurately inform the end user of what the subcommands do (there were just copy/pasted defaults in there, before). w_fluxanl has been updated to also support optional bootstrap/correlation analysis on evolution datasets. While w_fluxanl has always been pretty fast, it's still the case that we don't always want to spend the time generating calculating error intervals; sometimes, a very rough estimate is all we require. As with w_direct and w_reweight, the flags -db (--disable-bootstrap) and -dc (--disable-correlation) are available. There are now stubs such that w_postanalysis_reweight and w_postanalysis_matrix can be called, like before, so scripts/workflows that relied on these tools should not break. Critically, I cannot think of anything else that needs to be completed before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, Adam. Thank you for your hard work. I'm not too concerned about mclib breaking; I use it, but not often enough to be annoyed at having to adapt my code somewhat for the more flexible backend. The only thing I can think of (though it is super annoying) is to update the copyright date on the files you've modified or written. (i.e. (c) 2013 – 2017).
The copyright date has been updated on files I've modified (and the appropriate headers added to new files). The only one I haven't touched is w_assign, as the changes there aren't real (there's no functional difference), but it WILL be modified in a separate branch/pull request coming up soon that I'd prefer to not have to git merge (again), so that'll happen soon enough. |
West tools kinetics rewrite merged. Former-commit-id: b07237e
It's time for a code review! This branch includes all the previously discussed changes (see here: westpa/west_tools#17) to the kinetics code. Let's start with a brief summary of what's been changed, what's been added, and what's been marked for future deprecation, followed by a TODO (either pre or post merge, depending on what we would like).
SUMMARY:
Introduces changes to mclib code and both types of kinetics analysis (w_kinavg,w_kinetics,w_postanalysis_matrix/reweight), including changes and new versions of the estimator functions. Both types of analysis now produce the same datasets, and all calculate error through monte carlo block bootstrapping.
NEW TOOLS:
w_direct: w_kinetics, w_stateprobs, and w_kinavg have all been combined into this tool.
w_reweight: w_postanalysis_matrix and w_postanalysis_reweight are combined here.
Both tools use subcommands to perform the analysis, and include some 'helper' commands to easily run everything.
w_direct/reweight: all, average, stateprobs
w_direct: kinavg, kinetics
w_reweight: matrix, rate
The flags you would use for each of the old tools should still work with the new ones and their respective subcommands. By default, each tool writes each analysis dataset to one file: direct.h5, and reweight.h5. You can choose to write the output of each dataset to a different file, as well, to be compatible with how things were run before.
The old w_kinavg error calling routines have been abstracted away into an 'AverageCommands' class, which exists in westpa/kintool (place in with other tools?). Each tool subclasses this, overrides the go function, and has a unique function that performs most or all of the actual analysis (typically named for its old tool). This makes generating the futures for the work manager, printing responsible output, etc, all trivial and uniform across tools. Each tool still needs to build its own block to submit (ala how it was done before) to the mclib, depending on how complex the estimator is (this could probably be simplified).
The code that once was w_postanalysis_matrix and w_kinetics now exists largely in westpa/reweight and westpa/kinetics, and is largely unchanged.
All observable calculations are done through their cython estimators; there is no non-cython code for the reweighting procedure or the rate calculation procedure. The reweighting estimator, reweight_for_c, is a cython implementation of the exact algorithm that was once in w_postanalysis_matrix.
RATE AND ERROR CALCULATIONS:
The rate estimator function for w_direct is different from w_kinavg, as it now uses the ratio of sums vs. the sum of the ratios (see the presentation file from the WESTPA meeting from January 27th, 2017). This substantially increases the complexity of the estimator function used in the bootstrap, as numpy.mean can no longer be used (the estimator is now the rate constant equation itself). In addition, reweight_for_c, the reweighting estimator, is very complex and requires a large number of datasets.
To accommodate the necessities of calculating error through reweighting and the new direct trace code, the way that programs interface with the mclib functions is different. It now accepts two dictionaries: a dictionary of datasets that are able to be decimated for the blocking procedure (based on a correlation analysis), and a dictionary of kwargs for the estimator function. To account for correlation, the estimator is run without bootstrapping within mclib to calculate the observable over the time period indicated, and the correlation analysis/bootstrapping proceeds from there. In the event that a non dictionary is sent in as a dataset, the code should detect this and handle this appropriately. The block length determined from correlation is typically sent on to the estimator, as the data from correlated blocks may need to be handled in a very complex manner.
As an example, the dataset dictionary for reweight_for_c is only the iterations submitted, whereas the estimator kwargs are the stride length, all the flux events, the number of observations, the type of observable required (as reweighting can generate all of them), etc. For 3 bins, this is slightly slower than the direct trace.
FOR FUTURE DEPRECATION:
w_stateprobs, w_kinetics, and w_kinavg can be called as they once were, but send in a deprecation warning. They accept the same flags, and work mostly in the same way (with some minor differences).
TODO:
Add flag to disable bootstrapping if error is not required.
w_postanalysis_matrix and w_postanalysis_reweight need to be rewritten to just call the new code and generate a deprecation warning.
The routines have been tested numerous times on numerous datasets, and return the correct answers. Please test the code and try and break it. In addition, please comment on any areas where what's going on isn't clear or could be better documented, as this rewrite has resulted in a substantial amount of code changes.