Create framework for extracting, verifying and updating model reproducibility and performance #83
Existing tools
|
UM rose stem testing checks for bitwise reproducibility by comparing output files to Known Good Output (KGO) files on disk (e.g. in Also compares Helmholtz solver statistics from model log files to those in KGO file.
These are effectively a global checksum and even with the limited precision any differences in a run show up in a few hours. Values could be stored in a DB somewhere rather than extracted from the KGO.

There are also comparisons of execution time by extracting run time from model logfiles.

JULES rose stem uses bitwise comparison of netCDF KGO files with nccmp.

LFRic rose stem uses a simple checksum (global sum of X2) of selected fields calculated during the run. |
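As a rough illustration of the checksum style of test described above, here is a minimal Python sketch. It assumes "X2" means the square of each field value, and the function names and the idea of storing checksums as hex strings are hypothetical choices for this example, not taken from the LFRic or UM rose stem code.

```python
import math


def field_checksum(values):
    """Global sum of squares of a field, returned as a hex string.

    float.hex() preserves every bit of the double, so the stored value
    can be compared exactly without decimal round-trip loss.
    """
    return float.hex(math.fsum(v * v for v in values))


def compare_to_kgo(fields, kgo):
    """Return the sorted names of fields whose checksum differs from the KGO.

    `fields` maps field name -> iterable of values (hypothetical layout);
    `kgo` maps field name -> stored checksum string.
    """
    return sorted(
        name
        for name, values in fields.items()
        if kgo.get(name) != field_checksum(values)
    )
```

Storing the checksums as small text values like this is what would make it practical to keep them in a database or repository rather than extracting them from full KGO files.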
Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.

As for extracting performance information, it is likely this would be limited to runs of sufficient length to give useful data. Do the UM, JULES/CABLE and LFRic have internal timing infrastructure that can be configured? For example, MOM (via FMS) has user-configurable clocks for recording timings with varying levels of granularity. In this way MOM5 (and other FMS models) can give useful timing information from short runs by isolating timings from different sections of the program. |
The UM has internal timers which give inclusive timing at a high level, e.g. radiation. See the end of It also has an interface to the DrHook library from ECMWF which gives subroutine-level timing via start and end calls in each routine. This can have a significant overhead and isn't used routinely. See With both of these, I've sometimes added timers around small blocks of code when optimising.

JULES has the DrHook interface but no separate timers. In UM timer output, JULES is included in the boundary layer timing. LFRic has something similar to the UM timers. |
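To make the idea of inclusive timers around blocks of code concrete, here is a small Python sketch. It is only an analogy: the UM timers and DrHook are Fortran interfaces, and the `Timers` class and `section` name below are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Timers:
    """Accumulate inclusive wall-clock time per named section."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent in each section
        self.calls = defaultdict(int)     # number of times each section ran

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Inclusive: time spent in nested sections is also counted here.
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1
```

Wrapping a block in `with timers.section("radiation"):` records its time, and any section nested inside it is counted in both totals, which matches the "inclusive" behaviour described above.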
Of course there are also the tools Nic Hannah developed for testing ACCESS-OM2 https://github.com/COSIMA/access-om2/tree/master/test and MOM5 https://github.com/mom-ocean/MOM5/tree/master/test

The MOM5 tests are run regularly on Jenkins, first https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_run/activity which runs

    module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
    module load pbs && \
    cd ${WORKSPACE}/test && \
    nosetests --with-xunit -s test_run_setup.py && \
    qsub qsub_tests.sh && \
    nosetests --with-xunit -s test_run_outputs.py

then repro

    module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
    cd ${WORKSPACE}/test && \
    nosetests -s --with-xunit test_bit_reproducibility.py

Similarly for ACCESS-OM2. The reproducibility test is run every week

    module use /g/data/hh5/public/modules && module load conda/analysis3-unstable && python -m pytest -s test/test_bit_reproducibility.py

Some other jobs have been set up to allow testing PRs, by hand-editing the configuration and setting off the job https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_PR/activity |
I've started trying to think about what this framework might look like. I've included some thoughts below, including a first draft of a potential design. I expect there are issues, but hopefully this will at least help to start some discussion.

General requirements/constraints

Reproducibility testing scope
→ Run reasonably rarely (e.g. only on test model configurations, not all model runs), with KGOs updated even more rarely

Performance testing scope
→ Some use cases run frequently (e.g. after every model run), keep track of and compare performance stats through time

Framework design (to get discussion started)

The simplest design is one that performs tests on output from models that have already been built and run. All the framework includes is a set of classes (one per model) that specify where/how to parse model-specific output files, and a test suite for comparing outputs to a database of benchmarks (KGOs, baseline performance metrics). This test suite could be run as a “postprocess” step using the preferred run tool for that model. This means tests are triggered by running model test configurations.

Thoughts
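The "one parser class per model plus a shared comparison" design could be sketched roughly as follows. Everything here is an assumption made for illustration: the class names, the `CHECKSUM <field> <value>` log format, and the benchmark being a plain dictionary are hypothetical, not an existing interface.

```python
import re
from abc import ABC, abstractmethod


class ModelOutputParser(ABC):
    """Base class: each model supplies its own way to read its output."""

    @abstractmethod
    def parse_checksums(self, log_text):
        """Return a dict mapping field name -> checksum string."""


class ExampleModelParser(ModelOutputParser):
    # Hypothetical log line format: "CHECKSUM <field> <value>"
    PATTERN = re.compile(r"^CHECKSUM[ \t]+(\S+)[ \t]+(\S+)$", re.MULTILINE)

    def parse_checksums(self, log_text):
        return dict(self.PATTERN.findall(log_text))


def check_reproducibility(parser, log_text, benchmark):
    """Compare parsed checksums with a benchmark.

    Returns {field: (expected, found)} for every field that differs,
    including fields present on only one side (reported as None).
    """
    found = parser.parse_checksums(log_text)
    return {
        name: (benchmark.get(name), found.get(name))
        for name in sorted(set(benchmark) | set(found))
        if benchmark.get(name) != found.get(name)
    }
```

An empty result would mean the run reproduces the benchmark, so a postprocess step could simply fail when the returned dictionary is non-empty.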
|
The tool for extracting repro hashes (or timing info) will always run. So one option to detect changes is to use

Using GitHub as a repro and performance database has a lot going for it, not least that others could use the same tools and store their own repro and performance data there. GitHub topics could be used to make model performance data discoverable, even from different institutions.

Picking on a few points
As somewhere to store the repos containing the reproducibility and performance data?
Yes! I definitely endorse the idea of adding a schema version as metadata so that it can be queried and handled gracefully, e.g. if the format is updated and new fields are added, older versions can be read in and written back out with the new fields added where appropriate. |
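A minimal sketch of how a schema version might be handled when reading older records is below. The field names (`schema_version`, `timings`) and the version numbers are hypothetical, chosen only to show the read-old, write-new pattern.

```python
CURRENT_SCHEMA = 2


def upgrade_record(record):
    """Return a copy of `record` upgraded to CURRENT_SCHEMA.

    Records without a `schema_version` key are treated as version 1
    (an assumption for this sketch).
    """
    record = dict(record)
    version = record.get("schema_version", 1)
    if version > CURRENT_SCHEMA:
        raise ValueError(f"unsupported schema version {version}")
    if version == 1:
        # Hypothetical change: version 2 introduced a "timings" field.
        record.setdefault("timings", {})
        version = 2
    record["schema_version"] = version
    return record
```

Each future format change would add one more `if version == N:` step, so any older record can be brought forward in sequence.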
I'm not sure I understand exactly what you mean by this, but I like the idea of using git to check/update changes, at least as a first pass.
Those and the test configurations (e.g. Payu configs, though I'm not sure how this would work for the rose/cylc stuff) |
When I say the tool for extracting the info will always run: it is a very cheap process, so I'd imagine it always being done, whether or not the reproducibility status is actually being checked. In a way this is passive reproducibility/provenance. The data is always generated and sometimes it is actively checked, but it would always be in the git repo, say, so it could be queried at a later date if required, e.g. forensic analysis to check how and when answers changed. |
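One way to picture this "always generate, sometimes check" approach is a manifest file that is rewritten after every run in a deterministic form, so that a version control diff shows exactly when answers changed. The sketch below is an assumption about how that could look; the function name, the JSON layout, and the status strings are hypothetical.

```python
import json
from pathlib import Path


def write_manifest(path, checksums):
    """Write checksums as sorted JSON and report what happened.

    Returns "new" if no manifest existed, "unchanged" if the content is
    identical to the previous manifest, otherwise "changed".
    """
    path = Path(path)
    # sort_keys makes the text independent of dictionary ordering,
    # so identical results always produce an identical file.
    new_text = json.dumps(checksums, indent=2, sort_keys=True) + "\n"
    if not path.exists():
        status = "new"
    elif path.read_text() == new_text:
        status = "unchanged"
    else:
        status = "changed"
    path.write_text(new_text)
    return status
```

If the manifest lives in a git repository, committing it after each run would give the history needed for the forensic analysis mentioned above, whether or not an active check is run.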
FYI, I'm starting to flesh out something here: |
After having a chat with @aidanheerdegen regarding some scaling tests I've been running with MOM6, I realized you might be interested in some experience I had with a project I worked on in my previous life. The project aimed at extracting all the possible information from calculations performed with a wide range of codes (>50), storing all that information in a database and developing some tools that make use of that data (e.g., machine learning, data mining, etc). (If you're curious, you can have a look at it here and the corresponding code here.) Happy to share some thoughts on how to best structure the code base, how to define meta-data specifications, writing model parsers, etc. |
Thanks @micaeljtoliveira - this looks like a really interesting (and big) project! Your thoughts and experience would be really valuable here. Maybe easiest to start with a chat in person and go from there? I'll be in Canberra next week. I'm also interested to hear about what you've been doing with MOM6, as performance testing is bundled in with what we're trying to achieve here. |
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/porting-csiro-umui-access-esm1-5-ksh-run-script-to-payu/1611/3 |
I tried running the current
The
|
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/1 |
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/3 |
@aidanheerdegen Has ACCESS-NRI decided on how to use |
No decision has been made. I need to read some documentation about |
@MartinDix uses |
The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted. |
That said, if we intend to contribute code changes upstream, we would probably also need corresponding |
As a model component developer I want a tool to verify model reproducibility.
As a model developer I want the same tool to work for all the components of the model.
As a release team member I want to be able to use the same tool when developing CI tests for a number of different models.
As a user of models I want to be confident that model updates will not change the answers of my experiments, unless this has been specifically documented.