Create framework for extracting, verifying and updating model reproducibility and performance #83

aidanheerdegen · 2022-08-12T06:27:46Z

As a model component developer I want a tool to verify model reproducibility. As a model developer I want the same tool to work for all the components of the model.

As a release team member I want to be able to use the same tool when developing CI tests for a number of different models.

As a user of models I want to be confident that model updates will not change the answers of my experiments, unless this has been specifically documented.

aidanheerdegen · 2022-09-14T04:04:07Z

Existing tools

Specific tools developed by Paul Leopardi to extract performance stats for scaling testing
https://github.com/penguian/performance-analysis
Andrew Kiss' tool for extracting information from ACCESS-OM2 runs to create a run summary
https://github.com/aekiss/run_summary
Marshall Ward's tools for performance monitoring of MOM6
https://github.com/NOAA-GFDL/MOM6/tree/dev/gfdl/.testing/tools

MartinDix · 2022-09-14T22:31:41Z

UM rose stem testing checks for bitwise reproducibility by comparing output files to Known Good Output (KGO) files on disk (e.g. in /g/data/access/KGO/standard_jobs) using mule-cumf (compares UM fieldsfiles).

Also compares Helmholtz solver statistics from model log files to those in KGO file.

***************************************************************
*    Linear solve for Helmholtz problem                       *
* Outer Inner Iterations   InitialError       FinalError      *
*    2     1        8      0.122268E+00      0.991602E-04     *
*    2     2        2      0.123264E-02      0.985385E-04     *
***************************************************************

These are effectively a global checksum and even with the limited precision any differences in a run show up in a few hours. Values could be stored in a DB somewhere rather than extracted from the KGO.
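As a rough illustration of treating the solver statistics as a checksum, the block above could be parsed from a log file and compared with the values from a KGO log. This is only a sketch: the function names are hypothetical, and the regular expression assumes the row layout shown in the sample above.

```python
import re

# Matches data rows such as
# "*    2     1        8      0.122268E+00      0.991602E-04     *"
# (layout assumed from the sample output above).
ROW = re.compile(
    r"^\*\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?[0-9.]+E[+-]\d+)\s+(-?[0-9.]+E[+-]\d+)\s+\*$"
)


def parse_solver_stats(log_text):
    """Return a list of (outer, inner, iterations, initial_error, final_error).

    The error values are kept as strings so the comparison is made on the
    exact printed digits rather than on parsed floats.
    """
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            outer, inner, iterations, initial, final = match.groups()
            rows.append((int(outer), int(inner), int(iterations), initial, final))
    return rows


def solver_stats_match(test_log, kgo_log):
    """True if both logs contain identical solver statistics."""
    return parse_solver_stats(test_log) == parse_solver_stats(kgo_log)


SAMPLE = """\
***************************************************************
*    Linear solve for Helmholtz problem                       *
* Outer Inner Iterations   InitialError       FinalError      *
*    2     1        8      0.122268E+00      0.991602E-04     *
*    2     2        2      0.123264E-02      0.985385E-04     *
***************************************************************
"""
```

The parsed rows are plain tuples, so they could equally be stored in a database and compared there instead of being re-extracted from the KGO each time.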

There are also comparisons of execution time by extracting run time from model logfiles
PE 0 Elapsed Wallclock Time: 51.48
but in practice elapsed time for short test runs on gadi is too variable for this to be much use.
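A minimal sketch of that extraction, assuming the log line has exactly the form shown above. The function names and the 20% relative tolerance are placeholders chosen for illustration, not values used by rose stem.

```python
import re

# Pattern assumed from the single example line quoted above.
WALLCLOCK = re.compile(r"PE\s+0\s+Elapsed Wallclock Time:\s*([0-9]+(?:\.[0-9]+)?)")


def elapsed_wallclock(log_text):
    """Return the PE 0 elapsed wallclock time as a float, or None if absent."""
    match = WALLCLOCK.search(log_text)
    return float(match.group(1)) if match else None


def within_tolerance(test_time, baseline_time, rel_tol=0.2):
    """True if test_time is within rel_tol (a fraction) of baseline_time.

    rel_tol=0.2 is an arbitrary placeholder; short runs may need a much
    looser bound, or may not be worth comparing at all.
    """
    return abs(test_time - baseline_time) <= rel_tol * baseline_time
```

Given the run-to-run variability noted above, a check like this is probably only meaningful for longer runs.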

JULES rose stem uses bitwise comparison of netCDF KGO files with nccmp.

LFRic rose stem uses a simple checksum (global sum of X²) of selected fields calculated during the run.
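For reference, a sum-of-squares checksum of this kind can be sketched in a few lines. This is not the LFRic implementation (which runs inside the model); the function names and the dictionary layout are assumptions made for illustration.

```python
import math


def sum_of_squares_checksum(values):
    """Global sum of X**2 over all values of a field."""
    # math.fsum makes this sketch independent of summation order;
    # a checksum computed inside a model may not have that property.
    return math.fsum(x * x for x in values)


def changed_fields(test_fields, kgo_checksums):
    """Return the names of fields whose checksum differs from the stored one.

    test_fields: dict of field name -> iterable of values
    kgo_checksums: dict of field name -> previously stored checksum
    """
    return sorted(
        name
        for name, values in test_fields.items()
        if sum_of_squares_checksum(values) != kgo_checksums.get(name)
    )
```

A single number per field is cheap to store and compare, at the cost of not saying where in the field a difference arose.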

aidanheerdegen · 2022-09-15T01:22:54Z

Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.

As for extracting performance information, it is likely this would be limited to runs of sufficient length to give useful data.

Do the UM, JULES/CABLE and LFRic have internal timing infrastructure that can be configured? For example, MOM (via FMS) has user-configurable clocks for recording timings with varying levels of granularity. In this way MOM5 (and other FMS models) can give useful timing information from short runs by isolating timings from different sections of the program.

MartinDix · 2022-09-15T01:56:50Z

The UM has internal timers which give inclusive timing at a high level, e.g. radiation. See the end of /g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_safe_n48_ga_amip_exp_2day/vn13.0/pe_output/atmos.fort6.pe0. Essentially the same as the MOM timers.

It also has an interface to the DrHook library from ECMWF which gives subroutine level timing via start and end calls in each routine. This can have a significant overhead and isn't used routinely. See /g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_drhook_safe_n48/vn13.0/drhook.prof.1.

With both of these, I've sometimes added timers around small blocks of code when optimising.

JULES has the DrHook interface but no separate timers. In UM timer output JULES will be included in the boundary layer.

LFRic has something similar to the UM timers.

aidanheerdegen · 2022-09-16T03:51:12Z

Of course there are also the tools Nic Hannah developed for testing ACCESS-OM2

https://github.com/COSIMA/access-om2/tree/master/test

and MOM5

https://github.com/mom-ocean/MOM5/tree/master/test

The MOM5 tests are run regularly on Jenkins, first MOM5_run

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_run/activity

which runs

module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
module load pbs && \
cd ${WORKSPACE}/test && \
nosetests --with-xunit -s test_run_setup.py && \
qsub qsub_tests.sh && \
nosetests --with-xunit -s test_run_outputs.py

then repro MOM5_bit_reproducibility

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_bit_reproducibility/activity

module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
cd ${WORKSPACE}/test && \
nosetests -s --with-xunit test_bit_reproducibility.py

Similarly for ACCESS-OM2

The reproducibility test is run every week

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/ACCESS-OM2%2Freproducibility/activity

module use /g/data/hh5/public/modules && module load conda/analysis3-unstable && python -m pytest -s test/test_bit_reproducibility.py

Some other jobs have been set up to allow testing PRs, by hand editing the configuration and setting off the job

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/ACCESS-OM2%2Freproducibility_pull_request/activity

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_PR/activity

dougiesquire · 2022-12-09T04:36:39Z

I've started trying to think about what this framework might look like. I've included some thoughts below, including a first draft at a potential design. I expect there're issues, but hopefully this will at least be helpful to form some discussion around.

General requirements/constraints

Readily applied to different models managed with different run tools (payu, rose/cylc). The MOM6, MOM5 and ACCESS-OM2 test frameworks above all either rely on, or include, the ability to build and run the model as part of their tests - this is difficult to generalise across models and run tools
Deployable in different environments (Gadi, CI)
Able to help diagnose cause of test failures (e.g. UM reproducibility with intel-compiler/2021.7.0 #84)

Reproducibility testing scope

Has my new build of this model affected model output? Compare the output of a test run with Known Good Output (KGO).

→ Run reasonably rarely (e.g. only on test model configurations, not all model runs), with KGOs updated even more rarely

Performance testing scope

Has my change to this model (or the way that it's run) affected model performance?
Assess scaling of a model
Track model performance through time
Identify bottlenecks

→ Some use cases run frequently (e.g. after every model run), keep track of and compare performance stats through time

Framework design (to get discussion started)

The simplest design is one that performs tests on output from models that have already been built and run. All the framework includes is a set of classes (one per model) that specify where/how to parse model-specific output files, and a test suite for comparing outputs to a database of benchmarks (KGOs, baseline performance metrics). This test suite could be run as a “postprocess” step using the preferred run tool for that model. This means tests are triggered by running model test configurations.
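To make the idea concrete, here is a very rough sketch of the "one class per model" layout. All names are hypothetical, and the toy parser reads a made-up `stats.log` file of `name = value` lines rather than any real model output.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ModelOutputParser(ABC):
    """Knows where and how to read one model's output (hypothetical interface)."""

    @abstractmethod
    def extract(self, run_dir):
        """Return a dict of metric name -> value parsed from run_dir."""


class KeyValueLogParser(ModelOutputParser):
    """Toy parser: reads 'name = value' lines from a hypothetical stats.log."""

    filename = "stats.log"

    def extract(self, run_dir):
        metrics = {}
        for line in (Path(run_dir) / self.filename).read_text().splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                metrics[key.strip()] = value.strip()
        return metrics


def compare_to_benchmark(metrics, benchmark):
    """Return {name: (benchmark_value, run_value)} for every differing metric.

    A metric missing on either side is reported with None on that side.
    """
    names = sorted(set(metrics) | set(benchmark))
    return {
        name: (benchmark.get(name), metrics.get(name))
        for name in names
        if metrics.get(name) != benchmark.get(name)
    }
```

The comparison function knows nothing about any particular model, so the same test suite could be pointed at whichever parser matches the run being checked.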

Thoughts

It could be useful to separate the reproducibility and performance testing frameworks, since their scopes are a little different. Running performance tests could simply write a YAML file into the work directory containing PBS info (which presumably will be available for all models) plus additional information. The test suite could compare to a previous run or a baseline run. Additional tools could plot performance through time, scalability etc.
Need to allow for schemas of KGOs/performance metrics to change through time.
Do we need a github organisation for ACCESS testing?
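The performance file mentioned in the thoughts above might be produced along these lines. This is a sketch under assumptions: the field names and the `performance.json` filename are invented, only the `PBS_JOBID` environment variable is read, and JSON is used to stay within the standard library (a YAML writer such as PyYAML's `yaml.safe_dump` could be swapped in).

```python
import json
import os
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical version tag for the record layout


def build_performance_record(walltime, env=None, extra=None):
    """Assemble a performance record from PBS info plus additional values."""
    env = os.environ if env is None else env
    record = {
        "schema_version": SCHEMA_VERSION,
        "pbs": {"job_id": env.get("PBS_JOBID")},  # None when not run under PBS
        "walltime": walltime,
    }
    record.update(extra or {})
    return record


def write_performance_record(record, work_dir, filename="performance.json"):
    """Write the record into work_dir and return the path written."""
    path = Path(work_dir) / filename
    path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return path
```

Sorting the keys keeps the file stable between runs, which makes line-based diffs of successive records easier to read.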

aidanheerdegen · 2022-12-14T05:38:03Z

The tool for extracting out repro hashes (or timing info) will always run.

So one option to detect changes is to use git: if directed to the same output in your test data repo then git can tell you if they've changed. Equally all that is required to update them is git commit && git push. I'm not saying this is the best option, but it is one. A downside is that it isn't explicitly checking the semantic contents of the files, just that they have changes at all. So perhaps that isn't suitable.
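If a semantic check were wanted on top of git, one possibility is to load the old and new files as name-to-value mappings and report which entries differ. A minimal sketch, with a hypothetical function name and the mapping format assumed:

```python
def diff_manifests(old, new):
    """Compare two name -> value mappings (e.g. output name -> repro hash).

    Returns the names that were added, removed, or whose value changed.
    """
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }
```

Unlike a plain file diff, this ignores key ordering and formatting and says which entries changed, not just that the file did.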

Using GitHub as a repro and performance data database has a lot going for it, not the least that others could use the same tools and store their own repro and performance data there. GitHub topics could be used to make model performance data discoverable, even from different institutions.

Picking on a few points

Do we need a github organisation for ACCESS testing?

As somewhere to store the repos containing the reproducibility and performance data?

Need to allow for schemas of KGOs/performance metrics to change through time.

Yes! I definitely endorse the idea of adding a schema version as meta-data so that this can be queried and gracefully handled. e.g. if the format is updated and new fields added older versions can be read in and written back out with new fields added where appropriate.
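One way this could work is a chain of small upgrade functions keyed on the stored version. The sketch below is illustrative only: the version numbers and the `compiler` field are invented, and records without a version tag are assumed to be version 1.

```python
CURRENT_SCHEMA_VERSION = 2  # hypothetical


def _v1_to_v2(record):
    """Upgrade a version 1 record; 'compiler' is an invented new field."""
    upgraded = dict(record)
    upgraded.setdefault("compiler", None)  # unknown for older records
    upgraded["schema_version"] = 2
    return upgraded


# Maps a schema version to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def upgrade(record):
    """Return a copy of record upgraded to CURRENT_SCHEMA_VERSION."""
    record = dict(record)
    version = record.get("schema_version", 1)  # untagged records assumed v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"record schema {version} is newer than supported")
    while version < CURRENT_SCHEMA_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record
```

Each format change then only needs one new function and one new entry in the table, and old records are never edited in place.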

dougiesquire · 2022-12-14T05:58:08Z

The tool for extracting out repro hashes (or timing info) will always run.

I'm not sure I understand exactly what you mean by this, but I like the idea of using git to check/update changes, at least as a first pass.

As somewhere to store the repos containing the reproducibility and performance data?

Those and the test configurations (e.g. Payu configs, though I'm not sure how this would work for the rose/cylc stuff)

aidanheerdegen · 2022-12-14T06:16:35Z

When I say the tool for extracting the info will always run: it is a very cheap process, so I'd imagine it always being done, whether or not the reproducibility status is actually being checked. In a way this is passive reproducibility/provenance. The data is always generated and sometimes it is actively checked, but would always be in the git repo, say, so could be queried at a later date if required, e.g. forensic analysis to check how and when answers changed.
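As an illustration of how cheap this extraction can be, the sketch below hashes every file under an output directory into a manifest that could be committed after each run. Hashing whole files with SHA-256 is an assumption made here; the discussion does not specify what the repro hashes are computed from, and file-level hashes would also pick up differences in embedded metadata such as timestamps.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(output_dir):
    """Map each file's path (relative to output_dir) to its SHA-256 digest."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Writing the manifest to the same location in a data repository on every run would leave a history that can be inspected later, whether or not any test compared it at the time.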

dougiesquire · 2023-01-27T04:50:40Z

FYI, I'm starting to flesh out something here:

https://github.com/dougiesquire/morte

aidanheerdegen · 2023-01-27T05:20:31Z

Love the name. Conjures up post-mortem and of course ..

micaeljtoliveira · 2023-02-22T23:44:43Z

After having a chat with @aidanheerdegen regarding some scaling tests I've been running with MOM6, I realized you might be interested in some experience I had with a project I worked on in my previous life.

The project aimed to extract all possible information from calculations performed with a wide range of codes (>50), store that information in a database, and develop tools that make use of the data (e.g., machine learning, data mining, etc.). (If you're curious, you can have a look at it here and the corresponding code here.)

Happy to share some thoughts on how to best structure the code base, how to define meta-data specifications, writing model parsers, etc.

dougiesquire · 2023-02-23T00:04:35Z

Thanks @micaeljtoliveira - this looks like a really interesting (and big) project! Your thoughts and experience would be really valuable here. Maybe easiest to start with a chat in-person and go from there? I'll be in Canberra next week.

I'm also interested to hear about what you've been doing with MOM6, as performance testing is bundled in with what we're trying to achieve here.

access-hive-bot · 2023-11-20T01:34:36Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/porting-csiro-umui-access-esm1-5-ksh-run-script-to-payu/1611/3

penguian · 2024-02-05T05:26:50Z

Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.

I tried running the current mule-cumf to check the restart dump produced by my ESM1.5 runs. The archive.orig run uses the original coe executables and the archive.build-gadi.1 uses executables built using https://github.com/penguian/access-esm-build-gadi .

[pcl851@gadi-login-06 access-esm]$ diff -U0 archive.orig/access-esm/output000/config.yaml archive.build-gadi.1/access-esm/output000/config.yaml
--- archive.orig/access-esm/output000/config.yaml	2024-02-02 11:16:20.000000000 +1100
+++ archive.build-gadi.1/access-esm/output000/config.yaml	2024-02-05 11:05:24.000000000 +1100
@@ -13 +13 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/um7.3x
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/um_hg3.exe
@@ -20 +20 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/mom5xx
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/mom5xx
@@ -28 +28 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/cicexx
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/cice4.1_access-mct-12p-20240205

The mule-cumf tool states that both runs fail validation checks. Is there an earlier version of cumf that can be used to check the output of UM 7.3?

$ mule-cumf archive.orig/access-esm/restart000/atmosphere/restart_dump.astart archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
* (CUMF-II) Module Information *
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mule       : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py (version 2022.07.1)
um_utils   : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/__init__.py (version 2022.07.1)
um_packing : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_packing/__init__.py (version 2022.07.1) (packing lib from SHUMlib: 2023061)


/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: archive.orig/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/mule-cumf", line 10, in <module>
    sys.exit(_main())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 1385, in _main
    comparison = UMFileComparison(um_files[0], um_files[1])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 728, in __init__
    diff_field = difference_op([field_1, field_2])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 952, in __call__
    new_field = self.new_field(source, *args, **kwargs)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 293, in new_field
    data1 = fields[0].get_data()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 730, in get_data
    data = self._data_provider._data_array()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/ff.py", line 193, in _data_array
    data = np.fromstring(data_bytes, dtype, count=count)
ValueError: string is smaller than requested size

access-hive-bot · 2024-02-07T23:54:48Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/1

access-hive-bot · 2024-02-09T02:27:08Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/3

penguian · 2024-11-11T00:22:21Z

@aidanheerdegen Has ACCESS-NRI decided on how to use rose stem tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?

aidanheerdegen · 2024-11-11T04:34:35Z

@aidanheerdegen Has ACCESS-NRI decided on how to use rose stem tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?

No decision has been made. I need to read some documentation about rose stem testing, but if you had an example suite for testing on NCI that I could look at, that would be helpful.

penguian · 2024-11-13T05:52:31Z

@MartinDix uses rose stem to test each UM release. I believe that this rose stem test is just a subset of the tests in https://code.metoffice.gov.uk/trac/um/browser/main/trunk/rose-stem

MartinDix · 2024-11-13T21:26:09Z

The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.

penguian · 2024-11-13T23:52:31Z

The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.

That said, if we intend to contribute code changes upstream, we would probably also need corresponding rose stem tests.

aidanheerdegen changed the title ~~Create framework for extracting, verifying and updating model reproducibility~~ Create framework for extracting, verifying and updating model reproducibility and performance Sep 15, 2022

aidanheerdegen mentioned this issue Sep 15, 2022

Refine existing reproducibility tests ACCESS-NRI/reproducibility#3

Closed

dougiesquire mentioned this issue Dec 8, 2022

ACCESS-NRI Development Meeting 12/12/2022 ACCESS-NRI/team-meetings#10

Open

aidanheerdegen assigned dougiesquire Feb 10, 2023

aidanheerdegen mentioned this issue Nov 21, 2023

Add CI workflow ACCESS-NRI/access-om2-configs#1

Closed

This was referenced Feb 26, 2024

Add 0.1 degree model configurations ACCESS-NRI/access-om2-configs#16

Closed

Add 0.25 degree model configurations ACCESS-NRI/access-om2-configs#17

Closed

Add 1 degree BGC configurations ACCESS-NRI/access-om2-configs#19

Closed

aidanheerdegen mentioned this issue May 1, 2024

Test updated software stack ACCESS-NRI/access-om2-configs#107

Open

aidanheerdegen transferred this issue from ACCESS-NRI/reproducibility Nov 6, 2024
