Add restart reproducibility test into repro CI #30
Enable restart reproducibility checksum checks - as detailed in #11

I've added the `checksum` marker to the restart checksum tests, and added a model-specific checksum comparison method (as the format of the checksums may differ depending on the model). For the access-om2 model, it is basically doing the same check as before, i.e. any field and checksum value in the 2-day model run is in either the first or the second 1-day model run.

I have manually tested it with a `1deg_jra55_ryf` configuration where I've just updated the executable paths to point to the corresponding pre-release locations with the fix - config.yaml used. Currently it's failing the `test_bit_repro_historical` test - so the checksums (historical-3hr-checksum.json) are different to the checksums saved on the release configuration branch. Is that to be expected?

I've also noticed the tests are taking ages to run - as it's mostly waiting in queues - and wondered whether it's worth dropping the qsub walltime and memory for short tests. But that might not be the best idea, as different configurations may need varying resources.
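For illustration, here is a minimal sketch of the access-om2 restart comparison described above, assuming each checksum file parses to a JSON mapping of field names to lists of checksum values; the actual file layout and function names in the test suite may differ.

```python
import json
from pathlib import Path


def load_checksums(path: Path) -> dict:
    """Read a checksum JSON file into a {field: [checksum, ...]} mapping."""
    with open(path) as f:
        return json.load(f)


def restart_checksums_reproduce(
    two_day_file: Path, first_day_file: Path, second_day_file: Path
) -> bool:
    """Return True if every (field, checksum) pair from the 2-day run
    appears in either the first or the second 1-day run."""
    two_day = load_checksums(two_day_file)
    one_day_runs = [load_checksums(first_day_file), load_checksums(second_day_file)]

    for field, values in two_day.items():
        for value in values:
            if not any(value in run.get(field, []) for run in one_day_runs):
                return False
    return True
```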
With the `checksum` marker added, this pytest will be run as part of the reproducibility checks. It outputs some checksum files if it fails. I haven't added them to the `checksum` output sub-directory as that is `rsync`ed to GitHub - should it be `rsync`ed to GitHub? I don't think it should be added to `testing/checksum` on each configuration branch. Another todo is maybe updating the reproducibility fail messages in the CI, as the test can now fail on either the historical checksums or the restart test.
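As a rough sketch of how the `checksum` marker could tie into the reproducibility checks, assuming the marker is registered in pytest.ini (or similar) and the CI selects it with `pytest -m checksum`; the test name and checksum data below are placeholders, not the real test.

```python
import pytest


@pytest.mark.checksum
def test_restart_repro_example():
    # Placeholder data standing in for the model-specific comparison: every
    # (field, checksum) pair from the 2-day run must appear in one of the
    # two 1-day runs.
    two_day = {"temp": ["abc123"], "salt": ["def456"]}
    one_day_runs = [{"temp": ["abc123"]}, {"salt": ["def456"]}]
    for field, values in two_day.items():
        for value in values:
            assert any(value in run.get(field, []) for run in one_day_runs), (
                f"Checksum for {field} not reproduced by the restart runs"
            )
```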