-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add irt test #1837
Add irt test #1837
Conversation
scripts/lib/update_acme_tests.py
Outdated
@@ -42,8 +42,8 @@ | |||
"ERI.f09_g16.X", | |||
"ERIO.f09_g16.X", | |||
"SEQ_Ln9.f19_g16_rx1.A", | |||
"ERS.ne30_g16_rx1.A", | |||
"ERS_N2.f19_g16_rx1.A", | |||
"IRT.ne30_g16_rx1.A", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we don't have any ERS tests in cime_developer. That can't be good, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IRT is an expansion on ERS - it does the same thing + more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jgfouca that there should be at least one ERS test in cime_developer, partly to make sure that the ERS test itself doesn't get broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline comments. There are a lot of comments, but I think they should generally be straightforward to address.
Also, I'd like to add unit tests for some of the new behavior in system_tests_compare_two, but I can do that in a follow-on PR.
@@ -21,6 +21,17 @@ | |||
that's needed in both cases. This is called before _case_one_setup or | |||
_case_two_setup. | |||
|
|||
(2) _case_one_custom_prerun_action(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above, change "MAY require the following method" to "MAY require the following methods" (i.e., make 'method' plural)
scripts/lib/update_acme_tests.py
Outdated
@@ -42,8 +42,8 @@ | |||
"ERI.f09_g16.X", | |||
"ERIO.f09_g16.X", | |||
"SEQ_Ln9.f19_g16_rx1.A", | |||
"ERS.ne30_g16_rx1.A", | |||
"ERS_N2.f19_g16_rx1.A", | |||
"IRT.ne30_g16_rx1.A", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jgfouca that there should be at least one ERS test in cime_developer, partly to make sure that the ERS test itself doesn't get broken.
scripts/lib/CIME/utils.py
Outdated
@@ -1224,6 +1224,11 @@ def _get_most_recent_lid_impl(files): | |||
|
|||
return sorted(results) | |||
|
|||
def sorted_ls(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this function should indicate that this is sorted by timestamp - maybe something like ls_sorted_by_mtime
@@ -363,6 +363,7 @@ def test_run_phase_internal_calls(self): | |||
Call(METHOD_component_compare_test, | |||
{'suffix1': run_one_suffix, 'suffix2': run_two_suffix}) | |||
] | |||
self.maxDiff = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What purpose does this serve?
@@ -36,10 +47,11 @@ class SystemTestsCompareTwo(SystemTestsCommon): | |||
|
|||
def __init__(self, | |||
case, | |||
separate_builds, | |||
separate_builds=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if this remained a required argument so that test developers have to actively think of whether this should be True or False. (I was okay with @jgfouca changing it in his PR for symmetry with separate_run_dirs, but now that separate_run_dirs is gone, I'd prefer if this remained as it was before.)
scripts/lib/CIME/SystemTests/irt.py
Outdated
|
||
def _case_two_setup(self): | ||
stop_n = self._case1.get_value("STOP_N") | ||
rest_n = stop_n/2 + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not self._case1.get_value("REST_N")
- more robust if the formula changes, so you don't need to keep it in sync in these two places
scripts/lib/CIME/SystemTests/irt.py
Outdated
self._case.set_value("REST_OPTION", "never") | ||
# both cases need to refer to same archive directory | ||
self._case.set_value("DOUT_S_ROOT", self._case1.get_value("DOUT_S_ROOT")) | ||
case_setup(self._case) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is case_setup
needed?
scripts/lib/CIME/SystemTests/irt.py
Outdated
self._case.set_value("STOP_N", stop_new) | ||
self._case.set_value("CONTINUE_RUN",True) | ||
self._case.set_value("REST_OPTION", "never") | ||
# both cases need to refer to same archive directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do both cases need to refer to the same archive directory? It seems like you could avoid setting this and instead, in _case_two_custom_prerun_action
, get self._case1.get_value("DOUT_S_ROOT")
. If that works, it seems that better communicates what's really going on.
scripts/lib/CIME/SystemTests/irt.py
Outdated
case_st_archive(self._case) | ||
|
||
def _case_two_custom_prerun_action(self): | ||
dout_s_root = self._case.get_value("DOUT_S_ROOT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted above, can you change this to self._case1.get_value("DOUT_S_ROOT")
?
scripts/lib/CIME/SystemTests/irt.py
Outdated
|
||
def _case_two_custom_prerun_action(self): | ||
dout_s_root = self._case.get_value("DOUT_S_ROOT") | ||
restart_list = sorted_ls(os.path.join(dout_s_root,"rest")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an expect
to ensure that the restart_list has more than 1 item in it.
@jgfouca @billsacks I think that all of your concerns have now been addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedwards4b thanks for addressing all of my earlier comments. I have added two new comments on your latest changes.
config/acme/config_archive.xml
Outdated
@@ -1,197 +0,0 @@ | |||
<components version="2.0"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure it's safe to remove these files? It looks like you may be assuming that this information has been split out into components, but I don't see config_component.xml files for any active CESM components.
Even if it is safe to remove these files: It seems the removal of this file and the similar cesm file should be in a separate PR. Or, if there is really a connection to this PR, then this removal should at least be noted in the top-level comments of the PR.
Also flagging @bertinia on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the file config_files.xml and the entry ARCHIVE_SPEC_FILE there all config_archive.xml files are distributed to components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it looks like that's wrong:
[roo2:~/cesm_code/cesm2_0_alpha07]$ find . -name config_archive.xml
./cime/config/acme/config_archive.xml
./cime/config/cesm/config_archive.xml
./cime/src/components/data_comps/datm/cime_config/config_archive.xml
./cime/src/components/data_comps/dice/cime_config/config_archive.xml
./cime/src/components/data_comps/dlnd/cime_config/config_archive.xml
./cime/src/components/data_comps/docn/cime_config/config_archive.xml
./cime/src/components/data_comps/drof/cime_config/config_archive.xml
./cime/src/components/data_comps/dwav/cime_config/config_archive.xml
./cime/src/drivers/mct/cime_config/config_archive.xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe ARCHIVE_SPEC_FILE is trying to be future-proof for whenever the components do distribute those files.
config/config_tests.xml
Outdated
@@ -273,6 +273,19 @@ NODEFAIL Tests restart upon detected node failure. Generates fake failu | |||
<CONTINUE_RUN>FALSE</CONTINUE_RUN> | |||
</test> | |||
|
|||
<test NAME="IRT"> | |||
<DESC>exact restart from startup, default 4 days + 7 days with restart from interum file</DESC> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interum should be interim
I approve now, thanks. |
@jgfouca I would like to complete a test IRT_Ld5.f09_g17.B1850.cheyenne_intel.allactive-defaultio |
OK |
@jgfouca test passes. Ready to merge. |
The multisubmit bug related to the indentation of a block: this was mistakenly being done for both phases of a multi-submit, whereas it only should have been done after the second phase. ESMCI#1830 had this correct, but the indentation was incorrect in ESMCI#1837, which is what came to master.
High MPI overhead occurs on some systems the first time that two processes communicate. In typical usage there are two types of nonlocal message patterns: one based off of the root of each component and one (in PIO) based off of root+1. By changing the default for PIO_ROOT from one to zero, this start-up overhead is approximately halved. Using a PIO_ROOT value of zero also allows the default performance timing settings to better capture PIO performance. Finally, logically a PIO_ROOT value of one has no special advantage over zero with current multi- and many-core processor architectures. This has been verified in recent performance benchmarking on multiple system and multiple cases, with a PIO_ROOT value of zero performing better than a PIO_ROOT of one even in the RUN LOOP. [BFB] [NML] * worleyph/cime/pio_root_zero_as_default: Set PIO_ROOT defaults to be zero instead of one Conflicts: cime/src/drivers/mct/cime_config/config_component.xml
High MPI overhead occurs on some systems the first time that two processes communicate. In typical usage there are two types of nonlocal message patterns: one based off of the root of each component and one (in PIO) based off of root+1. By changing the default for PIO_ROOT from one to zero, this start-up overhead is approximately halved. Using a PIO_ROOT value of zero also allows the default performance timing settings to better capture PIO performance. Finally, logically a PIO_ROOT value of one has no special advantage over zero with current multi- and many-core processor architectures. This has been verified in recent performance benchmarking on multiple system and multiple cases, with a PIO_ROOT value of zero performing better than a PIO_ROOT of one even in the RUN LOOP. [BFB] [NML] * worleyph/cime/pio_root_zero_as_default: Set PIO_ROOT defaults to be zero instead of one Conflicts: cime/src/drivers/mct/cime_config/config_component.xml
High MPI overhead occurs on some systems the first time that two processes communicate. In typical usage there are two types of nonlocal message patterns: one based off of the root of each component and one (in PIO) based off of root+1. By changing the default for PIO_ROOT from one to zero, this start-up overhead is approximately halved. Using a PIO_ROOT value of zero also allows the default performance timing settings to better capture PIO performance. Finally, logically a PIO_ROOT value of one has no special advantage over zero with current multi- and many-core processor architectures. This has been verified in recent performance benchmarking on multiple system and multiple cases, with a PIO_ROOT value of zero performing better than a PIO_ROOT of one even in the RUN LOOP. [BFB] [NML] * worleyph/cime/pio_root_zero_as_default: Set PIO_ROOT defaults to be zero instead of one Conflicts: cime/src/drivers/mct/cime_config/config_component.xml
Adds IRT (Interum Restart Test) using the system_test_compare_two methodology
This test revealed a bug in the driver config_archive.xml file which I also fixed.
Changed cime_developer tests from ERS to IRT
Test suite: scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #594
Fixes #1731
Fixes #1653
Fixes #1813
User interface changes?:
Update gh-pages html (Y/N)?:
Code review: