Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add irt test #1837

Merged
merged 8 commits into from
Aug 25, 2017
Merged

Add irt test #1837

merged 8 commits into from
Aug 25, 2017

Conversation

jedwards4b
Copy link
Contributor

@jedwards4b jedwards4b commented Aug 24, 2017

Adds IRT (Interum Restart Test) using the system_test_compare_two methodology
This test revealed a bug in the driver config_archive.xml file which I also fixed.
Changed cime_developer tests from ERS to IRT

Test suite: scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #594
Fixes #1731
Fixes #1653
Fixes #1813
User interface changes?:

Update gh-pages html (Y/N)?:

Code review:

@@ -42,8 +42,8 @@
"ERI.f09_g16.X",
"ERIO.f09_g16.X",
"SEQ_Ln9.f19_g16_rx1.A",
"ERS.ne30_g16_rx1.A",
"ERS_N2.f19_g16_rx1.A",
"IRT.ne30_g16_rx1.A",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we don't have any ERS tests in cime_developer. That can't be good, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IRT is an expansion on ERS - it does the same thing + more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jgfouca that there should be at least one ERS test in cime_developer, partly to make sure that the ERS test itself doesn't get broken.

Copy link
Contributor

@jgfouca jgfouca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline.

Copy link
Member

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments. There are a lot of comments, but I think they should generally be straightforward to address.

Also, I'd like to add unit tests for some of the new behavior in system_tests_compare_two, but I can do that in a follow-on PR.

@@ -21,6 +21,17 @@
that's needed in both cases. This is called before _case_one_setup or
_case_two_setup.

(2) _case_one_custom_prerun_action(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above, change "MAY require the following method" to "MAY require the following methods" (i.e., make 'method' plural)

@@ -42,8 +42,8 @@
"ERI.f09_g16.X",
"ERIO.f09_g16.X",
"SEQ_Ln9.f19_g16_rx1.A",
"ERS.ne30_g16_rx1.A",
"ERS_N2.f19_g16_rx1.A",
"IRT.ne30_g16_rx1.A",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jgfouca that there should be at least one ERS test in cime_developer, partly to make sure that the ERS test itself doesn't get broken.

@@ -1224,6 +1224,11 @@ def _get_most_recent_lid_impl(files):

return sorted(results)

def sorted_ls(path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this function should indicate that this is sorted by timestamp - maybe something like ls_sorted_by_mtime

@@ -363,6 +363,7 @@ def test_run_phase_internal_calls(self):
Call(METHOD_component_compare_test,
{'suffix1': run_one_suffix, 'suffix2': run_two_suffix})
]
self.maxDiff = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose does this serve?

@@ -36,10 +47,11 @@ class SystemTestsCompareTwo(SystemTestsCommon):

def __init__(self,
case,
separate_builds,
separate_builds=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if this remained a required argument so that test developers have to actively think of whether this should be True or False. (I was okay with @jgfouca changing it in his PR for symmetry with separate_run_dirs, but now that separate_run_dirs is gone, I'd prefer if this remained as it was before.)


def _case_two_setup(self):
stop_n = self._case1.get_value("STOP_N")
rest_n = stop_n/2 + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not self._case1.get_value("REST_N") - more robust if the formula changes, so you don't need to keep it in sync in these two places

self._case.set_value("REST_OPTION", "never")
# both cases need to refer to same archive directory
self._case.set_value("DOUT_S_ROOT", self._case1.get_value("DOUT_S_ROOT"))
case_setup(self._case)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is case_setup needed?

self._case.set_value("STOP_N", stop_new)
self._case.set_value("CONTINUE_RUN",True)
self._case.set_value("REST_OPTION", "never")
# both cases need to refer to same archive directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do both cases need to refer to the same archive directory? It seems like you could avoid setting this and instead, in _case_two_custom_prerun_action, get self._case1.get_value("DOUT_S_ROOT"). If that works, it seems that better communicates what's really going on.

case_st_archive(self._case)

def _case_two_custom_prerun_action(self):
dout_s_root = self._case.get_value("DOUT_S_ROOT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, can you change this to self._case1.get_value("DOUT_S_ROOT")?


def _case_two_custom_prerun_action(self):
dout_s_root = self._case.get_value("DOUT_S_ROOT")
restart_list = sorted_ls(os.path.join(dout_s_root,"rest"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an expect to ensure that the restart_list has more than 1 item in it.

@jedwards4b
Copy link
Contributor Author

@jgfouca @billsacks I think that all of your concerns have now been addressed.

Copy link
Member

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b thanks for addressing all of my earlier comments. I have added two new comments on your latest changes.

@@ -1,197 +0,0 @@
<components version="2.0">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it's safe to remove these files? It looks like you may be assuming that this information has been split out into components, but I don't see config_component.xml files for any active CESM components.

Even if it is safe to remove these files: It seems the removal of this file and the similar cesm file should be in a separate PR. Or, if there is really a connection to this PR, then this removal should at least be noted in the top-level comments of the PR.

Also flagging @bertinia on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the file config_files.xml and the entry ARCHIVE_SPEC_FILE there all config_archive.xml files are distributed to components.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it looks like that's wrong:

[roo2:~/cesm_code/cesm2_0_alpha07]$ find . -name config_archive.xml
./cime/config/acme/config_archive.xml
./cime/config/cesm/config_archive.xml
./cime/src/components/data_comps/datm/cime_config/config_archive.xml
./cime/src/components/data_comps/dice/cime_config/config_archive.xml
./cime/src/components/data_comps/dlnd/cime_config/config_archive.xml
./cime/src/components/data_comps/docn/cime_config/config_archive.xml
./cime/src/components/data_comps/drof/cime_config/config_archive.xml
./cime/src/components/data_comps/dwav/cime_config/config_archive.xml
./cime/src/drivers/mct/cime_config/config_archive.xml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe ARCHIVE_SPEC_FILE is trying to be future-proof for whenever the components do distribute those files.

@@ -273,6 +273,19 @@ NODEFAIL Tests restart upon detected node failure. Generates fake failu
<CONTINUE_RUN>FALSE</CONTINUE_RUN>
</test>

<test NAME="IRT">
<DESC>exact restart from startup, default 4 days + 7 days with restart from interum file</DESC>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interum should be interim

@billsacks
Copy link
Member

I approve now, thanks.

@jedwards4b
Copy link
Contributor Author

@jgfouca I would like to complete a test IRT_Ld5.f09_g17.B1850.cheyenne_intel.allactive-defaultio
before you merge. I'll let you know.

@jgfouca
Copy link
Contributor

jgfouca commented Aug 25, 2017

OK

@jedwards4b
Copy link
Contributor Author

@jgfouca test passes. Ready to merge.

@jgfouca jgfouca merged commit fab7ccd into ESMCI:master Aug 25, 2017
@jedwards4b jedwards4b deleted the add_irt_test branch August 25, 2017 20:19
billsacks added a commit to billsacks/cime that referenced this pull request Aug 31, 2017
The multisubmit bug related to the indentation of a block: this was
mistakenly being done for both phases of a multi-submit, whereas it only
should have been done after the second
phase. ESMCI#1830 had this correct, but the
indentation was incorrect in ESMCI#1837,
which is what came to master.
jgfouca pushed a commit that referenced this pull request Nov 7, 2017
High MPI overhead occurs on some systems the first time that
two processes communicate. In typical usage there are two types
of nonlocal message patterns: one based off of the root of each
component and one (in PIO) based off of root+1. By changing
the default for PIO_ROOT from one to zero, this start-up overhead
is approximately halved.

Using a PIO_ROOT value of zero also allows the default performance
timing settings to better capture PIO performance. Finally,
logically a PIO_ROOT value of one has no special advantage over
zero with current multi- and many-core processor architectures.
This has been verified in recent performance benchmarking on multiple
system and multiple cases, with a PIO_ROOT value of zero performing
better than a PIO_ROOT of one even in the RUN LOOP.

[BFB]
[NML]

* worleyph/cime/pio_root_zero_as_default:
  Set PIO_ROOT defaults to be zero instead of one

Conflicts:
	cime/src/drivers/mct/cime_config/config_component.xml
jgfouca pushed a commit that referenced this pull request Feb 23, 2018
High MPI overhead occurs on some systems the first time that
two processes communicate. In typical usage there are two types
of nonlocal message patterns: one based off of the root of each
component and one (in PIO) based off of root+1. By changing
the default for PIO_ROOT from one to zero, this start-up overhead
is approximately halved.

Using a PIO_ROOT value of zero also allows the default performance
timing settings to better capture PIO performance. Finally,
logically a PIO_ROOT value of one has no special advantage over
zero with current multi- and many-core processor architectures.
This has been verified in recent performance benchmarking on multiple
system and multiple cases, with a PIO_ROOT value of zero performing
better than a PIO_ROOT of one even in the RUN LOOP.

[BFB]
[NML]

* worleyph/cime/pio_root_zero_as_default:
  Set PIO_ROOT defaults to be zero instead of one

Conflicts:
	cime/src/drivers/mct/cime_config/config_component.xml
jgfouca pushed a commit that referenced this pull request Mar 13, 2018
High MPI overhead occurs on some systems the first time that
two processes communicate. In typical usage there are two types
of nonlocal message patterns: one based off of the root of each
component and one (in PIO) based off of root+1. By changing
the default for PIO_ROOT from one to zero, this start-up overhead
is approximately halved.

Using a PIO_ROOT value of zero also allows the default performance
timing settings to better capture PIO performance. Finally,
logically a PIO_ROOT value of one has no special advantage over
zero with current multi- and many-core processor architectures.
This has been verified in recent performance benchmarking on multiple
system and multiple cases, with a PIO_ROOT value of zero performing
better than a PIO_ROOT of one even in the RUN LOOP.

[BFB]
[NML]

* worleyph/cime/pio_root_zero_as_default:
  Set PIO_ROOT defaults to be zero instead of one

Conflicts:
	cime/src/drivers/mct/cime_config/config_component.xml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants