System Testing Guide

Samuel Levis edited this page May 29, 2024 · 18 revisions

If you are new to system testing with create_test, we recommend you read this whole guide linearly. You may find that you never run create_test directly, instead relying on CTSM's run_sys_tests wrapper script. However, it is still helpful to know about create_test, since run_sys_tests is just a wrapper around that underlying script.

If you want to jump right in with running a test suite, and/or if you already understand the CESM/CIME test system well, you can jump down to Running test suites with the run_sys_tests wrapper.

If you know all about how to use these testing tools, but have just been asked to act as an integrator – doing final testing before bringing a branch to master – you can jump down to Notes for integrators.

System tests are useful for:

  • Verifying various requirements
    • Runs to completion
    • Restarts bit-for-bit
    • Results independent of processor count
    • Threading
    • Compilation with debug flags, e.g., to pick up:
      • Array bounds problems
      • Floating point errors
    • And other specialty tests (e.g., init_interp)
  • Verifying those requirements across a wide range of model configurations (e.g., making sure CTSM still works when you turn on prognostic crops)
  • Making sure that you haven't introduced a bug that changes answers in some configurations, when answer changes are unexpected
    • This is one of the most powerful aspects of the system tests
    • For this to work best, you should try to separate your changes into:
      • Bit-for-bit refactoring
      • Answer-changing modifications that are as small as possible, and so can be carefully reviewed

The cime test system runs tests that involve:

  1. Doing one or more runs of the model and verifying that they run to completion. (If a test involves more than one run, some aspect of the configuration changes between the runs.)
  2. If the test involves more than one run, comparing those runs to ensure they are bit-for-bit identical where expected.
  3. If desired, comparing results with existing baselines (to ensure results are bit-for-bit the same as before), and/or generating new baselines.
  4. Providing final test results: An overall PASS/FAIL as well as PASS/FAIL status for individual parts of the test.

A test name looks like this; bracketed components are optional:

Testtype[_Testopt].Resolution.Compset.Machine_Compiler[.Testmod]

(There may be more than one Testopt, separated by underscores.)

Notice that this string specifies all required options to create_newcase.

An example is:

SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default

Testtype: A code specifying the type of test to run; common test types are described below. The SMS test in the above example is a basic smoke test.

Testopt: One or more options modifying high-level aspects of the test. In the above example, we are compiling in debug mode (_D) and running for 3 days (_Ld3).

Resolution: Any resolution that you would typically specify via the --res argument to create_newcase. In the above example, we are using the f10_f10_musgs resolution, which is a coarse-resolution global grid that is good for testing.

Compset: Any compset that you would typically specify via the --compset option to create_newcase. In the above example, we are using the I1850Clm50BgcCrop compset.

Machine: The name of the machine you are running on (cheyenne in the above example).

Compiler: The name of the compiler to use (intel in the above example).

Testmod: A directory containing arbitrary user_nl_* contents and xmlchange commands. See below for more details.

The following are the most commonly used test types and their meaning:

SMS: Basic smoke test: Just does a single run

ERS: Exact restart test: Compares two runs, ensuring that they give bit-for-bit identical results:

  1. Straight-through run, which writes a restart file just over half-way through
  2. Restart run starting from the restart file written by (1)

ERP: Exact restart with changed processor count: This covers the exact restart functionality of the ERS test, and also halves the processor count in run (2). In addition, if multiple threads are used, it also halves the thread count in run (2). Thus, in addition to ensuring that restarts are bit-for-bit, it also ensures that answers do not depend on processor count, and optionally that answers do not depend on threading. This is nice in that a single test can verify several of our most important system requirements. However, when the test fails, it can sometimes be harder to track down the cause of the problem. (To debug a failed ERP test, you can run the same configuration in an ERS, PEM and/or PET test.)
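
For example, if an ERP test fails, the same configuration can be rerun as separate, narrower tests to isolate the cause. (A sketch; the test names below simply reuse the example configuration from above.)

```shell
# Illustrative: decompose a failing ERP test into narrower tests.
# ERS isolates restart behavior; PEM isolates processor-count changes;
# PET isolates threading.
./create_test ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
./create_test PEM_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
./create_test PET_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
```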

The following are the most commonly used test options (optional strings appearing after the test type, separated by _):

_D: Compile in debug mode. Exactly what this does depends on the compiler. Typically, this turns on checks for array bounds and various floating point traps. The model will run significantly slower with this option.

_L: Specifies the length of the run. The default for most tests is 5 days. Examples are _Ld3 (3 days), _Lm6 (6 months), and _Ly5 (5 years).

_P: Specifies the processor count of the run. Syntax is _PNxM where N is the number of tasks and M is the number of threads per task. For example, _P32x2 runs with 32 tasks and 2 threads per task. Default layouts of standalone CTSM all have just 1 thread per task, but the ability to run with threading (and get bit-for-bit identical answers) is an important requirement. Thus, many of our tests (and particularly ERP tests) specify processor layouts that use 2 threads per task.

Few CTSM tests simply run an out-of-the-box compset without any other modifications. Testmods provide a facility to make arbitrary changes to xml and namelist variables for this particular test. They typically serve two purposes:

  1. Adding more frequent history output, additional history streams, and/or additional history variables. The more frequent history output is particularly important, since otherwise a short (e.g., 5-day) test would not produce any CTSM diagnostic output (since the default output frequency is monthly).
  2. Making configuration changes specific to this test, such as turning on a non-default parameterization option.

Testmods directories are assumed to be in cime_config/testdefs/testmods_dirs. Dashes are used in place of slashes in the path relative to that directory. So a testmod of clm-default is found in cime_config/testdefs/testmods_dirs/clm/default/.
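
The dash-to-slash mapping can be sketched as a tiny shell function (a hypothetical helper for illustration only; this is not part of CIME):

```shell
# Hypothetical helper: map a testmod name to its directory by
# replacing each dash with a slash (not a real CIME script).
testmod_dir() {
    echo "cime_config/testdefs/testmods_dirs/${1//-//}"
}

testmod_dir clm-default   # prints cime_config/testdefs/testmods_dirs/clm/default
```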

Testmods directories can contain three types of files:

  • user_nl_* files: The contents of these files are copied into the appropriate user_nl file (e.g., user_nl_clm) in the case directory. This allows you to set namelist options.

  • shell_commands: This file can contain xmlchange commands that change the values of xml variables in the case.

  • include_user_mods: Often you want a testmod that is basically the same as some other testmod, but with a few extra changes. For example, many of our testmods use the default testmod as a starting point, then add a few things on top of that. include_user_mods allows you to set up these relationships without resorting to unmaintainable copy & paste. This file contains the relative path to another testmod directory to include; for example, its contents may be:

    ../default
    

    First, the user_nl_* and shell_commands contents from the included testmod are applied, then the contents from the current testmod are applied. (So changes from the current testmod take precedence in case of conflicts.)

    These includes are applied recursively, if you include a directory that itself has an include_user_mods file. Also, in principle, an include_user_mods file can include multiple testmods (one per line), but in practice we rarely do that, because it tends to be more confusing than helpful.
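
Putting this together, a new testmod that builds on the default one might be created like this (the directory name `mytest` and the namelist setting are hypothetical examples):

```shell
# Illustrative sketch: create a hypothetical testmod "clm-mytest" that
# includes the default testmod and adds one namelist setting on top.
moddir=cime_config/testdefs/testmods_dirs/clm/mytest
mkdir -p "$moddir"

# Include the default testmod as a starting point
echo "../default" > "$moddir/include_user_mods"

# Namelist settings applied on top of the include (example setting:
# daily history output)
cat > "$moddir/user_nl_clm" <<'EOF'
hist_nhtfrq = -24
EOF
```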

Running a single test is as simple as doing the following from cime/scripts:

./create_test TESTNAME

For example:

./create_test SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default

In contrast to create_newcase, create_test automatically runs case.setup, case.build and case.submit for you - so that single create_test command will build and run your case.

A full list of possible options to create_test can be viewed by running create_test -h. Here are some of the most useful options:

  • -r /path/to/test/root: By default, the test's case directory is placed in the directory given by CIME_OUTPUT_ROOT (e.g., /glade/scratch/$USER on cheyenne). This has the benefit that the bld and run directories are nested under the case directory. However, if your scratch space is cluttered, this can make it hard to find your test cases later. If you specify a different directory with the -r (or --test-root) option, your test cases will appear there, instead. Specifying -r . will put your test cases in the current directory (analogous to the operation of create_newcase). This option is particularly useful when running large test suites: We often find it useful to put all tests within a given test suite within a subdirectory of CIME_OUTPUT_ROOT - for example, -r /glade/scratch/$USER/HELPFULLY_NAMED_SUBDIRECTORY.
  • --walltime HH:MM: By default, the maximum queue wallclock time for each test is generally the maximum allowed for the machine. Since tests are generally short, using this default may result in your jobs sitting in the queue longer than is necessary. You can use the --walltime option to specify a shorter queue wallclock time, thus allowing your jobs to get through the queue faster. However, note that all tests will use the same maximum walltime, so be sure to pick a time long enough for the longest test in a test suite. (Note: If you are running a full test suite with the xml options documented below, walltime limits may already be specified on a per-test basis. However, as of the time of this writing, this capability is not yet used for the CTSM test suites.)

As a test runs through its various phases (setup, build, run, etc.), it updates a file named TestStatus in the test's case directory. After a test completes, a typical TestStatus file will look like this:

PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=175
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=96
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=606
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test

(This is from a test that had comparisons with baselines, which we have not described yet.)

The three possible status codes you may see are:

  • PASS: This phase finished successfully
  • FAIL: This phase finished with an error
  • PEND: This phase is currently running, or has not yet started. (If a given phase is listed as PEND, subsequent phases may not be listed yet in the TestStatus file.)

By the time a test completes, you should typically see all PASS status values to indicate that the test completed successfully. However, we often ignore FAIL values for TPUTCOMP and MEMCOMP (which compare throughput and memory usage with the baseline), because system variability can cause these to fail even when there isn't a real problem.

More detailed test output can be found in the file named TestStatus.log in the test's case directory. This is the first place you should look if a test has failed.
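
For a quick overview across several tests, you can also grep the TestStatus files directly (a simple sketch; assumes you run it from the directory containing your test cases):

```shell
# Quick scan: list any FAIL or PEND lines across all tests under the
# current test root.
grep -E '^(FAIL|PEND)' */TestStatus
```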

Many test types perform two runs and then compare the output from the two, expecting bit-for-bit identical output. For example, an ERS test compares a straight-through run with a restart run. The comparison is done by comparing the last set of history files from each run. (If, for example, there are h0 and h1 history files, then this will compare both the last h0 file and the last h1 file.) These comparisons are done via a custom tool named cprnc, which compares each field and, if differences are found, computes various statistics on these differences.

If any one of these comparisons fails, you will see a line like:

FAIL ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest

As usual, more details can be found in TestStatus.log, where you will find output like this:

2017-09-26 10:10:24: Comparing hists for case 'ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud' dir1='/glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run', suffix1='base',  dir2='/glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run' suffix2='rest'
  comparing model 'datm'
    no hist files found for model datm
  comparing model 'clm'
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.base.cprnc.out
  comparing model 'sice'
    no hist files found for model sice
  comparing model 'socn'
    no hist files found for model socn
  comparing model 'mosart'
    no hist files found for model mosart
  comparing model 'cism'
    no hist files found for model cism
  comparing model 'swav'
    no hist files found for model swav
  comparing model 'cpl'
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.base.cprnc.out
FAIL

Notice the lines that say did NOT match. Also notice the lines pointing you to various *.cprnc.out files. (For convenience, *.cprnc.out files from failed comparisons are also copied to the case directory.) These output files from cprnc contain a lot of information. Most of what you need, though, can be determined via:

  1. Examining the last 10 or so lines:

    $ tail -10 ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
    
    SUMMARY of cprnc:
     A total number of    487 fields were compared
              of which    340 had non-zero differences
                   and      0 had differences in fill patterns
                   and      0 had different dimension sizes
     A total number of      2 fields could not be analyzed
     A total number of      0 fields on file 1 were not found on file2.
      diff_test: the two files seem to be DIFFERENT
    
  2. Looking for lines referencing RMS errors:

    $ grep RMS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
     RMS ACTUAL_IMMOB                     3.4138E-11            NORMALIZED  1.1947E-04
     RMS AGNPP                            3.9135E-14            NORMALIZED  1.0836E-08
     RMS AR                               1.4793E-10            NORMALIZED  1.2585E-05
     RMS BAF_PEATF                        6.9713E-23            NORMALIZED  2.4249E-12
     RMS BGNPP                            3.2774E-14            NORMALIZED  9.1966E-09
     RMS BTRAN2                           2.5167E-07            NORMALIZED  2.7111E-07
     RMS BTRANMN                          2.5532E-07            NORMALIZED  6.0307E-07
     RMS CH4PROD                          1.3658E-15            NORMALIZED  7.5109E-08
     RMS CH4_SURF_AERE_SAT                6.6191E-12            NORMALIZED  1.6114E-04
     RMS CH4_SURF_AERE_UNSAT              1.2635E-22            NORMALIZED  5.1519E-13
     ...
    

Notice that this lists all fields that differ, along with their RMS and normalized RMS differences.

It is often useful to run multiple tests at once (i.e., a test suite), covering different test types, different compsets, different compilers, etc.

This can be done by simply listing each test on the create_test command-line, as in:

./create_test SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default

However, it is often more convenient to create a file listing each of the tests you want to run. This way you can easily run the same test suite again later.

To do this, simply create a text file containing your test list, with one test per line:

SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default

Then run create_test with the -f (or --testfile) option:

./create_test -f TESTFILE

(where TESTFILE gives the path to the file you just created).

The -r and --walltime options described in Options to create_test are useful here, too. The -r option is particularly helpful for putting all of the tests in the test suite together in their own directory.
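
For example, the test file can be created with a heredoc (the filename my_testlist.txt is arbitrary):

```shell
# Create a test list file, one test name per line
cat > my_testlist.txt <<'EOF'
SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
EOF
```

Then run, e.g., `./create_test -f my_testlist.txt -r /glade/scratch/$USER/my_suite`.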

You can check the individual TestStatus files in each test of your test suite, but that gets old pretty quickly. An easier way to check the results of a test suite is to run the cs.status.TESTID command that is put in your test root (where TESTID is the unique id that was used for this test suite).

If you run this cs.status command, you will see output like the following:

20170926_093725_gq431o
  ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=175
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=96
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=606
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test
  SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=16
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=202
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=374
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test

This aggregates the results of all of the tests in the test suite, and also gives an Overall PASS or FAIL result for each test. Reviewing this output manually can be tedious, so some options can help you filter the results:

  • The -f / --fails-only option to cs.status allows you to see only test failures
  • The --count-performance-fails option suppresses line-by-line output for performance comparisons that often fail due to machine variability; instead, this just gives a count of the number of non-PASS results (FAIL or PEND) at the bottom.
  • The -c PHASE / --count-fails PHASE option can be used to suppress line-by-line output for the given phase (e.g., NLCOMP or BASELINE), instead just giving a count of the number of non-PASSes (FAILs or PENDs) for that phase. This is useful when you expect failures for some phases – often, phases related to baseline comparisons. This option can be specified multiple times.

So a typical use of cs.status.TESTID will look like this:

./cs.status.20170926_093725_gq431o -f --count-performance-fails

or, if you expect NLCOMP and BASELINE failures:

./cs.status.20170926_093725_gq431o -f --count-performance-fails -c NLCOMP -c BASELINE

In addition to running your own individual tests or test suites, you can also use create_test to run a pre-defined test suite. Most CESM components have a policy that a particular test suite must be run before changes can be merged back to the master branch. These test suites are defined in xml files in each component.

To determine what pre-defined test suites are available and what tests they contain, you can run cime/scripts/query_testlists (run query_testlists -h for usage information).

Pre-defined test suites are selected in create_test via three xml attributes:

  • The test category, specified with --xml-category (e.g., --xml-category aux_clm; see Test categories for other options)
  • The machine, specified with --xml-machine (e.g., --xml-machine cheyenne)
  • The compiler, specified with --xml-compiler (e.g., --xml-compiler intel) (although it's also possible to leave this out and run all tests for this category and machine in a single test suite)

So a component's testing policy may state something like: You must run the tests from the aux_clm category for these machine/compiler combinations: cheyenne/intel, cheyenne/gnu, hobart/nag and hobart/pgi.

So, for example, to run the subset of the aux_clm test suite that runs on cheyenne with the intel compiler, you can run:

./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel

The -r option described in Options to create_test is particularly useful here for putting all of the tests in the test suite together in their own directory.

create_test uses multiple threads aggressively to speed up the process of setting up and building all of the cases in your test suite. On a shared system, this can turn you into a bad neighbor and get you in trouble with your system administrator. If possible, you should submit the create_test job to a compute node rather than running it on the login node. CTSM's run_sys_tests command does this automatically for you on our main test machines; see Running test suites with the run_sys_tests wrapper for details.

If you can't build the test suite on compute nodes, here are some helpful tips on running large test suites on the login node:

  • It's a good idea to run create_test with the unix nohup command in case you lose your connection.
  • Run create_test with the unix nice command to give it a lower scheduling priority.
  • Specify a smaller number of parallel jobs via the --parallel-jobs option to create_test (the default is the number of cores available on a single node of the machine).

Putting this all together, a typical create_test command for running a pre-defined test suite might look like this:

nohup nice -n 19 ./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel -r /glade/scratch/$USER/HELPFULLY_NAMED_SUBDIRECTORY --parallel-jobs 6

Testing that various configurations run to completion and that given variations are bit-for-bit with each other can only take you so far. The strongest tool we have for determining that your changes haven't broken anything is the baseline comparison: comparing the output from the current version of the code against the output from a previous version to determine whether answers have changed at all in the new version.

Depending on what you have changed, you may expect:

  1. No answer changes, e.g., if you are doing an answer-preserving code refactoring, or adding a new option but not changing anything with respect to existing options
  2. Answers change only for certain configurations, e.g., if you change CTSM-crop code, but don't expect any answer changes for runs without the crop model
  3. Answers change for most or all configurations, but only in a few diagnostic fields that don't feed back to the rest of the system
  4. Answers change for most or all configurations

You may think that most changes fall into (4). With some care, however, it is often possible to separate large changes to the model science into:

  • Bit-for-bit modifications that can be tested against baselines - e.g., renaming variables and moving code around, either before or after your science changes
  • Answer-changing modifications; try to make these as small as possible (in terms of lines of code changed) so that they can be more easily reviewed for correctness.

You should then run the test suite separately on these two classes of changes, ensuring that the parts of the change that you expect to be bit-for-bit truly are bit-for-bit. The effort it takes to do this separation pays off in the increased confidence that you haven't introduced bugs.

First, you need to determine what to use as a baseline. Generally this is the version of master from which you have branched, or a previous, well-tested version of your branch.

If you're comparing against a version of master and have access to the main development machine(s) for the given component, then baselines may already exist. (e.g., on cheyenne, baselines go in /glade/p/cgd/tss/ctsm_baselines by default). Otherwise, you'll need to generate your own baselines.

If you need to generate baselines, you can do so by:

  • Checking out the baseline code version
  • Running create_test from the baseline code with these options:
    • --baseline-root /PATH/TO/BASELINE/ROOT: Specifies the directory in which baselines should be placed. This is optional, but is needed if you don't have write access to the default baseline location on this machine.
    • --generate GENERATE_NAME: Specifies a name for these baselines. Baselines for individual tests are placed under /PATH/TO/BASELINE/ROOT/GENERATE_NAME. For example, this could be a tag name or an abbreviated git sha-1.
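
Putting these together, a baseline-generation run for a pre-defined test suite might look like this (the baseline root and baseline name are illustrative):

```shell
# Illustrative: generate baselines for a pre-defined test suite,
# using a hypothetical baseline root and baseline name.
./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel \
    --baseline-root /glade/scratch/$USER/my_baselines \
    --generate ctsm_mybranch_base
```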

If you're generating baselines for a full test suite (as opposed to just one or a few tests of your choosing), you may have to run multiple create_test invocations, possibly on different machines, in order to generate a full set of baselines. Each component has its own policies regarding the test suite that should be run for baseline comparisons.

After the test suite finishes, you can check the results as normal. Now, though, you should see an extra line in the TestStatus files or the output from cs.status, labeled GENERATE. A PASS status for this phase indicates that files were successfully copied to the baseline directory. You can confirm this by looking through /PATH/TO/BASELINE/ROOT/GENERATE_NAME: There should be a directory for each test in the test suite, containing history files, namelist files, etc.

Comparison against baselines is done similarly to generation (as described in Baseline comparisons step 2: Generate baselines, if needed), but now you should use the --compare COMPARE_NAME flag to create_test. You should still specify --baseline-root /PATH/TO/BASELINE/ROOT. You can optionally specify --generate GENERATE_NAME as well, but if you do, make sure that GENERATE_NAME differs from COMPARE_NAME! (In this case, create_test will compare against some previous baselines while also generating new baselines for later use.)
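
A compare-and-generate invocation might look like this (the baseline root and both baseline names are illustrative):

```shell
# Illustrative: compare against existing baselines while also
# generating new baselines under a different (hypothetical) name.
./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel \
    --baseline-root /glade/scratch/$USER/my_baselines \
    --compare ctsm_mybranch_base \
    --generate ctsm_mybranch_new
```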

After the test suite finishes, you can check the results as normal. Now, though, you should see an extra line in the TestStatus files or the output from cs.status, labeled BASELINE. A PASS status for this phase indicates that all history file types were bit-for-bit identical to their counterparts in the given baseline directory. (For each history file type - e.g., cpl hi, clm h0, clm h1, etc. - comparisons are just done for the last history file of that type.)

Checking the results of failed baseline comparisons is similar to checking the results of failed in-test comparisons. See Finding more details on failed comparisons for details. However, whereas failed in-test comparisons are put in a file named *.nc.base.cprnc.out, failed baseline comparisons are put in a file named *.nc.cprnc.out (without the base; yes, this is a bit counter-intuitive).

If you expect differences in just a small number of tests or a small number of diagnostic fields, you can confirm that the differences in the baseline comparisons are just what you expected. The tool cime/CIME/non_py/cprnc/summarize_cprnc_diffs facilitates this; run cime/CIME/non_py/cprnc/summarize_cprnc_diffs -h for details.

In addition to the baseline comparisons of history files, comparisons are also performed for:

  • Namelists (NLCOMP). For details on a NLCOMP failure, see TestStatus.log
  • Model throughput (TPUTCOMP). However, note that system variability can cause this to fail even when there isn't a real problem.

It sometimes happens that you want to generate or compare baselines from an already-run test suite. Some reasons this may happen are:

  • You forgot to specify --generate or --compare when you ran the test suite.
  • You wanted to wait to see if the test suite was successful before generating baselines.
  • You ran baseline comparisons against one set of baselines, but now want to run comparisons against a different set of baselines.

There are two complementary tools for doing this:

  • cime/CIME/Tools/bless_test_results: after-the-fact baseline generation
  • cime/CIME/Tools/compare_test_results: after-the-fact baseline comparison

The usage messages for these are a bit confusing, due to the different workflows used in ACME vs. CESM. A typical usage of compare_test_results for CESM would look like this:

./compare_test_results -b BASELINE_NAME --baseline-root BASELINE_ROOT -r TEST_ROOT -t TEST_ID

where:

  • -b BASELINE_NAME (or --baseline-name BASELINE_NAME) corresponds to --compare COMPARE_NAME for create_test
  • --baseline-root corresponds to the same argument for create_test
  • -r (or --test-root) corresponds to the same argument for create_test
  • -t TEST_ID (or --test-id TEST_ID) is either the test-id you specified with the -t (or --test-id) argument to create_test, or the auto-generated test-id that was appended to each of your tests (a date and time stamp followed by a string of random characters)
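Putting the above arguments together, a concrete invocation might look like this sketch (the baseline tag, paths, and test-id below are made-up examples; substitute your own values):

```
./compare_test_results -b ctsm5.1.dev100 \
  --baseline-root /PATH/TO/BASELINE/ROOT \
  -r /PATH/TO/TEST/ROOT -t 20240529_123456_ab12cd
```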

To make it easier and less error-prone to run a suite of system tests, we have put together the run_sys_tests script, which can be found at the top level of a CTSM checkout. This is a wrapper to one or more invocations of create_test, so all of the above information still applies.

Major benefits of using this wrapper script are:

  • You don't need to know the set of compilers a test suite is defined for on our main test machines: just run this wrapper script and it will run create_test on all defined compilers for the given test suite and machine.
  • On our main test machines, multiple create_test invocations are submitted as separate jobs to the compute nodes.
  • Sensible defaults are chosen for the testroot directory and test ID of a test suite. In addition, a symbolic link is made from the current directory to the testroot directory, making it easier to find the testroot directory later.
  • Custom cs.status scripts are created that add arguments to aggregate across all tests in a full test suite and filter out pass results.
  • Extra error-checking is done, such as making you explicitly state whether you want to compare against and/or generate baselines.
  • Useful information is output to both the screen and a file in the test directory (SRCROOT_GIT_STATUS) giving a variety of git information about your current directory.

The primary purpose of this script is to assist with running full test suites, such as the aux_clm and clm_short test suites (via the -s / --suite-name argument). However, it can also be used to run individual tests (via the -t / --testname argument) or all tests listed in a plain text file (via the -f / --testfile argument).

Typical usage of this script is simply:

./run_sys_tests -s SUITE_NAME -c COMPARE_NAME -g GENERATE_NAME [--baseline-root /PATH/TO/BASELINE/ROOT]

For example, to run the aux_clm test suite, replace SUITE_NAME with aux_clm (similarly for clm_short, fates, etc.; see Test categories for other options). This automatically detects the machine and launches the appropriate components of the given test suite on that machine. A symbolic link will be created in the current directory pointing to the testroot directory that contains all of the test directories in the test suite. (The path to this directory is also output to the screen.)

Note that the -c / --compare and -g / --generate arguments are required, unless you specify --skip-compare and/or --skip-generate.

The --baseline-root argument is optional, but is needed if you are generating baselines and don't have write access to the default baseline location on this machine.
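As a concrete illustration of the typical usage above, a run of the aux_clm suite might look like this (the tag names here are hypothetical; use the actual previous and new tag names for your testing):

```
./run_sys_tests -s aux_clm -c ctsm5.1.dev100 -g ctsm5.1.dev101
```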

This can also be used to run tests listed in a text file (via the -f / --testfile argument), or tests listed individually on the command line (via the -t / --testname argument).

For any SUITE_NAME that runs the Python system tests, see an additional requirement in Pre-merge system testing.

After running the run_sys_tests command, you will see output describing a variety of git information about your current directory, ending with the git-fleximod status. Then run_sys_tests will exit. This is normal, correct operation: depending on the machine, run_sys_tests will either submit the create_test jobs to the batch queue or run them in the background.

Run ./run_sys_tests -h for more details.

As noted in Checking the results of a test suite, you can run a cs.status.TESTID command to see the results of all tests in a test suite. run_sys_tests also creates two additional cs.status files to make it quicker and easier to parse the results from a test suite:

  • cs.status (only created with the -s / --suite-name argument): aggregates across all test IDs in this test suite, rather than requiring you to run a separate cs.status.TESTID command for each compiler.
  • cs.status.fails (created with all modes of operation): adds options to show only test failures (-f) and to suppress line-by-line output of performance failures, instead just giving a summary of these failures at the bottom (--count-performance-fails). With the -s / --suite-name argument, cs.status.fails also aggregates across all test IDs, as for cs.status.

Both versions also include expected failure integration, as described in Expected test failures. These versions of cs.status also accept the -c / --count-fails argument described in Checking the results of a test suite.
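A typical way to use these scripts after a suite finishes is the following sketch (the testroot path is a placeholder):

```
cd /PATH/TO/TESTROOT
./cs.status.fails       # failures only, aggregated across all test IDs
./cs.status.fails -c    # additionally count fails in each phase
```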

Here are some general tips for running test suites:

  • It is very important to not change anything in your CTSM directory (i.e., your git clone) once you start the test suite, until all tests in the test suite finish running.
  • On cheyenne, set the PROJECT environment variable in your shell startup file, or use some other mechanism to specify a default project / account code to cime. This way, you won't need to add the --project argument every time you run create_test, run_sys_tests, or create_newcase.

This section is for those who have been asked to do final system testing on a branch before merging it into the CTSM master branch (or another tightly-controlled branch like one of the release branches).

To ensure that the Python system tests included in the system testing pass, we recommend the following steps in your ctsm directory in the same terminal where you will run run_sys_tests:

> ./py_env_create # you may not need to rerun if you have run this before
> module unload python
> module load conda
> conda activate ctsm_pylib
> module load nco

The following tests should be run:

  • run_sys_tests -s aux_clm -c PREVIOUS_TAG -g NEW_TAG on cheyenne
  • run_sys_tests -s aux_clm -c PREVIOUS_TAG -g NEW_TAG on izumi

These take a few hours to run, and the cheyenne test suite costs a few thousand core-hours.

If you don't have permissions to create a new directory in the baseline directory space on a machine (/glade/p/cgd/tss/ctsm_baselines on cheyenne and /fs/cgd/csm/ccsm_baselines on izumi), you can:

  1. Make your own ctsm_baselines directory in a space you control (I recommend using your scratch directory)
  2. Make a symbolic link to the previous tag's baselines in the above directory (e.g., ln -s /glade/p/cgd/tss/ctsm_baselines/ctsm1.0.dev010 /glade/scratch/$USER/ctsm_baselines/ctsm1.0.dev010)
  3. When running run_sys_tests, point to your ctsm_baselines directory via the --baseline-root argument.
  4. When system testing is done, ask someone with permission to copy the generated baselines to the official baseline location.
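The steps above can be sketched as shell commands. The official baseline path and tag name below come from the example in the text; the scratch location is illustrative (a temporary directory is used as a fallback here), so adjust both for your machine:

```shell
# Step 1: make your own ctsm_baselines directory in a space you control
# (here under $SCRATCH, with mktemp -d as an illustrative fallback).
MY_BASELINES=${SCRATCH:-$(mktemp -d)}/ctsm_baselines
mkdir -p "$MY_BASELINES"

# Step 2: symbolically link the previous tag's official baselines into it
# (ctsm1.0.dev010 is the example tag from the text above).
OFFICIAL=/glade/p/cgd/tss/ctsm_baselines
ln -sfn "$OFFICIAL/ctsm1.0.dev010" "$MY_BASELINES/ctsm1.0.dev010"

# Step 3: point run_sys_tests at your directory, e.g.:
#   ./run_sys_tests -s aux_clm -c ctsm1.0.dev010 -g NEW_TAG \
#       --baseline-root "$MY_BASELINES"
```

After step 4 in the text (copying generated baselines to the official location), later integrators can again compare against the official directory directly.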

If you don't need to do baseline generation yet, then use the --skip-generate option instead of -g.

When you run run_sys_tests, the test directories will be placed in a top-level directory in your scratch space with a name that begins with tests_. The path to this top-level directory will be output to the screen when you run run_sys_tests, and a symbolic link will be placed in the directory from which you invoked this command.

To check the test results, run the ./cs.status.fails script in the top-level test directory. This will show you just the failed tests. For more information, see Parsing test suite results and Checking the results of a test suite.

Although we do our best to keep all of the system tests passing, there are typically a few that are expected to fail at any given time. So, before you spend time looking into a failure from a test suite, you should check to see if it is in the list of expected failures. If you are using the cs.status or cs.status.fails scripts created by run_sys_tests, then expected failures are noted for you in the test results. Otherwise, read the rest of this section to learn how to find expected failures manually.

The list of expected failures is maintained under the CTSM checkout, at cime_config/testdefs/ExpectedTestFails.xml. Search for the failing test and see if it appears there; if so, confirm that it is failing in the same phase as before. For example, if you see:

<test name="ERS_Lm20_Mmpi-serial.1x1_smallvilleIA.I2000Clm50BgcCropGs.cheyenne_gnu.clm-monthly">
  <phase name="RUN">
    <status>FAIL</status>
    <issue>#158</issue>
  </phase>
</test>

then a FAIL in the RUN phase for this test is acceptable, but a failure in an earlier phase (such as during the build) would indicate a new problem.

Note that, if a test is expected to FAIL in the RUN phase, you might also see a PEND result for another phase, like COMPARE_base_rest; this is not a problem.

It's also possible that a previously-failing test is now passing. If so, the test should probably be removed from the expected fails list (unless the issue is that the test fails only sporadically). One common symptom of a newly-passing test is a BFAIL in the BASELINE comparison phase with no hist files in the baseline directory; if you see that, check whether the test was in the expected fails list. If a test is newly passing, consider removing it from ExpectedTestFails.xml and marking the relevant issue as resolved. Check with other integrators / reviewers if you are unsure whether to do this.

Some of the test categories used in CTSM are:

  • CTSM-specific test lists
    • aux_clm: These tests should be run before merging a branch to master or a release branch.
    • clm_short: This is a small subset of aux_clm that can be run frequently in the course of working on changes.
      • All tests in this list should also appear in the aux_clm list to ensure that baselines exist for all tags.
    • fates: Additional tests run by FATES developers
  • CESM test lists
    • prealpha: These tests are run before making a CESM alpha or beta tag.
      • All tests in this list should also appear in the aux_clm list (or at least have a very similar test in aux_clm) to prevent surprises in CESM alpha testing.
    • prebeta: These tests are run before making a CESM beta tag.
      • All tests in this list should also appear in the aux_clm list (or at least have a very similar test in aux_clm) to prevent surprises in CESM beta testing.
      • prealpha tests do NOT need to be repeated here, since any CESM beta tag also has the prealpha test suite run on it.
    • aux_cime_baselines: These tests are run frequently (e.g., nightly) to ensure that changes to cime do not change answers unexpectedly.
      • This should be a small list of tests (3-4 tests defined by each component). (We want this to stay small since this list is run frequently. So it should cover the most important configurations, but won't cover everything.) Because the main purpose is baseline comparisons, all tests can be basic smoke (SMS) tests. All tests should be on the same machine/compiler (currently cheyenne_intel). Because the purpose is testing cime, tests in this test list should be chosen to exercise different cime options, such as different time periods and/or datm modes.
      • It is common for people to run the prealpha test list on their cime branch to make sure they haven't broken anything before merging a big set of changes to master. Thus, to ensure that that manual testing includes any important baseline comparisons, all tests in aux_cime_baselines should have close counterparts in prealpha. (In many cases, the prealpha test will be an expanded form of the aux_cime_baselines test - e.g., an ERP_Ld10 test rather than a SMS_Ld3 test.)