CAM debugging techniques

cacraigucar edited this page Oct 16, 2024 · 16 revisions

This page contains techniques which are useful when the model is crashing. Note that this is not a page about using parallel debuggers since availability is often limited.

Debugging Techniques

First Steps

1. Build and run the model with compiler debug flags enabled:

If you are running into either build or run-time errors that you don't understand, then first try building and running the model with compiler debug flags enabled. This can be done by setting DEBUG to TRUE in env_build.xml in your case directory, and then re-building and re-running the model.
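As a rough sketch, the DEBUG rebuild can be wrapped in a small shell helper. The function name rebuild_with_debug is hypothetical, and the clean step (./case.build --clean-all) is an assumption about what "re-building" needs after a DEBUG change; adjust it to your workflow.

```shell
# Hypothetical helper: switch a case to a DEBUG build and rebuild it.
#   $1 = path to the case directory
# Runs in a subshell so the caller's working directory is not changed.
rebuild_with_debug() (
    cd "$1" || exit 1
    ./xmlchange DEBUG=TRUE || exit 1       # sets DEBUG in env_build.xml
    ./case.build --clean-all || exit 1     # assumed: discard objects built without debug flags
    ./case.build
)
```

After it succeeds, re-run the model from the case directory as usual (for example with ./case.submit).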

The model should still fail, but the information in the logs should be more informative, and may even point you to the exact line of code that has an error. Please note that if the model fails at run-time this "stack trace" to the problematic line is likely going to be in the cesm.log.XXXX file, which is usually located in your case's run directory.

Finally, if you are using the Intel compiler then you may receive a specific error code associated with your runtime failure. A description of what that error code means can be found on Intel's website.
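One possible way to find the relevant lines quickly is to search the log for common failure markers. The helper below is only a sketch: the function name and the pattern list are assumptions (Intel runtime messages typically begin with "forrtl"), so extend the patterns for your compiler and MPI library.

```shell
# Hypothetical helper: print line-numbered matches for common failure markers.
#   $@ = one or more log files (e.g. cesm.log.* in the run directory)
show_failure_lines() {
    grep -n -i -E 'forrtl|traceback|backtrace|segmentation fault|endrun' "$@"
}
```

For example, show_failure_lines cesm.log.* run from the run directory lists each matching line with its line number (and its file name when several logs are given).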

2. Try a different compiler, if possible:

If running with debug flags did not help you track down the issue, then, if possible, try running with a different Fortran compiler, ideally with DEBUG still set to TRUE. Often one compiler will report an error that another compiler simply ignores or tries to manage on its own. The second compiler might even generate a proper stack trace where the first failed to do so.

The easiest way to change your compiler (when on a supported machine) is to create a new CAM case using create_newcase with the --compiler flag that specifies which compiler to use. If you aren't sure which compilers are available on your machine, then run the command:

query_config --machines

and search for your particular machine's name, which should include a "compilers" line that lists all available compiler options. This command will be located in the same location as create_newcase. Again, please note that these instructions will only work for machines where CAM, CESM, or CIME have been properly ported.

3. Are your code changes being compiled?

Sometimes, it seems like you make a change to your code but nothing changes in the log or the output. Often, this is because your new code is not being compiled into the model. To check this, find these three pieces of information:

  • In your case directory, look at your CaseStatus file. Search from the bottom for the message, "case.build success". Record the date on that line.
  • Record the modification date of the source file you modified (ls -l <file>).
  • Also in your case directory, find the source directory (./xmlquery SRCROOT).

If the date / time of your source is newer than the date / time in your CaseStatus, you may just need to rebuild your case (./case.build). Depending on what changes you have made, you might also need to do a full rebuild (rm -rf bld; ./case.setup --reset; ./case.build). Make sure you are building the correct code (see next item).

If the CAM root directory containing your modified source does not match the SRCROOT directory, then you are modifying the wrong file (yeah, we've all done that). Either create a case from the right source directory or move your modifications into the SRCROOT directory.
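The date comparison described above can be sketched as a small shell helper. This is an illustration only: the function name is hypothetical, it assumes GNU date and stat, and it assumes each CaseStatus line begins with a "YYYY-MM-DD HH:MM:SS" timestamp.

```shell
# Hypothetical helper: succeeds (exit 0) if the source file was modified after
# the last "case.build success" entry in CaseStatus.
#   $1 = modified source file, $2 = path to CaseStatus
# Returns 2 if no successful build is recorded or a timestamp cannot be read.
modified_since_build() {
    src_file=$1
    casestatus=$2
    # Timestamp of the most recent successful build (first 19 characters of the line)
    last_build=$(grep 'case.build success' "$casestatus" | tail -n 1 | cut -c 1-19)
    [ -n "$last_build" ] || return 2
    build_epoch=$(date -d "$last_build" +%s) || return 2
    src_epoch=$(stat -c %Y "$src_file") || return 2
    [ "$src_epoch" -gt "$build_epoch" ]
}
```

A typical use from the case directory would be modified_since_build <file> CaseStatus && echo "source is newer than the last build". Remember to also compare the file's location against the output of ./xmlquery SRCROOT.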

CAM Snapshot

cam_snapshot is a set of routines which will write out all of the fields in state, constituents, tend, ptend, cam_in, cam_out and pbuf along with a few fields that are just local to tphysac and tphysbc. The times that these fields are written out are controlled by the "cam_snapshot_before" and "cam_snapshot_after" types of variables. "cam_snapshot_before" variables are used to capture the model variables before a particular physics parameterization is called and "cam_snapshot_after" is used to capture variables after the parameterization. cam_snapshot is controlled by four namelist variables:

  • cam_snapshot_before_num - the output file number for the before snapshots (for example, setting to 6 will result in the values being written to the h5 file)
  • cam_snapshot_after_num - the output file number for the after snapshots (for example, setting to 7 will result in the values being written to the h6 file)
  • cam_take_snapshot_before - the name of the parameterization before which all fields will be output
  • cam_take_snapshot_after - the name of the parameterization after which all fields will be output

In addition, a user will almost always want the information written out on every time step, so the elements of nhtfrq that correspond to the two snapshot files should be set to 1 in user_nl_cam.

If the model is crashing, also set the corresponding elements of mfilt to 1 in user_nl_cam, so that each snapshot file holds a single time sample.

If the cam_take_snapshot_before and cam_take_snapshot_after are set to the same parameterization, then the changes made by that particular parameterization are isolated. If they are set to different parameterizations, then the values will be output before the parameterization specified by cam_take_snapshot_before is called and after the cam_take_snapshot_after parameterization completes.
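To make the settings concrete, here is a sketch that appends a snapshot configuration to user_nl_cam. The helper name is hypothetical, the parameterization name is passed in by the caller and must be one your CAM version accepts, and the first five nhtfrq and mfilt entries are placeholders (assumed values), so keep your own values for those history files.

```shell
# Hypothetical helper: append cam_snapshot settings to a user_nl_cam file.
#   $1 = path to user_nl_cam
#   $2 = parameterization name to isolate (same name before and after)
# File numbers 6 and 7 follow the examples in the list above (h5 and h6 files).
# Entries 1-5 of nhtfrq and mfilt are placeholders, not recommendations.
append_snapshot_settings() {
    nl_file=$1
    param=$2
    cat >> "$nl_file" <<EOF
cam_snapshot_before_num  = 6
cam_snapshot_after_num   = 7
cam_take_snapshot_before = '$param'
cam_take_snapshot_after  = '$param'
nhtfrq = 0, -24, -24, -24, -24, 1, 1
mfilt  = 1, 30, 30, 30, 30, 1, 1
EOF
}
```

Run it once from the case directory, for example append_snapshot_settings user_nl_cam <parameterization>, and check the resulting file by hand before building the namelist.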

Stopping the model

The easiest way to stop CAM is to set STOP_OPTION and STOP_N to values which stop the model after a particular time step. However, if something is causing the model to crash, it can be helpful to stop the model at a particular point in the code. This can be done by inserting a call to endrun into the code.

First, ensure that endrun is imported. Make sure this statement is in the subroutine or module where you want to call endrun:

   use cam_abortutils,  only: endrun

Then, simply insert a call to endrun in the desired location:

   call endrun('Stopping CAM')

This message will show up in the atm.log.######.<machine>.<date>-<time> file if the masterproc (MPI task 0) hits this call and in the cesm.log.######.<machine>.<date>-<time> file for most other MPI tasks (sometimes the program quits before all messages are written to the log file).

To make the endrun message more useful, create a message:

   ! Add this statement where the other routine variables are declared
   character(len=256) :: error_msg
   [ . . . other statements . . .]
   ! Add these statements where you want to stop the model, add formatting in place of * if desired
   write(error_msg, *) 'Stopping the model because X (', x, ') is > 1234.0'
   call endrun(trim(error_msg))

The endrun call can even be added conditionally:

   if (x > 1234.0_r8) then
      write(error_msg, *) 'Stopping the model because X (', x, ') is > 1234.0'
      call endrun(trim(error_msg))
   end if

or

   if (nstep >= 12) then
      call endrun('Stopping the model at nstep >= 12')
   end if

cam_pio_dump_field

cam_pio_dump_field is a routine which immediately writes a NetCDF file with information from a field. For example:

   call cam_pio_dump_field('CLD', 1, pcols, 1, pver, cld)

will write the field, cld, to a file called CLD_dump_<##>.nc where <##> is a number starting at one and increasing as this call is repeated. The file simply contains the contents as a 3-dimensional array where the first two dimensions are given by the bounds (1:pcols and 1:pver) and the third dimension is the MPI task number (1:npes).

cam_pio_dump_field can also handle 3, 4, and 6-dimensional fields; just call the routine with the appropriate number of bounds for the field.

Note that, by default, cam_pio_dump_field collects the bounds from all MPI tasks and uses the largest range for the NetCDF file. To skip this step, pass the optional argument compute_maxdim_in with a value of .false. in the call.

pbuf_dump_pbuf

pbuf_dump_pbuf is similar to cam_pio_dump_field in that it immediately writes NetCDF files. The main difference is that it cannot be called from a threaded region and requires access to the full pbuf (aka the pbuf2d variable). The call is:

   call pbuf_dump_pbuf(pbuf2d, name, num)

where pbuf2d is the full pbuf, name is an optional name to be added to each filename, and num is an optional integer to be added to each filename.

pbuf_dump_pbuf then writes a NetCDF file for each field in the pbuf for this run. The file format is the same as for cam_pio_dump_field (see above).

Weird Error Messages

Invalid argument - setClock timeStep=10800s is not a divisor of runDuration=16200s

This is a common problem with short runs. The error message is a hint that there is a timing mismatch. Often the source is that, by default, the runoff model and/or the land ice model runs much less frequently than the atmosphere, and the run length is not an integer multiple of the time step of the slowest component model.

The solution is to choose a run length that allows an integer number of component runs for every component (e.g., ./xmlchange STOP_OPTION=nsteps,STOP_N=48 when the atmosphere timestep is 30 minutes) or turn up the run frequency of the other components (./xmlchange ROF_NCPL=48,GLC_NCPL=48).
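The arithmetic behind the error can be checked with a few lines of shell. This is a sketch under stated assumptions: the helper names are hypothetical, the NCPL base period is assumed to be one day (86400 s), and ROF_NCPL=8 is assumed as the setting that yields the 10800 s time step in the message.

```shell
# Seconds between coupler calls for a component, given its NCPL value.
# Assumes the NCPL base period is one day (86400 s).
coupling_interval_s() {
    echo $(( 86400 / $1 ))
}

# Succeeds (exit 0) if a run of $1 seconds is a whole number of $2-second steps.
is_whole_multiple() {
    [ $(( $1 % $2 )) -eq 0 ]
}

atm_dt=1800                          # 30-minute atmosphere time step
rof_step=$(coupling_interval_s 8)    # 10800 s (ROF_NCPL=8 is an assumption)

# 9 atmosphere steps = 16200 s, the runDuration from the error message
is_whole_multiple $(( 9 * atm_dt )) "$rof_step" || echo "9 steps (16200 s): not a multiple of $rof_step s"
# 48 atmosphere steps = 86400 s, one full day
is_whole_multiple $(( 48 * atm_dt )) "$rof_step" && echo "48 steps (86400 s): OK"
```

The first check fails because 16200 s leaves a remainder of 5400 s after one 10800 s step; the second passes because 86400 s is exactly eight such steps.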

BTW, an easy way to get a snapshot of the various run frequencies is ./xmlquery --partial NCPL.

ERROR: Error gathering provenance information from manage_externals (OLDER VERSIONS OF CAM ONLY)

manage_externals error message: ERROR:root:SVN returned invalid XML message

This usually happens when running case.submit.

The problem is usually a communication issue with the CGD svn server, and the fix is to remove the entry for chem_proc from Externals_CAM.cfg.