zppy debugging guide for developers #573

forsyth2 · 2024-04-11T22:38:30Z

forsyth2
Apr 11, 2024
Maintainer

zppy has many complex inter-connected pieces. It can therefore be a challenging package to develop and debug (and debugging zppy could well mean debugging a package it calls).

This guide aims to be a starting point for developers trying to debug zppy.

Check if the issue has come up before

If you have a specific line of code to search for (e.g. Error: ...), you can use the GitHub search bar in the upper right hand corner of the zppy repo to search for that line. It may have appeared in previous issues/PRs/discussions.

For more complicated questions, you can also look through the discussions page.

`zppy` skipped a job it shouldn't have

zppy will report what dependency is missing when it skips a job. Look at your cfg to determine why zppy is looking for that dependency. See #544 (comment) for an in-depth look at zppy dependency handling.

A job may also be skipped because its status file says "RUNNING", "WAITING", or "OK". Usually that means this job really shouldn't re-run anyway. However, it may be the case that you found a bug for which zppy doesn't exit with an error code. In these cases, simply delete the status file and re-run zppy.
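To see at a glance which jobs would be skipped for this reason, something like the following sketch may help. It assumes status files end in `.status`, live in the scripts directory, and start with the status word; the function names are made up for illustration and are not part of zppy.

```python
from pathlib import Path

# Statuses that, per the text above, cause zppy to skip a job.
SKIP_STATUSES = {"RUNNING", "WAITING", "OK"}


def read_status(status_file: Path) -> str:
    """Return the first word of a status file, or "" if the file is empty."""
    words = status_file.read_text().split()
    return words[0] if words else ""


def jobs_that_would_be_skipped(scripts_dir: str) -> dict[str, str]:
    """Map status file name -> status, for statuses that trigger a skip.

    Assumption: status files are named `*.status` and sit directly in scripts_dir.
    """
    skipped = {}
    for status_file in sorted(Path(scripts_dir).glob("*.status")):
        status = read_status(status_file)
        if status in SKIP_STATUSES:
            skipped[status_file.name] = status
    return skipped
```

Calling `jobs_that_would_be_skipped("<output directory from cfg>/post/scripts")` returns a dictionary you can inspect before deciding which status files to delete by hand. The sketch deliberately does not delete anything itself.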

Identifying errors in `.o` files

From #291 & https://e3sm-project.github.io/zppy/_build/html/main/tutorial.html#debugging-failures:

$ cd <output directory from cfg>/post/scripts # This is the directory where all the bash scripts `zppy` generated are
$ grep -v "OK" *status # See what failed
# Review `.o` files corresponding to failed `.status` files.
# If an error is obvious, make a fix in the bash file and rerun:
$ sbatch <failed job>.bash # re-run just the bash file itself. There's no need to re-run zppy here. 
# If the error is not obvious, do the following:
$ emacs <failed job>.bash
# In this file, set `debug = True`. This will provide more information.
# Alternatively, you can set `debug = True` in your `cfg` and rerun `zppy`.
$ sbatch <failed job>.bash
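If there are many jobs, the first two steps above can be roughly scripted. This is only a sketch under assumptions: it assumes status files are named `<job>.status` and that the matching log files are named `<job>.o<something>` in the same directory. `failed_jobs` and `log_tail` are hypothetical helper names, not zppy functions.

```python
import glob
from pathlib import Path


def failed_jobs(scripts_dir: str) -> list[tuple[str, str]]:
    """Return (job prefix, status text) for each status file without "OK".

    Roughly mirrors `grep -v "OK" *status`; an empty status file is also reported.
    """
    failures = []
    for status_file in sorted(Path(scripts_dir).glob("*.status")):
        text = status_file.read_text().strip()
        if "OK" not in text:
            failures.append((status_file.stem, text))
    return failures


def log_tail(scripts_dir: str, job_prefix: str, n_lines: int = 20) -> dict[str, list[str]]:
    """Return the last n_lines of every log file named <job_prefix>.o*.

    Assumption: the `.o` files share the status file's prefix.
    """
    tails = {}
    pattern = glob.escape(job_prefix) + ".o*"
    for log_file in sorted(Path(scripts_dir).glob(pattern)):
        lines = log_file.read_text(errors="replace").splitlines()
        tails[log_file.name] = lines[-n_lines:]
    return tails
```

Looping over `failed_jobs(...)` and printing `log_tail(...)` for each prefix gives a quick first look at the errors. You would still open the full `.o` file when the tail is not enough.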

The error is in a prior task

For example, if you realize your error in global_time_series is really because of an error in ts, then you'll need to fix the ts task and then re-run zppy. It's recommended to either delete the old output and www directories or set them to a new path so you know you aren't re-using old output.

Reduce the number of jobs you have zppy launch by deleting or commenting out everything in your cfg that's not involved with the debugging. (E.g., if you're debugging global_time_series, you may need to re-run the ts task dependencies, but you don't need the climo task to re-run).

The error is in a package zppy calls

#570 provides a chart of which tasks use which packages. If the bug is ultimately in another package, then that package needs to be updated. Then, you can use environment_commands (or e3sm_to_cmip_environment_commands) to set a different environment so you can test zppy with the fixed version of the underlying package. Directions on how to do this can also be found in #570.

Could the problem be environment, data, or machine (rather than `zppy` itself)?

Does this problem resolve when...

Using a fresh environment? It is easy to accidentally run zppy tasks in the wrong environment, either by forgetting to set environment_commands accordingly or by forgetting to run pip install . to apply the latest changes. So, double-check your cfg to confirm environments are set correctly and/or try creating a new dev environment:

mamba clean --all
mamba env create -f conda/dev.yml -n zppy_dev_<date or issue #>
conda activate zppy_dev_<date or issue #>
pip install .

Using a different simulation as the input? Perhaps the data, not the code, is faulty. Or if one dataset works and another doesn't, that may tell us something about where the code may be broken.
Using a different machine? Perhaps the issue only arises on a particular machine.
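For the environment question in particular, a quick check of which interpreter and package versions are actually in use can rule out a mix-up. The sketch below uses only the Python standard library. It assumes the installed distribution is named `zppy`, and `describe_environment` is a made-up helper name.

```python
import sys
from importlib import metadata


def describe_environment(packages=("zppy",)):
    """Report the running interpreter and the installed version of each package."""
    report = {"python_executable": sys.executable, "env_prefix": sys.prefix}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report
```

Running `print(describe_environment())` inside the environment a task uses shows where Python lives and which zppy version it sees. A version string alone may not tell you whether `pip install .` was re-run after your latest edits, so treat it as a first check only.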

Learning more about the data you have as input

You can run ncdump -h <file-name> to get a summary of data in files underneath your input directory.

ncdump -h <file-name> | grep float will show you the float variables defined in the file.

ncdump -h <file-name> | grep -E "float (var1|var2|...|varN)\(" will show you whether specific float variables are defined in the file.

It may be the case that the variables you're trying to process aren't even defined in your input file. In that case, the problem is with the data you're using -- either you need to find simulation output with the required variables or you need to remove the variables from your processing list (e.g., vars, plots_atm).
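When the list of variables is long, the same check can be done in Python on the text that `ncdump -h` prints. This is a sketch, not part of zppy: the helper names are hypothetical, and like the grep above it only recognizes `float` variables declared with dimensions.

```python
import re
import subprocess

# Matches declarations such as "float TREFHT(time, ncol) ;" in `ncdump -h` output.
FLOAT_DECL = re.compile(r"^\s*float\s+(\w+)\s*\(", re.MULTILINE)


def read_header(file_name: str) -> str:
    """Run `ncdump -h` on a file and return the header text (requires ncdump on PATH)."""
    result = subprocess.run(
        ["ncdump", "-h", file_name], capture_output=True, text=True, check=True
    )
    return result.stdout


def float_variables(header_text: str) -> set[str]:
    """Names of all float variables declared in the header."""
    return set(FLOAT_DECL.findall(header_text))


def missing_variables(header_text: str, requested: list[str]) -> list[str]:
    """Requested variable names that are not declared as float in the header."""
    available = float_variables(header_text)
    return [name for name in requested if name not in available]
```

For example, `missing_variables(read_header("<file-name>"), ["var1", "var2"])` lists the requested variables that the file does not define, which you could then compare against `vars` or `plots_atm` in your cfg.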

Creating a MCVE

It can be helpful to reduce a problem to the smallest possible example size -- a minimal complete verifiable example (MCVE). This is helpful both to you as a debugger and to others you show the problem to.

For example, from the zstash Bug Report template (https://github.com/E3SM-Project/zstash/issues/new/choose):

See guidelines below on how to provide a good MCVE:

Minimal Complete Verifiable Examples

Craft Minimal Bug Reports

This can be a real challenge in zppy, since the bug often arises out of the many inter-connected pieces (which is why the MCVE question isn't even on the zppy bug report template). Sometimes, though, it is possible. In these cases, creating a MCVE can be quite helpful.

For example, when debugging global_time_series, if you've identified that the problem is in coupled_global.py, you don't need to re-run zppy or even global_time_series.bash. Just look at global_time_series.bash to identify what parameters were used in the call to coupled_global.py and run coupled_global.py with those parameters yourself.
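Finding that call by eye is usually easy, but a small helper can pull it out of a long generated script. The sketch below assumes the call appears as an ordinary command line (possibly continued with trailing backslashes) in the bash file. `find_invocations` is a hypothetical name, and the exact layout of the generated script may differ.

```python
import shlex


def find_invocations(bash_text: str, script_name: str) -> list[list[str]]:
    """Return the argument list of every command in bash_text that mentions script_name.

    Lines ending in a backslash are joined with the line that follows.
    Comment lines are ignored. Shell variables are left unexpanded.
    """
    commands = []
    pending = ""
    for raw_line in bash_text.splitlines():
        line = raw_line.strip()
        if line.endswith("\\"):
            # Continuation: keep accumulating until the command ends.
            pending += line[:-1] + " "
            continue
        full_line = pending + line
        pending = ""
        if script_name in full_line and not full_line.startswith("#"):
            commands.append(shlex.split(full_line))
    return commands
```

You might call it as `find_invocations(open("global_time_series.bash").read(), "coupled_global.py")`. Any `${...}` variables in the result still need to be resolved from earlier lines of the bash file before you run the command yourself.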

In rare cases it may even be possible to reduce the problem to a few lines of Python, in which case you can debug the problem in an interactive Python interpreter.

In most cases, the simplest way to make a MCVE is to create a minimal cfg: run on as few years as possible, run as few tasks as possible, run on as few variables as possible -- what specific parameter combination causes the problem?

Write a test

Once you find a bug, think if there's a test you can write that would catch this bug in the future. E.g., what combination of parameters or type of data causes this bug to appear? If we can get a test into the test suite for this bug, then it prevents future users from running into it too.
