zppy errors in E3SM Unified 1.9.2rc2 #538

Closed
3 of 4 tasks
forsyth2 opened this issue Dec 22, 2023 · 15 comments
Labels
semver: bug Bug fix (will increment patch version)

Comments

@forsyth2
Collaborator

forsyth2 commented Dec 22, 2023

Request criteria

  • I searched the zppy GitHub Discussions to find a similar question and didn't find it.
  • I searched the zppy documentation.
  • This issue does not match the other templates (i.e., it is not a bug report, documentation request, feature request, or a question.)

Issue description

Testing zppy on Chrysalis with E3SM Unified 1.9.2rc2, I ran into the following errors in the complete_run test. The errors appear to be similar on Perlmutter. Please note that there is no new zppy release for E3SM Unified 1.9.2. That is, these errors are occurring on a zppy version that was previously tested (for E3SM Unified 1.9.1).

Error 1

e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_*:

This is the job that makes sure the environment_commands parameter is working properly.

1a

tests/integration/utils.py had "diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221", meaning it ran using a conda dev environment built off the latest main of E3SM Diags (E3SM-Project/e3sm_diags@9e14ff8)

===== RUN E3SM DIAGS =====

2023-12-21 14:25:12,848 [ERROR]: run.py(run_diags:90) >> Error traceback:
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/run.py", line 88, in run_diags
    params_results = main(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 363, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/tmp.447022.wCBq/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/run.py", line 92, in run_diags
    move_log_to_prov_dir(params_results[0].results_dir)
UnboundLocalError: local variable 'params_results' referenced before assignment
[WARNING] yaksa: 10 leaked handle pool objects
srun: error: chr-0496: task 0: Exited with exit code 1

Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.

1b

I then changed tests/integration/utils.py to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh", which is the same environment_commands all the other jobs used. Looking at https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, that looks like that would be E3SM Diags v2.10.0 (E3SM-Project/e3sm_diags@0b7f9c7).

===== RUN E3SM DIAGS =====

2023-12-21 20:06:05,915 [ERROR]: run.py(run_diags:37) >> Error traceback:
Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 35, in run_diags
    main(final_params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 362, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/tmp.447067.MErK/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 38, in run_diags
    move_log_to_prov_dir(final_params[0].results_dir)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/logger.py", line 104, in move_log_to_prov_dir
    shutil.copy(LOG_FILENAME, provenance_dir)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/shutil.py", line 256, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/prov/e3sm_diags_run.log'
srun: error: chr-0493: task 0: Exited with exit code 1

Since this case tests the E3SM Diags version that is included in the upcoming Unified release, we should address this error.

1c

I then changed tests/integration/utils.py to use the latest official release of E3SM Unified. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh". According to https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.1, that corresponds to E3SM Diags v2.9.0 (E3SM-Project/e3sm_diags@a2d00eb).

This works fine. That is expected since zppy had previously tested this version of E3SM Diags when we did the release for E3SM Unified 1.9.1.

Potential sources of the bugs

1c -> 1b bug:

1b bug -> 1a bug:

  • E3SM-Project/e3sm_diags@9e14ff8
    • This error indicates a failing diagnostic run because params_results is not set (only set if successful)
    • Add issue to e3sm_diags to fix this behavior (try, except, else)
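The try/except/else idea from the last bullet can be sketched as follows. This is only an illustration, not the e3sm_diags code: main() here is a hypothetical stand-in that fails on an empty results_dir, and the parameters are plain dicts rather than parameter objects.

```python
import logging

logger = logging.getLogger(__name__)


def main(params):
    """Hypothetical stand-in for the diagnostics driver."""
    if not params[0]["results_dir"]:
        raise FileNotFoundError("No such file or directory: ''")
    return params


def run_diags(params):
    try:
        params_results = main(params)
    except Exception:
        # The failure is logged; nothing after this point touches
        # params_results, so no UnboundLocalError can occur.
        logger.exception("Error traceback:")
        return None
    else:
        # Runs only when main() succeeded, so params_results is bound.
        return params_results
```

With this shape, a failed run returns None instead of crashing on an unassigned variable, and the caller can decide how to report it.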

Error 2

ilamb_*:

Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/bin/ilamb-run", line 993, in <module>
    S = Scoreboard(
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 508, in __init__
    TraversePreorder(self.tree, _initConfrontation)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 122, in TraversePreorder
    visit(node)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 470, in _initConfrontation
    node.confrontation = Constructor(**(node.__dict__))
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/ConfTWSA.py", line 37, in __init__
    self.basins = r.addRegionNetCDF4(
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Regions.py", line 139, in addRegionNetCDF4
    dset = Dataset(filename)
  File "src/netCDF4/_netCDF4.pyx", line 2464, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 2027, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: '/lcrc/group/e3sm/diagnostics/ilamb_data/DATA/mrro/Dai/basins_0.5x0.5.nc'
srun: error: chr-0496: task 0: Exited with exit code 1

I'm not sure if this is a bug with ILAMB itself or if there is simply a missing dataset (e.g., a dataset was deleted or the new version of ILAMB requires one that I'm not pointing to).
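One way to separate "missing dataset" from "ILAMB bug" before submitting a job is a small pre-flight check. The sketch below is an assumption-laden illustration: the helper missing_files() is hypothetical, and the list of required files contains only the one path visible in the traceback above, not the full set ILAMB needs.

```python
from pathlib import Path


def missing_files(root, relative_paths):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


# Root and file taken from the traceback; the list itself is an assumption.
ILAMB_ROOT = "/lcrc/group/e3sm/diagnostics/ilamb_data"
REQUIRED = ["DATA/mrro/Dai/basins_0.5x0.5.nc"]

if __name__ == "__main__":
    for rel in missing_files(ILAMB_ROOT, REQUIRED):
        print(f"missing: {rel}")
```

If the check reports the file as missing on the machine, the data directory is the more likely culprit than ILAMB itself.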

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Dec 22, 2023
@tomvothecoder
Collaborator

Thanks for the clear description. I've crossed out the commits that I know wouldn't affect e3sm_diags between v2.9.0 and v2.10.0.

Can you provide me the standalone command for the e3sm_diags task? I will try stepping through the code on v2.10.0 up to where it breaks.

@forsyth2
Collaborator Author

Can you provide me the standalone command for the e3sm_diags task?

Hmm I'm trying to figure out exactly how to condense that down. There's a lot of auto-generation (and NCO dependencies) that come first.

The relevant parts of the cfg (excluding any climo/ts dependencies) would be:

Excerpt of tests/integration/generated/test_bundles_chrysalis.cfg
[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/unified_1.9.2rc2"

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
# TODO: this directory is missing OMI-MLS
sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere", "tc_analysis",
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
walltime = "00:30:00"
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave_environment_commands ]]
  environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
  sets = "qbo",
  ts_subsection = "atm_monthly_180x360_aave"

So, that ends up generating a bash file, like this:

e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.bash
#!/bin/bash

# Running on chrysalis

#SBATCH  --job-name=e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o%j
#SBATCH  --exclusive
#SBATCH  --time=00:30:00

#SBATCH  --partition=debug



source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh

# Turn on debug output if needed
debug=False
if [[ "${debug,,}" == "true" ]]; then
  set -x
fi

# Make sure UVCDAT doesn't prompt us about anonymous logging
export UVCDAT_ANONYMOUS_LOG=False

# Script dir
cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts

# Get jobid
id=${SLURM_JOBID}

# Update status file
STARTTIME=$(date +%s)
echo "RUNNING ${id}" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status

# Basic definitions
case="v2.LR.historical_0201"
short="v2.LR.historical_0201"
www="/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/unified_1.9.2rc2"
y1=1850
y2=1851
Y1="1850"
Y2="1851"

run_type="model_vs_obs"
tag="model_vs_obs"

results_dir=${tag}_${Y1}-${Y2}

# Create temporary workdir
workdir=`mktemp -d tmp.${id}.XXXX`
cd ${workdir}

create_links_climo()
{
  climo_dir_source=$1
  climo_dir_destination=$2
  nc_prefix=$3
  begin_year=$4
  end_year=$5
  error_num=$6
  mkdir -p ${climo_dir_destination}
  cd ${climo_dir_destination}
  cp -s ${climo_dir_source}/${nc_prefix}_*_${begin_year}??_${end_year}??_climo.nc .
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

create_links_climo_diurnal()
{
  climo_diurnal_dir_source=$1
  climo_diurnal_dir_destination=$2
  nc_prefix=$3
  begin_year=$4
  end_year=$5
  error_num=$6
  mkdir -p ${climo_diurnal_dir_destination}
  cd ${climo_diurnal_dir_destination}
  cp -s ${climo_diurnal_dir_source}/${nc_prefix}._*_${begin_year}??_${end_year}??_climo.nc .
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

create_links_ts()
{
  ts_dir_source=$1
  ts_dir_destination=$2
  begin_year=$3
  end_year=$4
  error_num=$5
  # Create xml files for time series variables
  mkdir -p ${ts_dir_destination}
  cd ${ts_dir_destination}
  # https://stackoverflow.com/questions/27702452/loop-through-a-comma-separated-shell-variable
  variables="FSNTOA,FLUT,FSNT,FLNT,FSNS,FLNS,SHFLX,QFLX,TAUX,TAUY,PRECC,PRECL,PRECSC,PRECSL,TS,TREFHT,CLDTOT,CLDHGH,CLDMED,CLDLOW,U"
  for v in ${variables//,/ }
  do
    # Go through the time series files for between year1 and year2, using a step size equal to the number of years per time series file
    for year in `seq ${begin_year} 2 ${end_year}`;
    do
      YYYY=`printf "%04d" ${year}`
      for file in ${ts_dir_source}/${v}_${YYYY}*.nc
      do
        # Add this time series file to the list of files for cdscan to use
        echo ${file} >> ${v}_files.txt
      done
    done
    # xml file will cover the whole period from year1 to year2
    xml_name=${v}_${begin_year}01_${end_year}12.xml
    export CDMS_NO_MPI=true
    cdscan -x ${xml_name} -f ${v}_files.txt
    if [ $? != 0 ]; then
      cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
      echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
      exit ${error_num}
    fi
  done
  cd ..
}

create_links_ts_rof()
{
  ts_rof_dir_source=$1
  ts_rof_dir_destination=$2
  begin_year=$3
  end_year=$4
  error_num=$5
  mkdir -p ${ts_rof_dir_destination}
  cd ${ts_rof_dir_destination}
  v="RIVER_DISCHARGE_OVER_LAND_LIQ"
  xml_name=${v}_${begin_year}01_${end_year}12.xml
  cdscan -x ${xml_name} ${ts_rof_dir_source}/${v}_*.nc
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

ts_dir_primary=ts

# Create xml files for time series variables
ts_dir_source=/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/atm/180x360_aave/ts/monthly/2yr
create_links_ts ${ts_dir_source} ${ts_dir_primary} ${Y1} ${Y2} 5



ref_name=



# Run E3SM Diags
echo
echo ===== RUN E3SM DIAGS =====
echo

# Prepare configuration file
cat > e3sm.py << EOF
import os
import numpy
from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.parameter.qbo_parameter import QboParameter


from e3sm_diags.run import runner

short_name = '${short}'
test_ts = '${ts_dir_primary}'
start_yr = int('${Y1}')
end_yr = int('${Y2}')
num_years = end_yr - start_yr + 1
ref_start_yr = 1985

param = CoreParameter()

# Model
param.test_name = '${case}'
param.short_test_name = short_name

# Output dir
param.results_dir = '${results_dir}'

# Additional settings
param.run_type = 'model_vs_obs'
param.diff_title = 'Model - Observations'
param.output_format = ['png']
param.output_format_subplot = []
param.multiprocessing = True
param.num_workers = 24
#param.fail_on_incomplete = True
params = [param]
qbo_param = QboParameter()
qbo_param.test_data_path = test_ts
qbo_param.test_name = short_name
qbo_param.test_start_yr = start_yr
qbo_param.test_end_yr = end_yr
qbo_param.ref_start_yr = ref_start_yr
ref_end_yr = ref_start_yr + num_years - 1
if (ref_end_yr <= 2014):
  qbo_param.ref_end_yr = ref_end_yr
else:
  qbo_param.ref_end_yr = 2014

# Obs
qbo_param.reference_data_path = '/lcrc/group/e3sm/diagnostics/observations/Atm/time-series/'

params.append(qbo_param)

# Run
runner.sets_to_run = ['qbo']
runner.run_diags(params)

EOF

# Handle cases when cfg file is explicitly provided

command="srun -n 1 python -u e3sm.py"


# Run diagnostics
time ${command}
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (9)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 9
fi

# Copy output to web server
echo
echo ===== COPY FILES TO WEB SERVER =====
echo

# Create top-level directory
web_dir=${www}/${case}/e3sm_diags/atm_monthly_180x360_aave_environment_commands
mkdir -p ${web_dir}
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (10)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 10
fi



# Copy files
rsync -a --delete ${results_dir} ${web_dir}/
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (11)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 11
fi




# For LCRC, change permissions of new files
pushd ${web_dir}/
chmod -R go+rX,go-w ${results_dir}
popd


# Delete temporary workdir
cd ..
if [[ "${debug,,}" != "true" ]]; then
  rm -rf ${workdir}
fi

# Update status file and exit

ENDTIME=$(date +%s)
ELAPSEDTIME=$(($ENDTIME - $STARTTIME))

echo ==============================================
echo "Elapsed time: $ELAPSEDTIME seconds"
echo ==============================================
rm -f e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
echo 'OK' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status

So, the command is really srun -n 1 python -u e3sm.py, but there's a whole auto-generated e3sm.py.
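To make the auto-generation step concrete, here is a rough Python sketch of what the bash heredoc does when it expands variables into e3sm.py. It uses string.Template from the standard library purely for illustration; this is not how zppy renders its templates, and only a few of the generated lines are reproduced.

```python
from string import Template

# A few lines of the generated script, with the same ${...} placeholders
# that the heredoc in the bash file expands.
E3SM_PY_TEMPLATE = Template(
    "short_name = '${short}'\n"
    "start_yr = int('${Y1}')\n"
    "end_yr = int('${Y2}')\n"
    "param.results_dir = '${results_dir}'\n"
)

values = {"short": "v2.LR.historical_0201", "Y1": "1850", "Y2": "1851"}
# Mirrors results_dir=${tag}_${Y1}-${Y2} with tag="model_vs_obs".
values["results_dir"] = f"model_vs_obs_{values['Y1']}-{values['Y2']}"

rendered = E3SM_PY_TEMPLATE.substitute(values)
print(rendered)
```

The point is that results_dir is set explicitly in the generated script, so an empty results_dir at runtime has to come from inside e3sm_diags rather than from the template.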

@forsyth2
Collaborator Author

Ah, ok, I think this would be the e3sm.py to try out. You still might need to run NCO to pre-process the data (i.e., what the ts task dependency does) though.

e3sm.py
import os
import numpy
from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.parameter.qbo_parameter import QboParameter


from e3sm_diags.run import runner

short_name = 'v2.LR.historical_0201'
test_ts = 'ts'
start_yr = int('1850')
end_yr = int('1851')
num_years = end_yr - start_yr + 1
ref_start_yr = 1985

param = CoreParameter()

# Model
param.test_name = 'v2.LR.historical_0201'
param.short_test_name = short_name

# Output dir
param.results_dir = 'model_vs_obs_1850-1851'

# Additional settings
param.run_type = 'model_vs_obs'
param.diff_title = 'Model - Observations'
param.output_format = ['png']
param.output_format_subplot = []
param.multiprocessing = True
param.num_workers = 24
#param.fail_on_incomplete = True
params = [param]
qbo_param = QboParameter()
qbo_param.test_data_path = test_ts
qbo_param.test_name = short_name
qbo_param.test_start_yr = start_yr
qbo_param.test_end_yr = end_yr
qbo_param.ref_start_yr = ref_start_yr
ref_end_yr = ref_start_yr + num_years - 1
if (ref_end_yr <= 2014):
  qbo_param.ref_end_yr = ref_end_yr
else:
  qbo_param.ref_end_yr = 2014

# Obs
qbo_param.reference_data_path = '/lcrc/group/e3sm/diagnostics/observations/Atm/time-series/'

params.append(qbo_param)

# Run
runner.sets_to_run = ['qbo']
runner.run_diags(params)
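For anyone stepping through the script, the reference-period logic can be restated as a small function. This is a sketch with a hypothetical helper name, reference_period(); the defaults mirror ref_start_yr = 1985 and the 2014 cap used in the script above.

```python
def reference_period(test_start_yr, test_end_yr, ref_start_yr=1985, ref_final_yr=2014):
    """Return (ref_start_yr, ref_end_yr) matching the test period length,
    capped at ref_final_yr."""
    num_years = test_end_yr - test_start_yr + 1
    ref_end_yr = min(ref_start_yr + num_years - 1, ref_final_yr)
    return ref_start_yr, ref_end_yr


print(reference_period(1850, 1851))  # (1985, 1986)
```

For the 1850-1851 case the cap is not reached, so the reference period is simply the first two years starting at 1985.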

@forsyth2
Collaborator Author

The other tests for zppy are fine. (Except for the bundles run, which also runs into the ILAMB error. That test doesn't check the environment_commands setting, so it wouldn't catch the other error).

@forsyth2
Collaborator Author

Re: the ILAMB error, I ran with environment_commands set to use the E3SM Unified 1.9.1 version of ILAMB, and it worked fine. So, something changed on ILAMB between versions 2.6 and 2.7.

(Note that my note on #523 (comment) is to see if ILAMB 2.7 expands the zppy output and/or allows a simpler cfg. I can't really check that until I have 2.7 working in the first place).

@tomvothecoder
Collaborator

tomvothecoder commented Dec 22, 2023

1b

I then changed tests/integration/utils.py to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh", which is the same environment_commands all the other jobs used. Looking at acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, that looks like that would be E3SM Diags v2.10.0 (E3SM-Project/e3sm_diags@0b7f9c7).

Okay I figured out the root cause of the e3sm_diags results_dir issue.

Root Cause

This commit E3SM-Project/e3sm_diags@48426aa (#755) changes if type(p) == cls_type to if isinstance(p, cls_type). Apparently, these two conditionals are not equivalent (source) and can return different boolean values: isinstance() is also true when p is an instance of a subclass of cls_type. I changed if type(p) == cls_type because it is considered bad practice (Flake8 E721).

As a result, in v2.10.0, the results_dir config is not being copied from the first parameter (param) to the second parameter (qbo_param). This causes results_dir to be blank which cascades to the FileNotFoundError: [Errno 2] No such file or directory: '' and FileNotFoundError: [Errno 2] No such file or directory: '/prov/e3sm_diags_run.log'.
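The first of those two errors is easy to reproduce in isolation: calling os.makedirs() with an empty string fails in the same way as the traceback. A minimal sketch, assuming only that results_dir ends up as an empty string:

```python
import os

results_dir = ""  # what the parameter holds when the copy is skipped
message = None

try:
    os.makedirs(results_dir, 0o755)
except FileNotFoundError as err:
    # Same exception type as in the e3sm_diags traceback.
    message = str(err)
    print(message)
```

So the traceback points at os.makedirs(), but the real problem is upstream, wherever results_dir was supposed to be filled in.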

The Fix

The fix is to change if isinstance(p, cls_type) to if type(p) is cls_type. I will open a separate PR for this and get a new e3sm_diags RC release out.
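The difference between the two checks can be shown with a toy example. The classes below are hypothetical stand-ins that only reproduce the inheritance relationship, the two lookup helpers are not the e3sm_diags functions, and the list ordering is chosen to expose the difference rather than taken from the real code.

```python
class CoreParameter:
    """Hypothetical stand-in for the base parameter class."""


class QboParameter(CoreParameter):
    """Hypothetical stand-in for a set-specific subclass."""


def find_by_isinstance(params, cls_type):
    """Return the first p that is an instance of cls_type OR of a subclass."""
    for p in params:
        if isinstance(p, cls_type):
            return p
    return None


def find_by_exact_type(params, cls_type):
    """Return the first p whose type is exactly cls_type."""
    for p in params:
        if type(p) is cls_type:
            return p
    return None


param, qbo_param = CoreParameter(), QboParameter()
params = [qbo_param, param]

# isinstance() matches the subclass instance first; the exact check skips it.
print(find_by_isinstance(params, CoreParameter) is qbo_param)  # True
print(find_by_exact_type(params, CoreParameter) is param)      # True
```

Using is rather than == for the exact-type comparison also keeps Flake8 E721 satisfied.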

Other Thoughts

This brings up again my idea of testing these tools together, outside of E3SM Unified. We should consider more frequent releases and periodic testing before E3SM Unified releases. It would really cut down on bugs appearing at the last second with E3SM Unified releases. It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.

Also, the way e3sm_diags copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex. I wish it wasn't so fragile and hard to work with.

@tomvothecoder
Collaborator

tomvothecoder commented Dec 22, 2023

1a

tests/integration/utils.py had "diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221", meaning it ran using a conda dev environment built off the latest main of E3SM Diags (E3SM-Project/e3sm_diags@9e14ff8)

===== RUN E3SM DIAGS =====

2023-12-21 14:25:12,848 [ERROR]: run.py(run_diags:90) >> Error traceback:
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/run.py", line 88, in run_diags
    params_results = main(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 363, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/tmp.447022.wCBq/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_diags/run.py", line 92, in run_diags
    move_log_to_prov_dir(params_results[0].results_dir)
UnboundLocalError: local variable 'params_results' referenced before assignment
[WARNING] yaksa: 10 leaked handle pool objects
srun: error: chr-0496: task 0: Exited with exit code 1

Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.

Just an FYI that this run fails in v2.9.0 too, but the log file saves, which makes it seem like it was working. UPDATE: Actually, I didn't run NCO first, so the failure with the test e3sm.py script may just be a result of that.

On main, it fails but the log file does not save (UnboundLocalError: local variable 'params_results' referenced before assignment)

v2.9.0 output -- log file saves

2023-12-22 17:10:18,864 [INFO]: run.py(_add_parent_attrs_to_children:152) >> ['diff_title', 'short_test_name', 'num_workers', 'results_dir']
2023-12-22 17:10:25,355 [INFO]: e3sm_diags_driver.py(_save_env_yml:59) >> Saved environment yml file to: model_vs_obs_1850-1851/prov/environment.yml
2023-12-22 17:10:25,356 [INFO]: e3sm_diags_driver.py(_save_parameter_files:70) >> Saved command used to: model_vs_obs_1850-1851/prov/cmd_used.txt
2023-12-22 17:10:25,358 [INFO]: e3sm_diags_driver.py(_save_python_script:134) >> Saved Python script to: model_vs_obs_1850-1851/prov/ipykernel_launcher.py
2023-12-22 17:10:26,954 [ERROR]: e3sm_diags_driver.py(run_diag:296) >> Error in e3sm_diags.driver.qbo_driver
Traceback (most recent call last):
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 293, in run_diag
    single_result = module.run_diag(parameter)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/qbo_driver.py", line 173, in run_diag
    test_var = test_data.get_timeseries_variable(variable)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 98, in get_timeseries_variable
    variables = self._get_timeseries_var(data_path, *args, **kwargs)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 469, in _get_timeseries_var
    vars_to_func_dict = self._get_first_valid_vars_timeseries(
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 560, in _get_first_valid_vars_timeseries
    raise RuntimeError(msg)
RuntimeError: Neither does U nor the variables in [('ua',), ('U',)] have valid files in ts.
2023-12-22 17:10:26,959 [WARNING]: e3sm_diags_driver.py(main:426) >> There was not a single valid diagnostics run, no viewer created.
2023-12-22 17:10:26,960 [ERROR]: run.py(run_diags:37) >> Error traceback:
Traceback (most recent call last):
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/run.py", line 35, in run_diags
    main(final_params)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 445, in main
    if parameters_results[0].fail_on_incomplete and (
IndexError: list index out of range
2023-12-22 17:10:26,965 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in model_vs_obs_1850-1851/prov/e3sm_diags_run.log

@forsyth2
Collaborator Author

forsyth2 commented Dec 22, 2023

As a result, in v2.10.0, the results_dir config is not being copied from the first parameter (param) to the second parameter (qbo_param).

Wow, that is a very insidious bug. So, it's just returning p right away because qbo_param is-a param? But we really want it to return p when we get to the exact type?

I will open a separate PR for this and get a new e3sm_diags RC release out.

Awesome, thanks!!

testing these tools together, outside of E3SM Unified.

I absolutely agree. I should update the testing process as follows:

It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.

That's true. https://nvie.com/posts/a-successful-git-branching-model/ suggests merging patches into the user-facing releases AND the latest development branch. That is, only the bug fixes should be getting merged into user-facing code between the non-patch releases; we shouldn't be doing whole new releases of packages according to this particular workflow ideal.

the way e3sm_diags copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex.

I agree it's very convoluted. I haven't studied it enough to know if there's a simpler way to accomplish the same thing. zppy has similar issues of parameter convolution (i.e., parameters being defined in different hierarchical sections of the cfg and some parameters being introduced internally via the <task>.py and the <task>.bash templates). The .settings file at least shows the final values for each parameter a job uses.

@forsyth2
Collaborator Author

I want to make sure I'm clear on how params_results is used. It sounds similar to param.fail_on_incomplete.

The latter, when set to True, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.

It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?

@tomvothecoder
Collaborator

Wow, that is a very insidious bug. So, it's just returning p right away because qbo_param is-a param? But we really want it to return p when we get to the exact type?

Yeah we only want to return p if it is exact same type and not a sub-class/sub-type.

conda-forge/e3sm_diags-feedstock@79944cd

I want to make sure I'm clear on how params_results is used. It sounds similar to param.fail_on_incomplete.

The latter, when set to True, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.

It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?

params_results is different. It's just a variable containing params but AFTER successful diagnostic runs. If diagnostic runs fail, params_results is never set which causes UnboundLocalError: local variable 'params_results' referenced before assignment. This causes e3sm_diags to crash. param.fail_on_incomplete sounds like it will stop e3sm_diags on the first instance of a failure.

Notice in the code below that params_results has no default value and does not get assigned if the try: statement failed.

        params = self.get_run_parameters(parameters, use_cfg)

        if params is None or len(params) == 0:
            raise RuntimeError(
                "No parameters we able to be extracted. Please "
                "check the parameters you defined."
            )

        try:
            params_results = main(params)
        except Exception:
            logger.exception("Error traceback:", exc_info=True)

        move_log_to_prov_dir(params_results[0].results_dir)

        return params_results

I updated this logic in https://github.com/E3SM-Project/e3sm_diags/pull/770/files so that params_results has a default value (None) and move_log_to_prov_dir() uses params (not params_results because it can be None with failed runs).
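The shape of that change can be sketched roughly as below. This is not the code from the PR: main() and move_log_to_prov_dir() are hypothetical stand-ins, and parameters are plain dicts so the example runs on its own.

```python
import logging

logger = logging.getLogger(__name__)
moved_logs = []  # records calls made to the stand-in below


def main(params):
    """Hypothetical stand-in for the driver; fails when asked to."""
    if params[0].get("broken"):
        raise RuntimeError("diagnostic run failed")
    return params


def move_log_to_prov_dir(results_dir):
    """Hypothetical stand-in that only records where the log would go."""
    moved_logs.append(results_dir)


def run_diags(params):
    params_results = None  # default, so the name is always bound
    try:
        params_results = main(params)
    except Exception:
        logger.exception("Error traceback:")

    # Use the input params here: params_results may be None after a failure.
    move_log_to_prov_dir(params[0]["results_dir"])
    return params_results
```

With this arrangement the log is still moved after a failed run, and the caller gets None back instead of an UnboundLocalError.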

@tomvothecoder
Collaborator

tomvothecoder commented Dec 23, 2023

e3sm_diags v2.10.1rc1 is now released with these fixes: conda-forge/e3sm_diags-feedstock@79944cd

@forsyth2
Collaborator Author

Great, thanks @tomvothecoder!

@forsyth2
Collaborator Author

Re: Error 2, I made rubisco-sfa/ILAMB#85.

@xylar
Contributor

xylar commented Jan 2, 2024

@forsyth2 and @chengzhuzhang, it seems like we probably need to run ilamb-fetch on Chrysalis (or Anvil) in the appropriate directory, probably:

/lcrc/group/e3sm/diagnostics/ilamb_data

See: https://www.ilamb.org/doc/ilamb_fetch.html

@forsyth2
Collaborator Author

Re: Error 2, the issue does appear to be from not running ilamb-fetch. See #541

@github-project-automation github-project-automation bot moved this from In Progress to Done in forsyth2 current tasks Jan 19, 2024
3 participants