g-w CI C96C48_hybatmaerosnowDA fails on WCOSS2 #1336

Open

RussTreadon-NOAA opened this issue Oct 17, 2024 · 11 comments

@RussTreadon-NOAA
Contributor

When g-w CI C96C48_hybatmaerosnowDA is run using g-w PR #2978, the following jobs abort on WCOSS2 (Cactus)

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
202112201200        gdas_aeroanlgenb                   158495445                DEAD                 -29         2        1849.0
202112201800            gdas_snowanl                   158472339                DEAD                   1         2          58.0

gdas_aeroanlgenb aborts while executing gdas.x fv3jedi convertstate using chem_convertstate.yaml

nid003614.cactus.wcoss2.ncep.noaa.gov 0: Converting state 1 of 1
nid003614.cactus.wcoss2.ncep.noaa.gov 32:
FATAL from PE    32: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./bkg/20211220.180000.anlres.fv_tracer.res.tile3.nc variable:xaxis_1

nid003614.cactus.wcoss2.ncep.noaa.gov 64:
FATAL from PE    64: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./bkg/20211220.180000.anlres.fv_tracer.res.tile5.nc variable:xaxis_1

gdas_snowanl aborts while executing gdas.x fv3jedi localensembleda using letkfoi.yaml

nid001614.cactus.wcoss2.ncep.noaa.gov 0: Local solver completed.
OOPS_STATS LocalEnsembleDA after solver             - Runtime:     10.48 sec,  Memory: total:     8.43 Gb, per task: min =     1.40 Gb, max =     1.41 Gb
nid001614.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./anl/snowinc.20211220.180000.sfc_data.tile1.nc variable:xaxis_1

nid001614.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Name contains illegal characters: netcdf_add_variable: file:./anl/snowinc.20211220.180000.sfc_data.tile1.nc variable:xaxis_1

The error message is the same for both failures.
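For what it's worth, xaxis_1 is itself a perfectly legal netCDF name, so the illegal-characters error hints that the string actually handed to the netCDF library differs from what is printed (e.g., a mangled or unterminated name string), which would point at the build environment rather than the data. A hedged, low-cost check that the names on disk are clean (the commands below are a suggestion, not taken from the jobs above):

# If the files named in the FATAL messages were (partially) created, dump their
# headers and look for stray characters around the axis names; cat -A makes
# non-printing characters visible.
ncdump -h ./bkg/20211220.180000.anlres.fv_tracer.res.tile3.nc | cat -A | grep xaxis
ncdump -h ./anl/snowinc.20211220.180000.sfc_data.tile1.nc | cat -A | grep xaxis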

These jobs successfully run to completion on Hera, Hercules, and Orion. GDASApp is built with newer intel compilers and different modules on these machines. It is not clear whether the older intel/19 compiler or the modulefiles used on Cactus are the issue, or whether there is an actual bug in the JEDI code that needs to be fixed.

This issue is opened to document the WCOSS2 failure and its resolution.

@RussTreadon-NOAA
Contributor Author

10/18/2024 update

Examine code in sorc/fv3-jedi/src/fv3jedi/IO/FV3Restart. Add prints to IOFms.cc, IOFms.interface.F90, fv3jedi_io_fms2_mod.f90. Create stand-alone script to execute fv3jedi_convertstate.x using 20211220 12Z gdas_aeroanlgenb input. Reproduce the illegal characters failure above. Prints suggest the code is working as intended. Nothing jumps out as being wrong in the code.
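For reference, a minimal sketch of what such a stand-alone reproducer could look like on Cactus; the run directory, module set, and task count below are placeholders, not the actual script:

# Hypothetical reproducer for the failed convertstate step (paths are placeholders).
# The run directory is assumed to be staged with the 20211220 12Z gdas_aeroanlgenb
# inputs (bkg/ restarts and chem_convertstate.yaml) copied from the failed job.
RUNDIR=/path/to/staged/aeroanlgenb
cd "${RUNDIR}"
# load the same modules used to build GDASApp on Cactus (intel/19, netcdf/4.7.4, ...)
mpiexec -n "${NTASKS}" ./fv3jedi_convertstate.x ./chem_convertstate.yaml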

Cory found a unidata/netcdf issue reporting illegal characters which appeared to be related to the netcdf version. Beginning to think this may be the issue on WCOSS2. All other platforms build GDASApp with

load("parallel-netcdf/1.12.2")
load("netcdf-c/4.9.2")
load("netcdf-fortran/4.6.1")
load("netcdf-cxx4/4.3.1")

WCOSS2 uses

load("netcdf/4.7.4")

Find

load("netcdf-C/4.9.2")
load("pnetcdf-C/1.12.2")

on WCOSS2 but attempts to build with these have not yet been successful. Still working through various combinations of module versions to see if we can build GDASApp on WCOSS2 using newer netcdf versions.

It would be nice if WCOSS2 had available the same spack-stack used on NOAA RDHPCS machines.

@RussTreadon-NOAA
Contributor Author

10/20/2024 update

Unable to find a combination of hpc-stack modules to successfully build and/or run gdas.x for either of the failed C96C48_hybatmaerosnowDA jobs. Log into Acorn. Find spack-stack versions 1.6.0, 1.7.0, and 1.8.0. Will use RDHPCS modulefiles to see if we can develop a spack-stack-based acorn.intel.lua that allows the failed C96C48_hybatmaerosnowDA jobs to successfully run to completion.

In the interim, modify fv3-jedi CMakeLists.txt to make the FMS2_IO build a configurable cmake option via the following changes:

@@ -122,7 +122,13 @@ if (NOT FV3_FORECAST_MODEL MATCHES GEOS AND NOT FV3_FORECAST_MODEL MATCHES UFS)
 endif()
 
 # fms
-set(HAS_FMS2_IO TRUE) # Set to FALSE if FMS2 IO unavailable (should be removed eventually)
+option(BUILD_FMS2_IO "Build fv3-jedi with FMS2_IO" ON)
+set(HAS_FMS2_IO TRUE)
+if (NOT BUILD_FMS2_IO)
+   set(HAS_FMS2_IO FALSE)
+endif()
+message("FV3-JEDI built with HAS_FMS2_IO set to ${HAS_FMS2_IO}")
+
 find_package(FMS 2023.04 REQUIRED COMPONENTS R4 R8)
 if (FV3_PRECISION MATCHES DOUBLE OR NOT FV3_PRECISION)
   add_library(fms ALIAS FMS::fms_r8)

FMS2_IO is the default build option. Adding -DBUILD_FMS2_IO=OFF to the GDASApp cmake results in an FMS_IO build.
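For anyone configuring the bundle by hand rather than through build.sh, the switch would be passed like this (build and source paths below are placeholders):

# Hypothetical manual configure with FMS2 IO toggled off.
mkdir -p build && cd build
cmake -DBUILD_FMS2_IO=OFF /path/to/bundle   # placeholder bundle source path
make -j8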

Do the following:

  1. Make the above change to CMakeLists.txt in a working copy of g-w PR #2978 gdas.cd/sorc/fv3-jedi/CMakeLists.txt.

  2. Add wcoss2 section to GDASApp build.sh to toggle off the FMS2_IO build as shown below

@@ -112,6 +112,11 @@ if [[ $BUILD_TARGET == 'hera' ]]; then
   ln -sf $GDASAPP_TESTDATA/crtm $dir_root/bundle/test-data-release/crtm
 fi
 
+if [[ $BUILD_TARGET == 'wcoss2' ]]; then
+    export BUILD_FMS2_IO="OFF"
+    CMAKE_OPTS+=" -DBUILD_FMS2_IO=${BUILD_FMS2_IO}"
+fi
+
 # Configure
 echo "Configuring ..."
 set -x
  3. Rebuild GDASApp inside working copy of PR #2978 guillaumevernieres:feature/update_hashes.

  4. rocotorewind and rocotoboot the failed C96C48_hybatmaerosnowDA jobs (see the sketch after this list). As expected, both jobs successfully run to completion.
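A hedged example of the rewind/boot commands for the two tasks that had failed; the workflow xml and database file names are assumed to follow the usual EXPDIR naming:

# Assumed workflow/database names; cycles and task names are from the failures above.
cd /lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
rocotorewind -w praero_pr2978.xml -d praero_pr2978.db -c 202112201200 -t gdas_aeroanlgenb
rocotoboot   -w praero_pr2978.xml -d praero_pr2978.db -c 202112201200 -t gdas_aeroanlgenb
rocotorewind -w praero_pr2978.xml -d praero_pr2978.db -c 202112201800 -t gdas_snowanl
rocotoboot   -w praero_pr2978.xml -d praero_pr2978.db -c 202112201800 -t gdas_snowanl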

Modified CMakeLists.txt committed to NOAA-EMC fv3-jedi branch patch/fv3-jedi at 96dff77.

If we are OK with the modified CMakeLists.txt approach as a short-term patch, I will update GDASApp branch patch/gwci to point at NOAA-EMC:fv3-jedi at 96dff77. Once this is done, the sorc/gdas.cd hash in guillaumevernieres:feature/update_hashes can be updated to pull in this change and g-w C96C48_hybatmaerosnowDA reactivated on wcoss2.

@RussTreadon-NOAA
Contributor Author

WCOSS2 test

Install guillaumevernieres:feature/update_hashes at e9fa90c on Cactus. Use GDASApp branch patch/gwci at 0325836 with sorc/fv3-jedi pointing at patch/fv3-jedi at 96dff77.

Run g-w CI on Cactus for

  • C96C48_hybatmDA - PSLOT = prgsi_pr2978
  • C96C48_ufs_hybatmDA - PSLOT = prjedi_pr2978
  • C96C48_hybatmaerosnowDA - PSLOT = praero_pr2978
  • C48mx500_3DVarAOWCDA - PSLOT = prwcda_pr2978

with results as follows

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prgsi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    Oct 21 2024 00:26:50    Oct 21 2024 00:40:16
202112210000        Done    Oct 21 2024 00:26:50    Oct 21 2024 02:50:12
202112210600        Done    Oct 21 2024 00:26:50    Oct 21 2024 02:30:18
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prjedi_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Oct 21 2024 00:26:52    Oct 21 2024 00:40:20
202402240000        Done    Oct 21 2024 00:26:52    Oct 21 2024 03:10:08
202402240600        Done    Oct 21 2024 00:26:52    Oct 21 2024 03:15:16
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/praero_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201200        Done    Oct 21 2024 00:26:53    Oct 21 2024 00:45:17
202112201800        Done    Oct 21 2024 00:26:53    Oct 21 2024 01:45:16
202112210000        Done    Oct 21 2024 00:26:53    Oct 21 2024 03:40:16
 
/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202103241200        Done    Oct 21 2024 00:26:55    Oct 21 2024 00:40:26
202103241800      Active    Oct 21 2024 00:26:55             -          

The WCDA failure is the same as before

/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prwcda_pr2978
202103241800         gdas_marinebmat                   158838013                DEAD                 -29         2        1852.0

This failure is not related to changes in the g-w PR; it also occurs when using GDASApp develop. GDASApp issue #1331 is tracking the WCOSS2 WCDA failure.

@RussTreadon-NOAA
Contributor Author

spack-stack update

Commit acorn.intel.lua to GDASApp branch feature/build at f3fe406. acorn.intel.lua began as a copy of hera.intel.lua with paths and modulefiles updated to work on Acorn. Developer queues have been turned off on Acorn due to system work, so I cannot test and confirm that the executables work. I'll do so after Acorn returns to service.

spack-stack is not yet on WCOSS2, but EIB is in discussions to make this happen.

@RussTreadon-NOAA
Contributor Author

@CoryMartin-NOAA, @guillaumevernieres, @danholdaway, @DavidNew-NOAA: Are we OK with the following incremental approach?

First,

  1. Modify NOAA-EMC:fv3-jedi CMakeLists.txt to make FMS2_IO a cmake-configurable option. FMS2_IO is active by default.
  2. Modify build.sh so that on WCOSS2 we toggle off FMS2_IO and use FMS_IO. This allows C96C48_hybatmaerosnowDA to run to completion.
  3. Update the gdas.cd hash in g-w PR #2978 to bring in the above two changes
  4. Activate C96C48_hybatmaerosnowDA on WCOSS2 in PR #2978

Second,
Once spack-stack is installed on WCOSS2, use the acorn.intel.lua in GDASApp branch feature/build to update wcoss2.intel.lua. Using spack-stack on WCOSS2 will hopefully allow us to run C96C48_hybatmaerosnowDA with FMS2_IO.

If we are OK with the items under First, I'll get to work and make it so.

@DavidNew-NOAA
Collaborator

DavidNew-NOAA commented Oct 21, 2024

@RussTreadon-NOAA Fine by me, but FYI NOAA-EMC/global-workflow#2949 will not work on WCOSS when that PR is merged. The FMS2 IO module in FV3-JEDI also includes non-restart read/write capability, which is needed for native-grid increments in that PR. Hopefully we sort the FMS2 IO issue out before it goes into review.

This PR won't hold that up, because FMS2 IO isn't working anyway on WCOSS. Like I said, just an FYI.

@RussTreadon-NOAA
Contributor Author

@DavidNew-NOAA, does your comment

because FMS2 IO isn't working anyway on WCOSS

refer to the fact that ...

  1. select C96C48_hybatmaerosnowDA jobs in g-w PR #2978 fail with FMS2_IO, or
  2. g-w PR #2949 has been tested on WCOSS2 and found to not work

@RussTreadon-NOAA
Contributor Author

I don't have a WCOSS2 spack-stack implementation timeline, but my guess is that it will not be available on WCOSS2 before g-w PR #2949 is reviewed and merged.

We face a decision for WCOSS2 GDASApp builds:

  1. accept for the time being that at least parts of JEDI aerosol and snow DA do not work on WCOSS2, or
  2. restore JEDI aerosol and snow DA functionality at the expense of breaking the functionality added by g-w PR #2949

Of course, if we can find a combination of existing WCOSS2 modules that works with FMS2_IO, choices 1 and 2 become moot. Thus far, I have not been able to find this combination.

@CoryMartin-NOAA
Contributor

My preference, while not ideal, is option 2, as we have relatively near-term deadlines for aero/snow and not for atm cycling. Do we know for sure it's a library issue?

@RussTreadon-NOAA
Contributor Author

Can't say for sure, but I studied the fv3-jedi fms2 code in depth on Thu-Fri with lots of prints added. Nothing jumps out as being wrong. The code as-is works fine on Hera, Hercules, and Orion. These machines build GDASApp with newer intel compilers and spack-stack. Hence the hypothesis that the Cactus failures are due to the older intel compiler and/or the hpc-stack modules we load.

Once Acorn queues are opened I can run a build of g-w PR #2978 with GDASApp using spack-stack/1.6.0 (same version we use on NOAA RDHPCS) and see if the failing Cactus jobs run OK.

@DavidNew-NOAA
Collaborator

@DavidNew-NOAA, does your comment

because FMS2 IO isn't working anyway on WCOSS

refer to the fact that ...

  1. select C96C48_hybatmaerosnowDA jobs in g-w PR #2978 fail with FMS2_IO, or

  2. g-w PR #2949 has been tested on WCOSS2 and found to not work

@RussTreadon-NOAA I assume atmospheric cycling will not work on WCOSS, because g-w PR #2949 will reintroduce FMS (2). Currently, atmospheric cycling uses cubed-sphere histories to write increments.
