Standardize locations and directory names for staged input data #231

Closed
mkavulich opened this issue Mar 25, 2022 · 6 comments · Fixed by #251
Labels
enhancement New feature or request

Comments

@mkavulich
Collaborator

mkavulich commented Mar 25, 2022

Description

We currently use a wide variety of directory names for storing model input data for the UFS SRW App: both static ("fix") input data used in all cases, and pre-staged initial and boundary condition data for workflow end-to-end (WE2E) tests. The variables that point to the various static and case-specific data, which are inhomogeneous across platforms, are:

  • FIXgsm (where most "fix" files are located)
  • FIXaer (MERRA2 aerosol climatology files)
  • FIXlut (lookup tables for optics properties)
  • TOPO_DIR (static input files for the make_orog task)
  • SFC_CLIMO_INPUT_DIR (static surface climatology input fields for sfc_climo_gen)
  • FIXLAM_NCO_BASEDIR (base directory containing pregenerated grid, orography, and surface climatology files)
  • TEST_PREGEN_BASEDIR (used in WE2E tests only; points to a "custom" location on disk, testing the ability for users to stage their own grid, orography, and surface climatology files)
  • TEST_COMIN (used in WE2E tests only; points to input for boundary and initial conditions specifically for cases using old GFS spectral model input)
  • TEST_EXTRN_MDL_SOURCE_BASEDIR (used in WE2E tests only; points to a "custom" location on disk, testing the ability for users to stage their own data)

We should standardize these across the tier-1 platforms, as this will provide many benefits:

  1. Easier to know where files can be found on any given platform, and where they should be staged on a new platform
  2. Easier to see which data is "official" and appropriate for the version of the code you are using
  3. Easier to update across all platforms when necessary
  4. We could unify the default value for most of the variables above to cut down on machine-specific code

In addition, we should maintain separate directories for "release" static data sets, which will ensure that changes on disk do not impact results for users of released code. Some platforms have implemented this for the v1.0 release (e.g., /scratch2/BMC/det/UFS_SRW_App/v1p0/ on Hera), but it has not been done uniformly across platforms, nor for all static/input data.

Solution

Here is the proposed solution:

  • Deprecate and remove the variables COMINgfs and TEST_COMINgfs, because they duplicate the functionality of EXTRN_MDL_SOURCE_BASEDIR and TEST_EXTRN_MDL_SOURCE_BASEDIR. Update: this naming convention (now COMIN; see #743, "Variable forecast length broken when cycles start at later times") is required by NCO standards, so these variables will be left in for now.

  • Rename FIXLAM_NCO_BASEDIR to DOMAIN_PREGEN_BASEDIR (these are domain-specific but not case-specific files, and they are not specific to NCO cases)

  • All static data for each tier 1 platform will be stored underneath the top-level "UFS_SRW_App" directory

    • e.g. /scratch2/BMC/det/UFS_SRW_App/ on Hera, /glade/p/ral/jntp/UFS_SRW_App/ on Cheyenne, etc.
  • Any static data that may change over time for different releases should be versioned in a subdirectory.

    • e.g., static data for v1.0 should be under UFS_SRW_App/v1p0, v2.0 should be under UFS_SRW_App/v2p0, and "live" data for use with the top of develop should be under UFS_SRW_App/develop
  • "Fix" files (generic static input) will be stored under a "fix" subdirectory (FIX_DIR)

    • i.e. UFS_SRW_App/v1p0/fix, UFS_SRW_App/develop/fix, etc.
    • Each specific type of "fix" file should be in a subdirectory of FIX_DIR
      • FIXgsm = $FIX_DIR/fix_am (I am not sure what "fix_am" stands for, but it is the name of this directory for all platforms currently)
      • FIXaer = $FIX_DIR/fix_aer
      • FIXlut = $FIX_DIR/fix_lut
      • TOPO_DIR = $FIX_DIR/fix_orog
      • SFC_CLIMO_INPUT_DIR = $FIX_DIR/fix_sfc_climo
  • DOMAIN_PREGEN_BASEDIR and TEST_PREGEN_BASEDIR will point to a "FV3LAM_pregen" subdirectory

    • i.e. UFS_SRW_App/v1p0/FV3LAM_pregen, UFS_SRW_App/develop/FV3LAM_pregen, etc.
  • TEST_EXTRN_MDL_SOURCE_BASEDIR will be stored under an "input_model_data" directory

    • i.e. UFS_SRW_App/v1p0/input_model_data, UFS_SRW_App/develop/input_model_data, etc.
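As a sketch, the proposed layout could be generated from a platform's base directory plus a version string. The helper function name and the use of a Python dict here are illustrative only, not part of the workflow itself:

```python
import os

def srw_data_paths(base, version):
    """Build the proposed static-data paths for one platform.

    base:    the platform's top-level UFS_SRW_App directory
             (e.g. /scratch2/BMC/det/UFS_SRW_App on Hera)
    version: a versioned subdirectory such as "v1p0" or "develop"
    """
    fix_dir = os.path.join(base, version, "fix")
    return {
        "FIX_DIR": fix_dir,
        "FIXgsm": os.path.join(fix_dir, "fix_am"),
        "FIXaer": os.path.join(fix_dir, "fix_aer"),
        "FIXlut": os.path.join(fix_dir, "fix_lut"),
        "TOPO_DIR": os.path.join(fix_dir, "fix_orog"),
        "SFC_CLIMO_INPUT_DIR": os.path.join(fix_dir, "fix_sfc_climo"),
        "DOMAIN_PREGEN_BASEDIR": os.path.join(base, version, "FV3LAM_pregen"),
        "TEST_EXTRN_MDL_SOURCE_BASEDIR": os.path.join(base, version, "input_model_data"),
    }
```

With this scheme, only the base directory differs per platform; everything below it is uniform.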

Directories and/or files in develop or a release directory may be symbolic links pointing to a previous "release" directory (to avoid data duplication), but may not be the other way around (which may lead to changing of input for released code, and unexpected changing of results with the same code).

Alternatives

Suggestions are welcome; I'm not particularly tied to any one solution, we just need to converge on a single solution across platforms. In addition, if some of the "duplicate" variables I pointed out need to remain independent, let me know why and I can fit them into the above proposed hierarchy.

@mkavulich mkavulich added the enhancement New feature or request label Mar 25, 2022
@willmayfield
Collaborator

Related to ufs-community/regional_workflow#471.

@jwolff-ncar
Collaborator

Thanks for writing up this issue in such a detailed manner. I think consolidating in this way would enhance ease of use going forward. We welcome others to chime in with concerns or suggestions.

@gsketefian
Collaborator

@mkavulich I am all for this reorganization:

  • "Fix" files (generic static input) will be stored under a "fix" subdirectory (FIX_DIR)
    • i.e. UFS_SRW_App/v1p0/fix, UFS_SRW_App/develop/fix, etc.
    • Each specific type of "fix" file should be in a subdirectory of FIX_DIR
      • FIXgsm = $FIX_DIR/fix_am (I am not sure what "fix_am" stands for, but it is the name of this directory for all platforms currently)
      • FIXaer = $FIX_DIR/fix_aer
      • FIXlut = $FIX_DIR/fix_lut
      • TOPO_DIR = $FIX_DIR/fix_orog
      • SFC_CLIMO_INPUT_DIR = $FIX_DIR/fix_sfc_climo

The difficult part (I think) is that we share these directories with the global model, so they will have to do the same. @JacobCarley-NOAA is that correct?

@mkavulich
Collaborator Author

mkavulich commented Apr 4, 2022

@gsketefian it is unclear to me how these fix files and directories are currently managed. On Cheyenne we are pointing to our own copies of the fix directories, so at least on that platform we are not uniform with how the global fix files are handled. On all platforms the fix subdirectories are the same, so it could be handled with a symbolic link on platforms where the global data is staged by the global group. But I don't know if this is the best way to go about it.

@mkavulich
Collaborator Author

In addition to the proposed changes above, I have also standardized the way that model input data is stored in its various subdirectories. The following text is also stored in a README file on disk in the input_model_data directory:

This file documents the directory structure and file naming conventions for the model input data in
this directory, for use with the UFS SRW App.

The FV3GFS directory contains output data from the GFS model v15 or later (using the FV3 dynamical
core). There are multiple subdirectories separating the different file formats in which FV3GFS data
can be provided: grib2, nemsio, and netcdf (not currently used).

Under each file-format subdirectory, files are separated by the initial forecast time of the
input data. For example, the directory 2019061500 contains data from a GFS forecast initialized at
00z on 20190615, shortly after GFS v15 was released.

The filenames are the same as found on HPSS data stores. For grib2, the format is

gfs.t{hh}z.pgrb2.0p25.f{fhr}

where {hh} is the 2-digit UTC hour of forecast initialization, and {fhr} is the 3-digit forecast
hour. For nemsio, the format is different depending on whether the file is from the initial
time or a forecast output.

Initial conditions:
gfs.t{hh}z.atmanl.nemsio (atmospheric fields)
gfs.t{hh}z.sfcanl.nemsio (surface fields)

Forecasts:
gfs.t{hh}z.atmf{fhr}.nemsio
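The naming templates above can be sketched as small formatting helpers; the function names and the `field` parameter are illustrative only:

```python
def gfs_grib2_name(hh, fhr):
    """grib2 filename: gfs.t{hh}z.pgrb2.0p25.f{fhr},
    with a 2-digit init hour and 3-digit forecast hour."""
    return f"gfs.t{hh:02d}z.pgrb2.0p25.f{fhr:03d}"

def gfs_nemsio_name(hh, fhr=None, field="atm"):
    """nemsio filename: analysis files use 'anl' in place of a
    forecast hour; field is 'atm' or 'sfc'."""
    if fhr is None:
        return f"gfs.t{hh:02d}z.{field}anl.nemsio"
    return f"gfs.t{hh:02d}z.{field}f{fhr:03d}.nemsio"
```

For example, `gfs_grib2_name(0, 3)` reproduces the first forecast file in the grib2 listing below.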

FV3GFS
  grib2
    2019061500
      gfs.t00z.pgrb2.0p25.f000
      gfs.t00z.pgrb2.0p25.f003
      gfs.t00z.pgrb2.0p25.f006
      gfs.t00z.pgrb2.0p25.f009
      gfs.t00z.pgrb2.0p25.f012
      gfs.t00z.pgrb2.0p25.f015
      gfs.t00z.pgrb2.0p25.f018
      gfs.t00z.pgrb2.0p25.f021
      gfs.t00z.pgrb2.0p25.f024
      gfs.t00z.pgrb2.0p25.f027
      gfs.t00z.pgrb2.0p25.f030
      gfs.t00z.pgrb2.0p25.f033
      gfs.t00z.pgrb2.0p25.f036
      gfs.t00z.pgrb2.0p25.f039
      gfs.t00z.pgrb2.0p25.f042
      gfs.t00z.pgrb2.0p25.f045
      gfs.t00z.pgrb2.0p25.f048
    2019061518
      gfs.t18z.pgrb2.0p25.f000
      gfs.t18z.pgrb2.0p25.f006
    2020081000
      gfs.t00z.pgrb2.0p25.f000
  nemsio
    2019070100
      gfs.t00z.atmanl.nemsio
      gfs.t00z.atmf003.nemsio
      gfs.t00z.atmf006.nemsio
      gfs.t00z.sfcanl.nemsio
    2019070112
      gfs.t12z.atmanl.nemsio
      gfs.t12z.atmf003.nemsio
      gfs.t12z.atmf006.nemsio
      gfs.t12z.sfcanl.nemsio
    2019070200
      gfs.t00z.atmanl.nemsio
      gfs.t00z.atmf003.nemsio
      gfs.t00z.atmf006.nemsio
      gfs.t00z.sfcanl.nemsio
    2019070212
      gfs.t12z.atmanl.nemsio
      gfs.t12z.atmf003.nemsio
      gfs.t12z.atmf006.nemsio
      gfs.t12z.sfcanl.nemsio

The GSMGFS directory contains output data from the GFS model v14 or earlier (the GFS spectral
model). Only nemsio format files are available, so there are no file format subdirectories.

GSMGFS
  2019052000
    gfs.t00z.atmanl.nemsio
    gfs.t00z.atmf006.nemsio
    gfs.t00z.sfcanl.nemsio

The HRRR directory contains output data from the HRRR model. The file names follow the Julian-day
naming convention used for files stored on the Jet machine:

{yy}{jjj}{hh}{ffhr}{mm}

where {yy} is the 2-digit year, {jjj} is the 3-digit Julian day, {hh} is the 2-digit UTC hour, {ffhr}
is the 4-digit forecast hour, and {mm} is the 2-digit forecast minute (00 for these hourly files).
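This Jet-style name can be built from an initialization time and forecast hour; a minimal sketch, assuming {mm} denotes forecast minutes (the helper name is hypothetical):

```python
from datetime import datetime

def jet_julian_name(init, fhr, minute=0):
    """Jet-style filename {yy}{jjj}{hh}{ffhr}{mm}.

    init:   datetime of the model initialization
    fhr:    forecast hour (zero-padded to 4 digits)
    minute: forecast minute (assumed meaning of {mm}; 0 here)
    """
    jjj = init.timetuple().tm_yday  # day of year (Julian day)
    return f"{init:%y}{jjj:03d}{init:%H}{fhr:04d}{minute:02d}"
```

For example, an HRRR run initialized 2020-08-01 00z (day 214 of a leap year) at forecast hour 0 yields `2021400000000`, matching the first entry in the listing below.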

HRRR
  2020080100
    2021400000000
  2020081000
    2022300000000
    2022300000300
    2022300000600
    2022300000900
    2022300001200
    2022300001500
    2022300001800
    2022300002100
    2022300002400

NAM data is in grib2 format, and follows a standard naming convention from HPSS stores:

nam.t{hh}z.awphys{fh}.tm00.grib2

where {hh} is the 2-digit UTC hour and {fh} is the 2-digit forecast hour.

NAM
  2021061500
    nam.t00z.awphys00.tm00.grib2
    nam.t00z.awphys03.tm00.grib2
    nam.t00z.awphys06.tm00.grib2

RAP input data, unlike all others, has a default "EXTRN_MDL_LBCS_OFFSET_HRS" of 3 in the UFS SRW
App; this means that the top-level date directory (indicating the initial time of the forecast
which created the input data) is 3 hours earlier than the initial time of the UFS SRW forecast.
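The offset logic above amounts to subtracting EXTRN_MDL_LBCS_OFFSET_HRS from the SRW initialization time to find the date directory; a minimal sketch (the function name is hypothetical):

```python
from datetime import datetime, timedelta

def rap_input_dir_date(srw_init, offset_hrs=3):
    """Name of the date directory holding staged RAP input data.

    The directory reflects the initial time of the RAP forecast that
    produced the files, i.e. the UFS SRW initialization time minus the
    LBCS offset (default 3 hours for RAP).
    """
    return (srw_init - timedelta(hours=offset_hrs)).strftime("%Y%m%d%H")
```

For example, an SRW forecast initialized 2020-08-01 00z draws RAP data from the directory `2020073121`, as in the listing below.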

As with HRRR data, RAP data uses the Julian-date naming convention for its output files,
standard on the Jet platform:

{yy}{jjj}{hh}{ffhr}{mm}

where {yy} is the 2-digit year, {jjj} is the 3-digit Julian day, {hh} is the 2-digit UTC hour, {ffhr}
is the 4-digit forecast hour, and {mm} is the 2-digit forecast minute (00 for these files).

RAP
  2020073121
    2021321000600
    2021321000900
  2020080921
    2022221000400
    2022221000500
    2022221000600
    2022221000700
    2022221000800
    2022221000900
    2022221001200
    2022221001500
    2022221001800
    2022221002100
    2022221002400
    2022221002700

@mkavulich
Collaborator Author

A few notes on model input data consolidation:

  • Currently, no WE2E tests that use staged data exercise FV3GFS data in netCDF format, so this subdirectory has not been created in the new input_model_data directory; it can easily be added in the future as new tests are introduced.
  • Custom model-based filenames have been eliminated in favor of standard model output names, as described in the above comment.
  • The new input_model_data directory has been greatly reduced in size by eliminating unused test data, from 777 GB to 222 GB (not intentionally round numbers, but pretty cool!)
