[develop] Adding Hercules as a Tier-1 platform #911

Merged
merged 10 commits into from
Sep 25, 2023

natalie-perlin
Collaborator

@natalie-perlin natalie-perlin commented Sep 12, 2023

Modulefiles and other configuration files to adapt the SRW App to the Hercules system at MSU.

The software stacks used for testing are based on hdf5/1.14.0 and netcdf/4.9.2, similar to those used in #889.

All fundamental tests pass. All but one of the comprehensive tests from the comprehensive.orion suite pass; the failing test is nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR.

Log files attached

DESCRIPTION OF CHANGES:

Add Hercules at MSU as a NOAA RDHPCS supported system

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

DEPENDENCIES:

Depends on #889

DOCUMENTATION:

ISSUE:

Fixes issue #885

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

WE2E_summary_20230916224448.txt
WE2E_summary_hercules_community.txt
WE2E_summary_hercules_comprehensive.txt

@MichaelLueken MichaelLueken added the enhancement, help wanted, and Work in Progress labels Sep 12, 2023
@MichaelLueken MichaelLueken changed the title Adding Hercules as a Tier-1 platform [develop] Adding Hercules as a Tier-1 platform Sep 12, 2023
@MichaelLueken MichaelLueken linked an issue Sep 12, 2023 that may be closed by this pull request
@natalie-perlin natalie-perlin removed the help wanted label Sep 15, 2023
@natalie-perlin
Collaborator Author

All fundamental tests pass on Hercules:

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              12.44
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              16.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.40
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              17.87
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              25.82
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              18.67
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              27.87
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             128.67

Detailed summary written to /work/noaa/epic/nperlin/hercules/SRW/expt_dirs/WE2E_summary_20230916224448.txt
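
As an illustrative aside, the Total row in a summary like the one above can be cross-checked by re-adding the per-experiment core hours. The following is a minimal Python sketch, not part of the WE2E tooling; the `parse_summary` helper and its regular expression are assumptions based only on the column layout shown in this comment.

```python
import re

# Rows copied from the fundamental-suite summary above, including the Total row.
SUMMARY = """\
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              12.44
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              16.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.40
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              17.87
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              25.82
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              18.67
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              27.87
Total                                                              COMPLETE             128.67
"""

# Hypothetical row pattern: experiment name, status, core hours.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+\.\d+)\s*$")


def parse_summary(text):
    """Return (name, status, core_hours) tuples, skipping the Total row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and match.group(1) != "Total":
            rows.append((match.group(1), match.group(2), float(match.group(3))))
    return rows


rows = parse_summary(SUMMARY)
total = round(sum(hours for _, _, hours in rows), 2)
print(len(rows), total)
```

For the seven rows above, the recomputed sum agrees with the reported total of 128.67 core hours.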

@natalie-perlin
Collaborator Author

Comprehensive test outcomes:
(the community test initially failed but was later rerun successfully, as shown below)

All 62 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD                                                           COMPLETE              32.10
community                                                          DEAD                   0.59
custom_ESGgrid                                                     COMPLETE              13.34
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              20.88
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              13.91
custom_ESGgrid_NewZealand_3km                                      COMPLETE              43.26
custom_ESGgrid_Peru_12km                                           COMPLETE              13.94
custom_ESGgrid_SF_1p1km                                            COMPLETE             135.95
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE               5.56
custom_GFDLgrid                                                    COMPLETE               6.57
deactivate_tasks                                                   COMPLETE               0.92
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             632.56
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE               8.79
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE               9.98
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   COMPLETE             206.73
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        COMPLETE             111.77
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             140.37
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              22.48
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              28.79
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              27.07
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              25.99
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               7.99
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              15.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              25.36
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              49.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              26.54
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              12.12
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               7.48
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              26.95
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              22.29
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             198.59
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             281.14
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             275.34
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             348.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             315.97
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              25.33
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              23.69
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              22.80
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              12.26
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              23.89
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              10.95
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             251.69
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             289.50
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             292.28
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE              65.07
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              19.67
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              20.91
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              19.10
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              28.05
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE              10.49
MET_ensemble_verification_only_vx                                  COMPLETE               0.61
MET_verification_only_vx                                           COMPLETE               0.11
nco                                                                COMPLETE              12.38
nco_ensemble                                                       COMPLETE              77.50
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              25.80
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.57
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             281.19
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       DEAD                   1.30
pregen_grid_orog_sfc_climo                                         COMPLETE               8.40
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE               6.72
specify_template_filenames                                         COMPLETE               8.11
----------------------------------------------------------------------------------------------------

community test:

All 1 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              27.64
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              27.64
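
The way the rerun changes the overall picture can be sketched as follows: later results override earlier ones for the same experiment, and whatever is still not COMPLETE is the remaining failure. This is a minimal Python illustration using a four-row excerpt of the comprehensive summary above, not the full 62 rows; the `merge_runs` and `still_failing` helpers are hypothetical names, not part of the SRW App.

```python
# Statuses excerpted from the comprehensive run above (not the full 62 rows).
first_run = {
    "2020_CAD": "COMPLETE",
    "community": "DEAD",
    "nco_ensemble": "COMPLETE",
    "nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR": "DEAD",
}

# Status from the standalone rerun of the community test.
rerun = {"community": "COMPLETE"}


def merge_runs(original, later):
    """Combine two runs; a later result replaces the earlier one by name."""
    merged = dict(original)
    merged.update(later)
    return merged


def still_failing(statuses):
    """Return the names of experiments whose final status is not COMPLETE."""
    return sorted(name for name, status in statuses.items() if status != "COMPLETE")


final = merge_runs(first_run, rerun)
print(still_failing(final))
```

With the rerun merged in, the only experiment left in this excerpt is the nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR test named in the PR description.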

@BruceKropp-Raytheon
Collaborator

@natalie-perlin - 'hercules' needs to be added as a valid machine in ./tests/build.sh (line 24).

@MichaelLueken
Collaborator

@natalie-perlin - Please update this branch to the latest HEAD of develop and address the conflicts in etc/lmod-setup.sh and ush/valid_param_vals.yaml. Please let me know if you would like any assistance with this. Thanks!

@MichaelLueken
Collaborator

Another topic specific to this PR is which WE2E tests to run on Hercules. For testing purposes, after updating tests/build.sh to allow the SRW App to build using the Jenkins build script (.cicd/scripts/srw_build.sh), I renamed tests/WE2E/machine_suites/coverage.cheyenne.gnu to coverage.hercules, since this suite of coverage tests isn't currently being run directly via Jenkins. The coverage tests were then run using the Jenkins test script (.cicd/scripts/srw_test.sh) and passed on Hercules:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE               9.68
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              10.93
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              30.78
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              18.27
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              26.97
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              63.40
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              12.86
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE              65.68
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              28.59
MET_verification_only_vx                                           COMPLETE               0.12
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE               8.92
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             276.20

A set of coverage tests will need to be added before the Jenkins label can be added to this PR.

@natalie-perlin
Collaborator Author

@MichaelLueken - please let me know if any more updates are needed!

@MichaelLueken
Collaborator

@natalie-perlin -

As @BruceKropp-Raytheon noted in his comment, the SRW App will not build in the Jenkins pipeline unless hercules is added to the list of machines on line 24 of tests/build.sh.

Since there is no coverage.hercules test suite in tests/WE2E/machine_suites, no tests will be run on Hercules. For the time being, at least, it would probably be a good idea to rename the coverage.cheyenne.gnu test suite to coverage.hercules, so that these WE2E tests are once again run in the Jenkins pipeline.

With these two modifications, I would be able to give my approval on these changes.
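
The two prerequisites named above (the machine appears in the valid-machine list, and a coverage.<machine> suite file exists) can be sketched as a small pre-flight check. This is a hedged Python illustration only: the real check lives in the shell script tests/build.sh, whose contents are not reproduced in this thread, so `VALID_MACHINES` and `check_machine` are assumptions made for the example.

```python
from pathlib import Path
import tempfile

# Illustrative machine list only; the real list is on line 24 of tests/build.sh
# and is not shown in this conversation.
VALID_MACHINES = ("hera", "jet", "orion", "hercules")


def check_machine(machine, suites_dir):
    """Return a list of problems that would keep coverage tests from running."""
    problems = []
    name = machine.lower()
    if name not in VALID_MACHINES:
        problems.append(f"{name} is not in the list of valid machines")
    suite = Path(suites_dir) / f"coverage.{name}"
    if not suite.is_file():
        problems.append(f"missing suite file {suite.name}")
    return problems


# Use a temporary directory as a stand-in for tests/WE2E/machine_suites.
with tempfile.TemporaryDirectory() as tmp:
    before = check_machine("hercules", tmp)
    (Path(tmp) / "coverage.hercules").write_text("MET_verification_only_vx\n")
    after = check_machine("hercules", tmp)

print(before)
print(after)
```

In this sketch the first call reports the missing suite file, and the second call reports no problems once a coverage.hercules file is present.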

@natalie-perlin
Collaborator Author

@MichaelLueken - done!

@MichaelLueken
Collaborator

Thanks, @natalie-perlin! I don't see the update for tests/build.sh though. This will still need to be done as well.

Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Tests passed on Hercules.
As @MichaelLueken said, hercules needs to be added to the list of machines in tests/build.sh, line 24.

@natalie-perlin
Collaborator Author

... Double-checked to verify the changes in ./tests/build.sh got recorded in GitHub!

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin - Thank you very much for adding hercules to the list of valid machines for the tests/build.sh build script! I was able to successfully build the SRW on Hercules using the Jenkins build method. The Jenkins test script was also tested and successfully ran the coverage.hercules test suite. Approving the work now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests label Sep 22, 2023
@BruceKropp-Raytheon
Collaborator

@natalie I can confirm that adding hercules to ./tests/build.sh has allowed automated builds. Thank you.

@MichaelLueken
Collaborator

@natalie-perlin - An issue was encountered on Hercules that caused the Jenkins tests to fail to clone the repository on the machine. PSD-41 was opened with the Platform team to see if they can determine what happened during the Initialize stage on Hercules. Just wanted to give you a heads-up.

@MichaelLueken
Collaborator

The Derecho WE2E coverage tests were manually run on Derecho and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.16
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              34.68
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              41.76
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              25.61
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              19.92
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.79
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.11
pregen_grid_orog_sfc_climo                                         COMPLETE              13.49
specify_template_filenames                                         COMPLETE              13.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             231.20

@natalie-perlin
Collaborator Author

@MichaelLueken -
Cloning the repository requires python2, which is no longer available on Hercules now that python2 support is discontinued. Loading any python module solves this problem. Which script in ./.cicd/ needs "module load python" added to make python(2) available to clone the SRW?
A similar problem may exist for Gaea c5 as well.

@MichaelLueken
Collaborator

@natalie-perlin - The issue with the Jenkins tests on Hercules is that the same location is used to run the WE2E tests for both Orion and Hercules. Orion ultimately takes priority, causing the Hercules testing to fail to clone the repository. I reached out to the Platform team, and they told me that running the Hercules test separately will allow the testing to complete successfully. It appears that the epic account on Hercules is already set up to load a version of python.

The Jenkins tests on Hercules successfully built the SRW and ran the coverage WE2E tests:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE               9.04
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              11.70
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              30.54
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              18.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              26.24
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              51.78
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              13.48
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE              65.57
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              30.12
MET_verification_only_vx                                           COMPLETE               0.12
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE               8.87
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             266.39

Moving forward, a second job will need to be queued manually in the pipeline, in order to run the WE2E coverage tests on Hercules (until the Platform team finds a better method, either through the use of 'dir' in the Jenkinsfile or possible updates to the Jenkins runner on either Orion or Hercules).
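
The workspace collision described above, and the per-machine separation that a fix would aim for, can be modeled in a few lines. This is a conceptual Python sketch only: the actual Jenkinsfile is Groovy and its configuration is not shown in this thread, and the base path, job name, and both helper functions are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical base directory and job name, used only for illustration.
BASE = "/path/to/jenkins/workspace"
JOB = "srw-pr-911"


def shared_workspace(base, job):
    """Workspace keyed by job only: every machine resolves to the same path."""
    return PurePosixPath(base) / job


def per_machine_workspace(base, job, machine):
    """Workspace keyed by job and machine: each machine gets its own directory."""
    return PurePosixPath(base) / job / machine.lower()


machines = ["Orion", "Hercules"]
shared = {m: shared_workspace(BASE, JOB) for m in machines}
separate = {m: per_machine_workspace(BASE, JOB, m) for m in machines}

print(len(set(shared.values())))    # number of distinct shared paths
print(len(set(separate.values())))  # number of distinct per-machine paths
```

When both machines resolve to one path, whichever job starts first owns the directory; keying the path on the machine name removes that overlap.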

I can move forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit 87dbf19 into ufs-community:develop Sep 25, 2023
@natalie-perlin natalie-perlin deleted the develop_hercules branch October 13, 2023 03:59
Labels
enhancement, run_we2e_coverage_tests
Development

Successfully merging this pull request may close these issues.

Add Hercules to supported platforms, as Tier-1 system
4 participants