[develop] Update weather model, UPP, and UFS_UTILS hashes #1050

Merged on Mar 27, 2024 (20 commits)

Conversation

MichaelLueken
Collaborator

@MichaelLueken MichaelLueken commented Mar 6, 2024

DESCRIPTION OF CHANGES:

This PR will update the ufs-weather-model hash to 8518c2c (March 1), the UPP hash to 945cb2c (January 23), and the UFS_UTILS hash to 57bd832 (February 6).

This work also required several modifications to allow the updated weather model and UFS_UTILS hashes to work in the SRW:

  • Update spack-stack to v1.5.1
  • Rename NEMS/nems to UFS/ufs
  • Remove ush/set_ozone_param.py (the ozphys schemes in the SDFs were removed in the weather model)
  • Update path to noahmptable.tbl
  • Add two new fields to INPS (MASK_ONLY and MERGE_FILE) for make_orog task
  • Make changes to allow for the updated method of finding CRES in chgres_cube

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel (Builds, fundamental, aqm WE2E test)
  • orion.intel (Builds, fundamental)
  • hercules.intel (Builds, fundamental, comprehensive)
  • derecho.intel (Builds, fundamental)
  • gaea.intel (Builds, fundamental)
  • jet.intel (Builds, fundamental, comprehensive)
  • fundamental test suite
  • comprehensive tests (Hercules, Jet)

DEPENDENCIES:

None

DOCUMENTATION:

Documentation in ConfigWorkflow.rst has been updated to show renaming of NEMS/nems to UFS/ufs.

ISSUE:

Fixes #1049

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes

CONTRIBUTORS (optional):

@mkavulich

@mkavulich
Collaborator

mkavulich commented Mar 6, 2024

Remove ush/set_ozone_param.py (ozphys scheme in SDFs were removed in the weather model)

For more background on this point: the stratospheric ozone physics schemes were reorganized (see ufs-community/ufs-weather-model#1851, NOAA-EMC/fv3atm#661, ufs-community/ccpp-physics#75) so that they are now controlled solely by input.nml, where previously they were controlled by both the namelist and the suite definition file. Any future ozone physics changes will therefore need to be tied to the namelist options. Currently the only supported ozone scheme is the NRL 2015 ozone scheme (oz_phys_2015 = .true.), so there is no need for any scheme-specific logic, hence the removal of that file.
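Since the ozone scheme is now selected only through the namelist, a quick sanity check on a generated input.nml could look like the sketch below. This is only an illustration: the helper name `read_namelist_logical`, the sample text, and the `&gfs_physics_nml` group shown in it are assumptions for the example, and the regex assumes one `key = value` entry per line rather than handling the full Fortran namelist syntax.

```python
import re


def read_namelist_logical(text, key):
    """Return True/False for a logical namelist entry, or None if the key is absent.

    Hypothetical helper: assumes one `key = value` entry per line.
    """
    pattern = re.compile(
        rf"^\s*{re.escape(key)}\s*=\s*\.?(t|f)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1).lower() == "t"


# Illustrative namelist fragment (group name and layout are assumptions).
SAMPLE_NML = """
&gfs_physics_nml
  oz_phys = .false.
  oz_phys_2015 = .true.
/
"""

if read_namelist_logical(SAMPLE_NML, "oz_phys_2015") is not True:
    raise SystemExit("oz_phys_2015 is not enabled in the namelist")
```

For real namelists, a dedicated parser would be more robust than a regex.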

@mkavulich
Collaborator

mkavulich commented Mar 6, 2024

@MichaelLueken I have run into a problem on Derecho that I'm unable to solve: this occurred both in my preliminary branch and your current branch. It has to do with the installation of the srw conda packages:

error    libmamba Bad conversion of Python version '3.10.12': filesystem error: temp_directory_path: No such file or directory
./Miniforge3-Linux-x86_64.sh: line 339: 109438 Segmentation fault      (core dumped) CONDA_SAFETY_CHECKS=disabled CONDA_EXTRA_SAFETY_CHECKS=no CONDA_CHANNELS="conda-forge" CONDA_PKGS_DIRS="$PREFIX/pkgs" "$CONDA_EXEC" install --offline --file "$PREFIX/pkgs/env.txt" -yp "$PREFIX"
./devbuild.sh: line 228: conda/etc/profile.d/conda.sh: No such file or directory
./devbuild.sh: line 233: conda: command not found
./devbuild.sh: line 234: conda: command not found
./devbuild.sh: line 235: mamba: command not found
./devbuild.sh: line 237: conda: command not found
./devbuild.sh: line 238: mamba: command not found

I assume we'll need to enlist the help of the unified workflow team on this one? Or it could be related to the updated spack-stack build. Regardless, I'll wait to see if you can replicate the problem to make sure it's not just a problem with my environment.

@MichaelLueken
Collaborator Author

@mkavulich I have just cloned a fresh copy of the feature/hash_update branch on Derecho and I was able to successfully build the App using ./devbuild.sh -p=derecho:

[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho/exec/ufs_srweather_app.settings
mlueken@derecho6:/glade/derecho/scratch/mlueken/ufs-srweather-app/derecho>

@mkavulich
Collaborator

@MichaelLueken Did you check that the conda package installed correctly as well? The code actually builds successfully for me, it's the conda package that fails to install.

@MichaelLueken
Collaborator Author

@mkavulich Yes, the conda package was correctly installed. Both the srw_app and srw_graphics conda environments were also created. The fundamental tests were run and all passed successfully.

My working copy on Derecho can be found at /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho. The fundamental test results can be found in /glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs.

@mkavulich
Collaborator

mkavulich commented Mar 6, 2024

Thanks for confirming, and sorry for cluttering up the PR with my own issues. I did confirm that it works on Hera, so I'll continue my testing there while I try to figure out this Derecho issue.

Edit: for future reference, this conda error was caused by the TMPDIR environment variable in my environment pointing to a non-existent directory. This is the issue that helped me solve it: conda-forge/miniforge#474
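A pre-flight check for this failure mode can be sketched in a few lines of Python. This is a minimal sketch, not part of devbuild.sh: the function name `tmpdir_problem` is hypothetical, and it only checks that TMPDIR, if set, points to an existing, writable directory.

```python
import os


def tmpdir_problem(env=None):
    """Return a description of the problem if TMPDIR is set but unusable, else None.

    Hypothetical helper; pass a dict as `env` to check something other than os.environ.
    """
    env = os.environ if env is None else env
    tmpdir = env.get("TMPDIR")
    if not tmpdir:
        # Unset or empty: tools normally fall back to a default such as /tmp.
        return None
    if not os.path.isdir(tmpdir):
        return f"TMPDIR={tmpdir!r} does not exist or is not a directory"
    if not os.access(tmpdir, os.W_OK):
        return f"TMPDIR={tmpdir!r} is not writable"
    return None


problem = tmpdir_problem()
if problem:
    print(f"Warning: {problem}; unset TMPDIR or create the directory before building")
```

Running a check like this before the conda installation step would surface the misconfiguration directly instead of as a libmamba filesystem error.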

MichaelLueken and others added 3 commits March 11, 2024 13:42
@chan-hoo
Collaborator

@MichaelLueken, I was able to build the app on Derecho successfully. However, after yesterday's PM, it fails on Hera with the following message:

Error running link command: Segmentation fault
make[5]: *** [FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/build.make:97: FV3/ccpp/framework/src/libccpp_framework.a] Error 1
make[5]: *** Deleting file 'FV3/ccpp/framework/src/libccpp_framework.a'
make[4]: *** [CMakeFiles/Makefile2:469: FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....

This may be a system issue. Do you have any idea what is happening there?

@MichaelLueken
Collaborator Author

Thanks for the review, @chan-hoo! At the moment, I suspect that the issue is due to the Rocky8 transition. Once PR #1054 is merged, I will update my feature/hash_update branch to the HEAD of develop. Hopefully that is all that will be required.

@MichaelLueken
Collaborator Author

@chan-hoo -

I have updated my branch to the HEAD of develop. The SRW is successfully building once again on the default Rocky front ends:

[100%] Completed 'ufs-weather-model'
[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/hera/exec/ufs_srweather_app.settings

Please let me know if you continue to encounter issues while compiling or running on Hera.

@chan-hoo
Collaborator

@MichaelLueken, it works well now. :) Thanks!

@MichaelLueken
Collaborator Author

The update made to aqm_environment.yml was enough to allow the AQM WE2E test to successfully run on Hera Rocky8:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240320175754                   COMPLETE            4890.58
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4890.58

Collaborator

@mkavulich mkavulich left a comment

Looks good, thanks for adopting my suggested changes 👍

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) Mar 21, 2024
@MichaelLueken
Collaborator Author

The Jenkins runner on Hera appears to have connected to a CentOS front end when maintenance concluded. The Jenkins tests on Hera failed to compile the SRW App. Reached out to the Platform Team via PSD-85 to request that they connect to a Rocky8 front end.

Additionally, on Hera and Jet, the Functional WorkflowTaskTests are failing in run_fcst with the following error message:

FATAL from PE 0: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

Further investigation is necessary to determine why the tests run fine on Derecho, Hercules, and Orion but not on Hera and Jet (why is PBSPro okay, but not Slurm?), and why the issue appears with the wrapper scripts but not when the tasks are run as part of the workflow.

The WE2E coverage tests were manually launched on Hera Intel and successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240321180137                            COMPLETE              31.36
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024032  COMPLETE               6.84
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE            1515.46
get_from_HPSS_ics_HRRR_lbcs_RAP_20240321180143                     COMPLETE              14.76
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.43
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              14.16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240321180147  COMPLETE              10.66
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.74
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             447.16
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240321  COMPLETE             587.99
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403211  COMPLETE            1024.58
pregen_grid_orog_sfc_climo_20240321180154                          COMPLETE               7.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            3675.81

@MichaelLueken
Collaborator Author

As part of this PR, I removed the use of the PET list and added back in the original atmos_nthreads capability. This works fine while running the workflow and using the wrapper scripts (Functional WorkflowTaskTests - wrapper_srw_ftest.sh) on systems that use PBSPro, but the wrapper scripts on systems that use Slurm are failing due to bad PET list bounds. Adding the PET list back in allows the wrapper scripts to pass on Slurm systems, but causes the workflow to fail in run_fcst.

I'm now doing a deep dive to see if the PET list aspect of the weather model has been updated without an update to the documentation.

@MichaelLueken
Collaborator Author

I had been updating the pe_member01_m1 entry in ufs.configure, rather than updating the PE_MEMBER01 entry in config_defaults.yaml. This was leading to the correct output in the ufs.configure file, but an incorrect value for PE_MEMBER01, leading to failure. Applying the necessary update to PE_MEMBER01 is now allowing the majority of the fundamental tests to properly run.

Unfortunately, the grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR WE2E test is still failing. The failure message is:

FATAL from PE 6: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

It appears as though the modification to DT_ATMOS, LAYOUT_X, LAYOUT_Y, or BLOCKSIZE is having an adverse effect on the test. Will investigate further.
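The "At least one pe in pelist is not used by any tile in the mosaic" error typically points to a mismatch between the number of MPI tasks requested and the number the domain decomposition can use. A rough consistency check is sketched below. It assumes the common FV3 convention that compute tasks equal LAYOUT_X × LAYOUT_Y, plus write-component tasks when quilting is on; it counts MPI tasks only and ignores threading (e.g. atmos_nthreads). The function names are hypothetical, and how the SRW App actually derives PE_MEMBER01 is not shown in this thread.

```python
def expected_mpi_tasks(layout_x, layout_y, quilting=False,
                       write_groups=0, write_tasks_per_group=0):
    """MPI tasks implied by the layout (assumed convention, threads ignored)."""
    compute_tasks = layout_x * layout_y
    write_tasks = write_groups * write_tasks_per_group if quilting else 0
    return compute_tasks + write_tasks


def check_pe_count(pe_member01, **layout):
    """Return a message if pe_member01 disagrees with the layout, else None."""
    expected = expected_mpi_tasks(**layout)
    if pe_member01 != expected:
        return (f"PE_MEMBER01={pe_member01} but the layout implies "
                f"{expected} MPI tasks")
    return None


# Illustrative values only, not taken from any SRW test configuration.
print(check_pe_count(12, layout_x=5, layout_y=2, quilting=True,
                     write_groups=1, write_tasks_per_group=2))
```

A check like this helps when only one of the two places (the rendered ufs.configure or the PE_MEMBER01 setting) was updated, since the two values then disagree.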

@MichaelLueken
Collaborator Author

The Jenkins tests successfully passed. After addressing conflicts, the AQM WE2E test was run one last time and also successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240327143007                   COMPLETE            4865.32
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4865.32

Merging this PR now.

Labels
run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The weather model hash should be updated to the latest main branch
3 participants