Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sixth reconciliation PR from production/RRFS.v1 #2529

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

grantfirl
Copy link
Collaborator

@grantfirl grantfirl commented Dec 5, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR moves two PRs from production/RRFS.v1 to the develop branch:

#2226 from @MatthewPyle-NOAA

Provides a set of changes changes needed to take advantage of FMS parallel IO changes. Changes courtesy of Dan Kokron of GDIT.

#2249 from @dustinswales

This PR adds a new diagnostic, instantaneous downward shortwave flux at the surface assuming clear-sky conditions.

Commit Message:

* UFSWM - Provides a set of changes changes needed to take advantage of FMS parallel IO changes; adds a new diagnostic, instantaneous downward shortwave flux at the surface assuming clear-sky conditions
  * FV3 - Provides a set of changes changes needed to take advantage of FMS parallel IO changes; adds a new diagnostic, instantaneous downward shortwave flux at the surface assuming clear-sky conditions
    * ccpp-physics - adds a new diagnostic, instantaneous downward shortwave flux at the surface assuming clear-sky conditions
    * atmos_cubed_sphere - Provides a set of changes changes needed to take advantage of FMS parallel IO changes

Priority:

  • High: Needed for SRW App

Git Tracking

UFSWM:

  • None

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@grantfirl
Copy link
Collaborator Author

Note: I pushed the Hera RT log for the tests that I ran. All tests succeeded except for 2 time-outs. I did not re-run those, but they are not expected to fail if they are re-run. Perhaps they need more wallclock time added?

@jkbk2004
Copy link
Collaborator

@grantfirl Sorry for delay! as wcoss2 maintenance continues, we may pursue baseline change PRs first. can you sync up this pr? CDEPS update PR #2538 has no baseline change as well. It's good to combine. Just need to point to https://github.com/NickSzapiro-NOAA/CDEPS/tree/sync_escomp_2024-12-16

@grantfirl
Copy link
Collaborator Author

grantfirl commented Dec 18, 2024

@grantfirl Sorry for delay! as wcoss2 maintenance continues, we may pursue baseline change PRs first. can you sync up this pr? CDEPS update PR #2538 has no baseline change as well. It's good to combine. Just need to point to https://github.com/NickSzapiro-NOAA/CDEPS/tree/sync_escomp_2024-12-16

@jkbk2004 OK, everything has been synced and #2538 has been combined into this. It should be ready to go.

@BrianCurtis-NOAA BrianCurtis-NOAA added No Baseline Change No Baseline Change Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. labels Dec 18, 2024
@jkbk2004
Copy link
Collaborator

A few cases with timeout issues across machines: not major problem. But on orion/hercules, control_restart_p8_intel failed in run_test

  71: zstd: Src size is incorrect
 71:
 71: FATAL from PE    71: NetCDF: HDF error: netcdf_read_data_3d: file:INPUT/fv_core.res.tile3.nc- variable:u
 71:
  0:
  0: FATAL from PE     0: NetCDF: HDF error: netcdf_read_data_3d: file:INPUT/fv_core.res.tile1.nc- variable:u

@grantfirl
Copy link
Collaborator Author

https://github.com/NOAA-EMC/fv3atm/pull/896/files#diff-ff4a849e50baa285665db70718c46db9524ddbcce15d0c6795c8078dee38195d

A few cases with timeout issues across machines: not major problem. But on orion/hercules, control_restart_p8_intel failed in run_test

  71: zstd: Src size is incorrect
 71:
 71: FATAL from PE    71: NetCDF: HDF error: netcdf_read_data_3d: file:INPUT/fv_core.res.tile3.nc- variable:u
 71:
  0:
  0: FATAL from PE     0: NetCDF: HDF error: netcdf_read_data_3d: file:INPUT/fv_core.res.tile1.nc- variable:u

@jkbk2004 @MatthewPyle-NOAA I don't think that the CCPP changes could cause this error, but perhaps the changes from #2226 could? @MatthewPyle-NOAA Have you come across a similar error before during any of your testing for the original PR?

@jkbk2004 jkbk2004 removed the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Dec 19, 2024
@grantfirl
Copy link
Collaborator Author

@jkbk2004 I've asked @MatthewPyle-NOAA and @dkokron to help debug the issue since I believe that the error is coming from the parallel FMS IO changes that they introduced.

@grantfirl
Copy link
Collaborator Author

grantfirl commented Jan 7, 2025

@jkbk2004 This PR should be ready to be worked on again. We've effectively "side-stepped" the error for now, but this was deemed OK by @MatthewPyle-NOAA and @DusanJovic-NOAA.

Plus, an issue was added in atmos_cubed_sphere to fix the problem in the future, if necessary on the failing platforms: NOAA-GFDL/GFDL_atmos_cubed_sphere#368

@grantfirl
Copy link
Collaborator Author

On the issue, @bensonr posits that the issue on the specific platforms could be related to lack of memory: NOAA-GFDL/GFDL_atmos_cubed_sphere#368 (comment)

@jkbk2004
Copy link
Collaborator

On the issue, @bensonr posits that the issue on the specific platforms could be related to lack of memory: NOAA-GFDL/GFDL_atmos_cubed_sphere#368 (comment)

@grantfirl worth to increase resource? I will try over weekend.

@bensonr
Copy link
Contributor

bensonr commented Jan 10, 2025

@jkbk2004 - as mentioned in the issue, it could be lack of proper access to the given resources as well. The issue mentions ensuring the user profile is setting the stacksize correctly at shell instantiation (in the startup files).

@grantfirl
Copy link
Collaborator Author

On the issue, @bensonr posits that the issue on the specific platforms could be related to lack of memory: NOAA-GFDL/GFDL_atmos_cubed_sphere#368 (comment)

@grantfirl worth to increase resource? I will try over weekend.

I would think so, just to get to the bottom of the issue. Thanks for trying over the weekend. I'm not an expert in setting the platform RT environments, so the help is much appreciated. I still think that if the functionality is really only needed for RRFS on WCOSS that it makes sense to target it as such if possible, although perhaps it would benefit all platforms if it is made to work? It sounds like enabling more parallel IO may speed up some tests at the expense of using more memory? I have no idea if that tradeoff is desired or not.

@jkbk2004
Copy link
Collaborator

@grantfirl Increasing resource for the control_restart_p8_intel failure doesn't help. I think we may need to go with adding an machine specific option to compile.sh

CMAKE_FLAGS+="  -DENABLE_PARALLELRESTART=OFF or ON"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
No Baseline Change No Baseline Change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants