Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sixth reconciliation PR from production/RRFS.v1 #896

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

grantfirl
Copy link
Collaborator

Description

This PR contains changes from 2 PRs that went into production/RRFS.v1:

Provides the changes needed to take advantage of FMS parallel IO changes. Changes courtesy of Dan Kokron of GDIT. Will be paired with changes in atmos_cubed_sphere

This PR adds a new diagnostic, instantaneous downward shortwave flux at the surface assuming clear-sky conditions.

Issue(s) addressed

None

Testing

Tested on Hera using full rt.conf. See logs in UWM PR: =

Dependencies

ufs-community/ccpp-physics#235
NOAA-GFDL/GFDL_atmos_cubed_sphere#365

Requirements before merging

  • All new code in this PR is tested by at least one unit test
  • All new code in this PR includes Doxygen documentation
  • All new code in this PR does not add new compilation warnings (check CI output)

@grantfirl
Copy link
Collaborator Author

grantfirl commented Jan 3, 2025

@MatthewPyle-NOAA @dkokron When running UFS regression tests with this code, the control_restart_p8_intel test is failing with a segmentation fault. If you have access to Hera, you can see the logs here: /scratch1/BMC/gmtb/Grant.Firl/stmp2/Grant.Firl/FV3_RT/rt_1304460/control_restart_p8_intel

The err log shows the first non-libarary error as:
0x0000000002232a28 fv_io_mod_mp_fv_io_read_restart_() /scratch1/BMC/gmtb/Grant.Firl/ufs-weather-model-grantfirl/FV3/atmos_cubed_sphere/tools/fv_io.F90:495

This seems to be related to the changes that were made in this PR. Could you please help to debug this?

@dkokron
Copy link
Contributor

dkokron commented Jan 3, 2025 via email

@grantfirl
Copy link
Collaborator Author

I don't have access to Hera. Does this failure happen on Acorn too?

On Fri, Jan 3, 2025 at 9:46 AM Grant Firl @.> wrote: @MatthewPyle-NOAA https://github.com/MatthewPyle-NOAA @dkokron https://github.com/dkokron When running UFS regression tests with this code, the control_restart_p8_intel test is failing with a segmentation fault. If you have access to Hera, you can see the logs here: /scratch1/BMC/gmtb/Grant.Firl/stmp2/Grant.Firl/FV3_RT/rt_1304460/control_restart_p8_intel The err log shows the first non-libarary error as: 0x0000000002232a28 fv_io_mod_mp_fv_io_read_restart_() /scratch1/BMC/gmtb/Grant.Firl/ufs-weather-model-grantfirl/FV3/atmos_cubed_sphere/tools/fv_io.F90:495 This seems to be related to the changes that were made in this PR. Could you please help to debug this? — Reply to this email directly, view it on GitHub <#896 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACODV2DOTNIIHTKZXNEHNNL2I2WGXAVCNFSM6AAAAABTCZ24DSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRZGQ2DENRTGI . You are receiving this because you were mentioned.Message ID: @.>

So far, it's been noted on Hera and Hercules. I'm guessing it would also show up on Acorn, but I don't have access.

@dkokron
Copy link
Contributor

dkokron commented Jan 3, 2025

@BrianCurtis-NOAA normally runs the UFS regression suite on acorn. Is he scheduled to do the same for this release?

@grantfirl
Copy link
Collaborator Author

@BrianCurtis-NOAA normally runs the UFS regression suite on acorn. Is he scheduled to do the same for this release?

This PR isn't necessarily associated with a release. It's just bringing changes from the RRFSv1 release branch into the main develop branch. Perhaps we can ask @BrianCurtis-NOAA to run the failing test only on Acorn for you to see if it fails there and you can have access to the run directory with logs?

@dkokron
Copy link
Contributor

dkokron commented Jan 3, 2025

I've run the UFS regression suite before. I was just wondering if Brian had done this already on acorn. Will the following clone get me to a state where I can reproduce the failure?

git clone --recursive [email protected]:ufs-community/ufs-weather-model.git CloneUFSForRegressionTesting

@grantfirl
Copy link
Collaborator Author

I've run the UFS regression suite before. I was just wondering if Brian had done this already on acorn. Will the following clone get me to a state where I can reproduce the failure?

git clone --recursive [email protected]:ufs-community/ufs-weather-model.git CloneUFSForRegressionTesting

I don't know if they had gotten to Acorn yet or not, but they stopped testing once this issue was noticed.

I'd use:
git clone --branch rrfsv1-to-ufs/dev6 --recursive https://github.com/grantfirl/ufs-weather-model.git
since this is the source branch for the PR. I'm sure that there are other ways.

@dkokron
Copy link
Contributor

dkokron commented Jan 3, 2025

Runs fine on acorn.
acorn:/lfs/h1/hpc/support/daniel.kokron/FV3_RT/rt_1846251/control_restart_p8_intel

Please try disabling the ENABLE_PARALLELRESTART feature in your Hera build.

line 39 of FV3/atmos_cubed_sphere/CMakeLists.txt
from
option(ENABLE_PARALLELRESTART "Enable collective parallel reads -DENABLE_PARALLELRESTART" ON)
to
option(ENABLE_PARALLELRESTART "Enable collective parallel reads -DENABLE_PARALLELRESTART" OFF)

@grantfirl
Copy link
Collaborator Author

Runs fine on acorn. acorn:/lfs/h1/hpc/support/daniel.kokron/FV3_RT/rt_1846251/control_restart_p8_intel

Please try disabling the ENABLE_PARALLELRESTART feature in your Hera build.

line 39 of FV3/atmos_cubed_sphere/CMakeLists.txt from option(ENABLE_PARALLELRESTART "Enable collective parallel reads -DENABLE_PARALLELRESTART" ON) to option(ENABLE_PARALLELRESTART "Enable collective parallel reads -DENABLE_PARALLELRESTART" OFF)

Interesting, thanks. I'll try your suggestion on Hera and report back on this PR.

@grantfirl
Copy link
Collaborator Author

@dkokron The offending test on Hera passes with -DENABLE_PARALLELRESTART" OFF.

@dkokron
Copy link
Contributor

dkokron commented Jan 3, 2025

@bensonr
Our agreement with the FMS team was that they would take ownership after they accepted the contributions that enable this feature. How do you want to proceed?

@grantfirl
Copy link
Collaborator Author

grantfirl commented Jan 6, 2025

@dkokron The offending test on Hera passes with -DENABLE_PARALLELRESTART" OFF.

@dkokron @bensonr @jkbk2004 FYI, the error is reproducible on Hera. The test completes successfully with -DENABLE_PARALLELRESTART" OFF and fails with -DENABLE_PARALLELRESTART" ON.

@jkbk2004 @MatthewPyle-NOAA How important is it for #818 to be in the develop branch for the SRW App? I could split this PR into just the CCPP component that should pass/merge and then leave the debugging of #818 for a separate PR if that makes sense to do so.

@MatthewPyle-NOAA
Copy link
Collaborator

@grantfirl I would say this functionality isn't important for the SRW App. It is a pretty WCOSS-targeting change for running the RRFS efficiently there. But can it be included with -DENABLE_PARALLELRESTART OFF?

@grantfirl
Copy link
Collaborator Author

@grantfirl I would say this functionality isn't important for the SRW App. It is a pretty WCOSS-targeting change for running the RRFS efficiently there. But can it be included with -DENABLE_PARALLELRESTART OFF?

Thanks @MatthewPyle-NOAA I'm fine with changing -DENABLE_PARALLELRESTART to OFF in the CMakeLists.txt of atmos_cubed_sphere. If I understand correctly, this is providing somewhat of a default value anyway, and it can be turned on via argument to cmake, so this seems like a good compromise for now. I'm not in charge of this repo though. What do @DusanJovic-NOAA, @BrianCurtis-NOAA, and @jkbk2004 think?

@dkokron @bensonr We can add an issue in atmos_cubed_sphere for someone to debug this later if it is ever necessary for the broken platforms to have this functionality in the future?

@DusanJovic-NOAA
Copy link
Collaborator

@grantfirl I would say this functionality isn't important for the SRW App. It is a pretty WCOSS-targeting change for running the RRFS efficiently there. But can it be included with -DENABLE_PARALLELRESTART OFF?

Thanks @MatthewPyle-NOAA I'm fine with changing -DENABLE_PARALLELRESTART to OFF in the CMakeLists.txt of atmos_cubed_sphere. If I understand correctly, this is providing somewhat of a default value anyway, and it can be turned on via argument to cmake, so this seems like a good compromise for now. I'm not in charge of this repo though. What do @DusanJovic-NOAA, @BrianCurtis-NOAA, and @jkbk2004 think?

@dkokron @bensonr We can add an issue in atmos_cubed_sphere for someone to debug this later if it is ever necessary for the broken platforms to have this functionality in the future?

I agree. Set ENABLE_PARALLELRESTART to OFF by default. Then it can be turned ON only on WCOSS2 for now until the issues are resolved on other platforms.

…lt to avoid RT errors on some platforms (Hera, Hercules at least)
@grantfirl
Copy link
Collaborator Author

@grantfirl I would say this functionality isn't important for the SRW App. It is a pretty WCOSS-targeting change for running the RRFS efficiently there. But can it be included with -DENABLE_PARALLELRESTART OFF?

Thanks @MatthewPyle-NOAA I'm fine with changing -DENABLE_PARALLELRESTART to OFF in the CMakeLists.txt of atmos_cubed_sphere. If I understand correctly, this is providing somewhat of a default value anyway, and it can be turned on via argument to cmake, so this seems like a good compromise for now. I'm not in charge of this repo though. What do @DusanJovic-NOAA, @BrianCurtis-NOAA, and @jkbk2004 think?
@dkokron @bensonr We can add an issue in atmos_cubed_sphere for someone to debug this later if it is ever necessary for the broken platforms to have this functionality in the future?

I agree. Set ENABLE_PARALLELRESTART to OFF by default. Then it can be turned ON only on WCOSS2 for now until the issues are resolved on other platforms.

@DusanJovic-NOAA @dkokron @MatthewPyle-NOAA I've turned it off in NOAA-GFDL/GFDL_atmos_cubed_sphere#365 in order to pass RTs for this PR. I'll add an issue in atmos_cubed_sphere for someone to debug this when there is an opportunity.

@grantfirl
Copy link
Collaborator Author

@grantfirl
Copy link
Collaborator Author

@dkokron @MatthewPyle-NOAA By turning ENABLE_PARALLELRESTART OFF in atmos_cubed_sphere, does it mean that the functionality being introduced into the develop branch is effectively not being tested at all by UFS regression tests? If so, should a regression test be added/augmented to turn this on for a platform that it does work on?

@MatthewPyle-NOAA
Copy link
Collaborator

@grantfirl Good question. Might be nice to have, but feel like there always is a push to reduce the number of regression tests. So not really sure.

@dkokron
Copy link
Contributor

dkokron commented Jan 8, 2025

I think RRFS depends on this feature, so I vote yes on having the feature enabled when the RT suite is run on WCOSS2.

@jkbk2004
Copy link
Collaborator

I think RRFS depends on this feature, so I vote yes on having the feature enabled when the RT suite is run on WCOSS2.

@dkokron do you think that ENABLE_PARALLELRESTART CMake option can be moved to somewhere https://github.com/ufs-community/ufs-weather-model/tree/develop/cmake ? I mean to make the option machine specific.

@dkokron
Copy link
Contributor

dkokron commented Jan 10, 2025

I'm not part of the RRFS team.

@MatthewPyle-NOAA
Copy link
Collaborator

@jkbk2004 Making it a machine specific option (OFF everywhere but for WCOSS2) sounds good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants