Add GP ORTs #946

Merged
merged 15 commits into from
Dec 9, 2021

Conversation

Collaborator

@dustinswales dustinswales commented Dec 6, 2021

Attention

This PR was merged with the wrong fv3atm submodule pointer hash. The hash referenced in this branch (cf0a73180b2d9ac55ebfce4785a7270d205423db) resides (or resided) in @dustinswales's fv3atm fork. The correct fv3atm hash in the NOAA-EMC repository would have been 86d4bb3.

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

This PR contains new operational regression tests for RRTMGP in physics prototype 7.

Issue(s) addressed

See NCAR/ccpp-physics#782 for description of issues.

Testing

Testing at commit b707db5 on Cheyenne against baseline develop-20211203 shows the following:

GNU: all jobs ran to completion. The following two tests failed comparison (as expected):

control_rrtmgp
control_rrtmgp_debug

INTEL: All non-RRTMGP tests ran to completion and passed. The following test ran to completion but failed comparison (as expected):

control_rrtmgp_debug

The following jobs compiled but failed at runtime:

cpld_control_p7_rrtmgp
control_p7_rrtmgp
regional_RRTMGP
control_rrtmgp
control_rrtmgp_c192

These were tested on Intel and gnu on Hera.
This PR contains new tests to be added to tests/rt.conf.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

@DeniseWorthen
Collaborator

Please update to the last commit for FV3 and ufs-weather.

@dustinswales
Collaborator Author

@DeniseWorthen
Should be all up-to-date now


export FV3_RUN=control_run.IN
export CCPP_SUITE=FV3_GFS_v16_p7_rrtmgp
export NEW_DIAGTABLE=diag_table_gfsv16_merra2
Collaborator

This needs to be consistent with the newest control_p7 test (NEW_DIAGTABLE -> DIAGTABLE).

Collaborator Author

Done.
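
The rename requested in this review thread can also be checked mechanically. The following is a minimal Python sketch, assuming test definition files consist of plain `export NAME=value` lines like the snippet quoted above; the helper name `find_legacy_exports` and the way the sample text is assembled are hypothetical, not part of the ufs-weather-model tooling.

```python
import re

# Matches lines such as "export NEW_DIAGTABLE=diag_table_gfsv16_merra2".
EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def find_legacy_exports(text, renames):
    """Return (line_number, old_name, new_name) for every export that still
    uses a name listed in `renames` (a dict mapping old name to new name)."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = EXPORT_RE.match(line)
        if match and match.group(1) in renames:
            hits.append((lineno, match.group(1), renames[match.group(1)]))
    return hits


# The three export lines quoted in the review comment.
sample = """export FV3_RUN=control_run.IN
export CCPP_SUITE=FV3_GFS_v16_p7_rrtmgp
export NEW_DIAGTABLE=diag_table_gfsv16_merra2
"""

legacy = find_legacy_exports(sample, {"NEW_DIAGTABLE": "DIAGTABLE"})
for lineno, old, new in legacy:
    print(f"line {lineno}: rename {old} to {new}")
```

A check like this only flags the stale variable name; whether the value itself is still appropriate for the test has to be reviewed by hand.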

@DeniseWorthen
Collaborator

Could you update your PR information for the CCPP-physics PR? It is showing as closed.

Also, is there an issue in either FV3 or CCPP that you can list in the 'issues addressed' section?

@dustinswales
Collaborator Author

@DeniseWorthen
I updated the ccpp-physics link in the PR.
There are no open issues to reference.

@MinsukJi-NOAA
Contributor

MinsukJi-NOAA commented Dec 6, 2021

I am not able to build control_p7_rrtmgp and cpld_control_p7_rrtmgp with the following errors:

1403:/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/FV3/ccpp/data/GFS_typedefs.F90(2109): error #6459: This field name has been defined more than once within the same structure.   [SCMPSW]
1406:/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/FV3/ccpp/data/GFS_typedefs.F90(2109): error #6227: This symbol has multiple POINTER statement/attribute declarations which is not allowed.   [SCMPSW]
1409:/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/FV3/ccpp/data/GFS_typedefs.F90(2028): error #7367: The data value NULL() can only be assigned to a Fortran POINTER.
1412:/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/FV3/ccpp/data/GFS_typedefs.F90(2109): error #7367: The data value NULL() can only be assigned to a Fortran POINTER.
1415:/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/FV3/ccpp/data/GFS_typedefs.F90(7570): error #6158: The structure-name is invalid or is missing.   [INTERSTITIAL]
1419:make[2]: *** [FV3/ccpp/physics/CMakeFiles/ccpp_physics.dir/__/data/GFS_typedefs.F90.o] Error 1
1422:make[1]: *** [FV3/ccpp/physics/CMakeFiles/ccpp_physics.dir/all] Error 2
1424:make: *** [all] Error 2

@dustinswales
Collaborator Author

@MinsukJi-NOAA
Sorry, there were multiple declarations/allocations of Interstitial%scmpsw in GFS_typedefs.F90.
(Not sure why the merge didn't catch this, but should be good to go now)
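
Duplicate component declarations of this kind can be caught before compiling. The following is a rough Python sketch, assuming free-form Fortran with one declaration per line and `::` in every component declaration; the sample type body is a hypothetical stand-in, not the actual contents of GFS_typedefs.F90, and `duplicate_components` is an illustrative name.

```python
import re
from collections import Counter

# "type name", "type :: name" or "type, public :: name", but not "type(name) ...".
TYPE_START = re.compile(
    r"^\s*type\b(?!\s*\()(?:\s*,[^:]*)?(?:\s*::)?\s*(\w+)\s*$", re.IGNORECASE
)
TYPE_END = re.compile(r"^\s*end\s*type\b", re.IGNORECASE)


def component_names(line):
    """Return the lower-cased names declared after '::' on one line."""
    code = line.split("!", 1)[0]              # drop a trailing comment
    if "::" not in code:
        return []
    decl = code.split("::", 1)[1]
    count = 1
    while count:                              # strip (:,:), null(), nested parens
        decl, count = re.subn(r"\([^()]*\)", "", decl)
    names = []
    for item in decl.split(","):
        name = item.split("=", 1)[0].strip()  # drop "=> null" or "= value"
        if name:
            names.append(name.lower())
    return names


def duplicate_components(source):
    """Map each derived type to the component names it declares more than once."""
    duplicates = {}
    current, counts = None, Counter()
    for line in source.splitlines():
        if current is None:
            start = TYPE_START.match(line.split("!", 1)[0])
            if start:
                current, counts = start.group(1).lower(), Counter()
        elif TYPE_END.match(line):
            repeated = sorted(n for n, c in counts.items() if c > 1)
            if repeated:
                duplicates[current] = repeated
            current = None
        else:
            counts.update(component_names(line))
    return duplicates


# Hypothetical excerpt; only the repeated scmpsw line mirrors the reported error.
sample = """
type GFS_interstitial_type
  real(kind_phys), pointer :: sfcalb(:,:) => null()
  real(kind_phys), pointer :: scmpsw(:)   => null()
  integer                  :: nbdlw, nbdsw
  real(kind_phys), pointer :: scmpsw(:)   => null()  ! duplicate
end type GFS_interstitial_type
"""

result = duplicate_components(sample)
print(result)
```

The sketch does not handle continuation lines or preprocessor branches, so a component legitimately declared once under each of two `#ifdef` branches would be reported as a duplicate.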

@MinsukJi-NOAA
Contributor

@dustinswales Could you grab the ORT results for control_p7_rrtmgp (/scratch2/NCEPDEV/stmp1/Minsuk.Ji/PR946_2/tests/OpnReqTests_hera.intel.log) and commit? ORT for cpld_control_p7_rrtmgp cannot be run at this time due to wave component.

@dustinswales
Collaborator Author

@MinsukJi-NOAA
NP. done

@DeniseWorthen added the "Baseline Updates" label (Current baselines will be updated.) Dec 6, 2021
@DeniseWorthen
Collaborator

Please update BL_DATE in rt.sh to 20211206.
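
A baseline date bump like this is a one-line edit that is easy to script. The following is a minimal Python sketch, assuming rt.sh contains exactly one assignment of the form `BL_DATE=YYYYMMDD`; the layout of rt.sh is not shown in this thread, so both that form and the sample script text are assumptions, and `set_bl_date` is a hypothetical helper.

```python
import re

# Assumed form of the assignment: optional "export", then BL_DATE=YYYYMMDD.
BL_DATE_RE = re.compile(
    r"^([ \t]*(?:export[ \t]+)?BL_DATE=)\d{8}[ \t]*$", re.MULTILINE
)


def set_bl_date(script_text, new_date):
    """Return script_text with its single BL_DATE assignment set to new_date."""
    if not re.fullmatch(r"\d{8}", new_date):
        raise ValueError(f"expected YYYYMMDD, got {new_date!r}")
    updated, count = BL_DATE_RE.subn(lambda m: m.group(1) + new_date, script_text)
    if count != 1:
        raise ValueError(f"expected exactly one BL_DATE assignment, found {count}")
    return updated


# Hypothetical stand-in for the relevant part of rt.sh.
sample = '#!/bin/bash\nBL_DATE=20211203\necho "baseline date: ${BL_DATE}"\n'
updated = set_bl_date(sample, "20211206")
print(updated)
```

Failing when the assignment is missing or repeated is deliberate: a silent no-op would leave the regression tests comparing against the old baseline.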

@climbfuji
Collaborator

The weird thing is that the same setup runs in DEBUG mode, but not in PROD or REPRO mode. I'll use the parallel debugger tomorrow morning to see why it is failing in that place.

When I run this through allinea ddt, I get weird values for gptS, see attached screenshot.

Screen Shot 2021-12-08 at 6 38 14 AM

@RobertPincus

@DomHeinzeller That is certainly an error. It suggests corrupted memory, I guess.

@BrianCurtis-NOAA
Collaborator

@DomHeinzeller is -fsanitize=address set in the compiler options? that may help? Although I've seen weird integers like that even when the build passes with sanitize.

@climbfuji
Collaborator

We don't have this compiler flag yet, will try manually.

This is the screenshot for exactly the same run, same place, when compiled in DEBUG mode. The values look good.

Screen Shot 2021-12-08 at 7 34 35 AM

@DeniseWorthen
Collaborator

@dustinswales would you please merge Dom's PR to update the rt.conf? Thanks.

@climbfuji
Collaborator

climbfuji commented Dec 8, 2021

@dustinswales I have a little more info. The repro mode is apparently broken due to previous build system updates (recall that we don't really test the repro mode ... a flaw). When I run in prod mode, I get more information on the variables. I stopped the model in DDT before the offending line. The screenshot shows that jeta(2) has one bogus value (for reasons I do not know yet).

Screen Shot 2021-12-08 at 10 10 32 AM

Also look at variables fmajor and k ...

Screen Shot 2021-12-08 at 10 15 04 AM

@RobertPincus

@DomHeinzeller @dustinswales Variable jeta is set on line 104 of the same file. Tracing up, you can see there should be no way to get a negative value. I don't know what to make of the debugger not being able to read fmajor.

@DomHeinzeller
Contributor

DomHeinzeller commented Dec 8, 2021 via email

@RobertPincus

I guess too, but who? And why would someone do such a thing 😢

@RobertPincus

@DomHeinzeller In more seriousness, these variables are all internal to RRTMGP. The big tables like kmajor are private components of class structures. The smaller variables like jeta are internal to the subroutines.

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: jet
Compiler: intel
Job: RT
Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/796090531/20211208161511/ufs-weather-model
Please manually delete: /lfs4/HFIP/h-nems/emc.nemspara/RT_RUNDIRS/emc.nemspara/FV3_RT/rt_187200
Test control_stochy 026 failed in check_result failed
Test control_stochy 026 failed in run_test failed
Please make changes and add the following label back:
jet-intel-RT

@DeniseWorthen
Collaborator

@dustinswales Logs are not being posted from any of the RTs because they can't get pushed to your fork. I'll collect them in one place for you to commit.

The cheyenne.intel baseline was created, but I'm not sure whether the verify ran or where it is located. I may just run it manually if I can't find it.

@DeniseWorthen
Collaborator

The jet.intel failure of the control_stochy test was another case of the atmf000.nc file not comparing (same file as the control_csawmg_debug failure seen earlier).

@DeniseWorthen
Collaborator

@dustinswales Please commit the logs in /scratch2/NCEPDEV/stmp1/Denise.Worthen/PR946

@junwang-noaa junwang-noaa merged commit f20ac76 into ufs-community:develop Dec 9, 2021
@DusanJovic-NOAA
Collaborator

I am getting a build error on my laptop when compiling mo_gas_optics_kernels:

[1/1075] Generating Fortran dyndep file FV3/ccpp/physics/CMakeFiles/ccpp_physics.dir/Fortran.dd
ninja: build stopped: multiple rules generate FV3/ccpp/physics/mo_gas_optics_kernels.mod.

In CMakeLists.txt I see two mo_gas_optics_kernels.F90 files being compiled; both define the same module (mo_gas_optics_kernels):

$ grep mo_gas_optics_kernels FV3/ccpp/physics/CMakeLists.txt 
                       ${LOCAL_CURRENT_SOURCE_DIR}/physics/rte-rrtmgp/rrtmgp/kernels-openacc/mo_gas_optics_kernels.F90
                       ${LOCAL_CURRENT_SOURCE_DIR}/physics/rte-rrtmgp/rrtmgp/kernels/mo_gas_optics_kernels.F90

Could this be the reason for the crash on Cheyenne?

@climbfuji
Collaborator

Definitely, this could be the problem. Need to figure out why this happens. Thanks for pointing this out!
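
The grep shown above generalizes to a scan for any source file name that is listed twice. The following is a minimal Python sketch, assuming each Fortran source defines a module named after its file, as is the case for the two mo_gas_optics_kernels.F90 entries; `duplicate_basenames` is a hypothetical helper, and the sample text is just the two lines from the grep output.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def fortran_sources(cmake_text):
    """Yield tokens in CMake text that look like Fortran source paths."""
    for line in cmake_text.splitlines():
        code = line.split("#", 1)[0]          # drop CMake comments
        for token in re.split(r"[\s()\"]+", code):
            if token.lower().endswith((".f90", ".f")):
                yield token


def duplicate_basenames(cmake_text):
    """Map each lower-cased file name listed more than once to its paths."""
    groups = defaultdict(list)
    for path in fortran_sources(cmake_text):
        groups[PurePosixPath(path).name.lower()].append(path)
    return {name: paths for name, paths in groups.items() if len(paths) > 1}


# The two entries reported by the grep in this thread.
sample = """
    ${LOCAL_CURRENT_SOURCE_DIR}/physics/rte-rrtmgp/rrtmgp/kernels-openacc/mo_gas_optics_kernels.F90
    ${LOCAL_CURRENT_SOURCE_DIR}/physics/rte-rrtmgp/rrtmgp/kernels/mo_gas_optics_kernels.F90
"""

dups = duplicate_basenames(sample)
for name, paths in dups.items():
    print(name)
    for path in paths:
        print("   ", path)
```

Matching on file names is only a heuristic: two differently named files can still define the same module, and catching that case would require parsing the `module` statements themselves.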

@dustinswales dustinswales deleted the add_GPorts branch February 25, 2022 15:33
Labels
Baseline Updates Current baselines will be updated.
9 participants