Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add testmods to fates regression test to increase coverage and include long-term exact restart fix #6018

Merged

Conversation

glemieux
Copy link
Contributor

@glemieux glemieux commented Oct 24, 2023

This pull requests updates the ed_update_site call in elmfates_interfacemod to pass a flag for when this procedure is called during restart. This update should be coordinated with NGEET/fates#1098, which addresses the long duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression test coverage to include more run mode options for fates at a variety of resolutions and runtimes.

[nonBFB] for FATES
Fixes #5548

@glemieux
Copy link
Contributor Author

glemieux commented Oct 24, 2023

Below is the current list of existing clm-fates testmods. Ideally we would have parity between host land models.

To include:

  • SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.clm-FatesColdSatPhen
  • ERS_D_Ld5.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesCold
  • ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.clm-FatesColdNoComp
  • SMS_D_Ld5.f45_f45_mg37.I2000Clm51Fates.clm-FatesCold
  • ERP_Ld9.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesColdAllVars
  • SMS_Lm6.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-Fates
  • ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesColdTreeDamage
  • ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdLogging
  • ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdSizeAgeMort
  • ERS_Ld30.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.clm-FatesColdSatPhen
  • ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesColdFixedBiogeo
  • ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdNoFire
  • ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdST3
  • ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdPPhys
  • ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdNoComp
  • ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdNoCompFixedBioGeo
  • ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdPRT2
  • [ ] ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdLandUse bring in with future PR
  • ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-Fates
  • ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.clm-FatesCold
  • ERP_Ld3.f09_g17.I2000Clm50FatesRs.clm-FatesCold
  • SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.clm-FatesCold
  • ERP_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.clm-FatesCold
  • ERP_D_P36x2_Ld3.f19_g17.I2000Clm50FatesCru.clm-FatesCold
  • SMS_D_Lm6_P144x1.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesCold

Site aliases that do not work currently (see #4718):

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.clm-FatesFireLightningPopDens--clm-NEON/FATES/NIWO
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.clm-FatesPRISM--clm-NEON/
ERS_D_Ld15.5x5_amazon.I2000Clm50FatesRs.clm-FatesColdSeedDisp
PEM_D_Ld15.5x5_amazon.I2000Clm50FatesRs.clm-FatesColdSeedDisp
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.clm-FatesCold
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.clm-FatesCold
ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.clm-FatesCold
ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm51FatesRs.clm-FatesCold
SMS_Lm13.1x1_brazil.I2000Clm50FatesCruRsGs.clm-FatesCold
SMS_D.1x1_brazil.I2000Clm51FatesSpCruRsGs.clm-FatesColdSatPhen
SMS_D.1x1_brazil.I2000Clm51FatesSpCruRsGs.clm-FatesColdDryDepSatPhen
SMS_D.1x1_brazil.I2000Clm51FatesSpCruRsGs.clm-FatesColdMeganSatPhen
ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.clm-FatesColdHydro
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.clm-FatesCold
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.clm-FatesColdHydro

No equivalent elm-fates mode:

ERS_Lm12.1x1_brazil.I2000Clm50FatesCruRsGs.clm-FatesFireLightningPopDens

ctsm-fates specific compset:

SMS_Ld3.f09_g17.I2000Clm51FatesSpCruRsGs.clm-FatesColdSatPhen_prescribed
SMS_Lm1.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.clm-FatesColdBasic
ERP_P36x2_Ld30.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.clm-FatesColdSatPhen

No f10 resolution equivalent in E3SM?:

ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.clm-FatesCold
ERS_Lm13.f10_f10_mg37.I2000Clm50Fates.clm-FatesCold
SMS_D_Ld5.f10_f10_mg37.I2000Clm45Fates.clm-FatesCold
SMS_Ld5.f10_f10_mg37.I2000Clm45Fates.clm-FatesCold
SMS_Ld5.f10_f10_mg37.I2000Clm50FatesRs.clm-FatesCold
SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.clm-FatesCold
ERS_Ld9.f10_f10_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdCH4Off

Has duplicate testmod with slightly different ctsm compset already included above:

ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdFixedBiogeo
ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.clm-FatesColdSizeAgeMort
ERP_Ld9.f45_f45_mg37.I2000Clm50FatesCruRsGs.clm-FatesColdAllVars
ERS_Ld5.f19_g17.I2000Clm45Fates.clm-FatesCold

@glemieux glemieux changed the title [WIP] Add testmods to fates regression test suite to increase FATES run mode coverage [WIP] Add testmods to fates regression test to increase coverage and include long-term exact restart fix Nov 2, 2023
@glemieux
Copy link
Contributor Author

glemieux commented Nov 3, 2023

Initial testing with a truncated set of the above list is showing good results. Only two tests failed RUN: the Lm13 test simply hit the wallclock limit. The fates_prt2 test mod resulted in a segfault due to an invalid memory reference.

@glemieux
Copy link
Contributor Author

glemieux commented Nov 13, 2023

UPDATE: we'll take care of this in a different PR

@glemieux
Copy link
Contributor Author

@rljacob I noticed from the email listserv that there is consideration for removing support for ne4 and ne11. FATES in some cases may want to have support for long test runs (e,g, Lm25), with coarse grids for the time being. I've been trying to get f10 working, but without luck. Perhaps fates could use the ne4 grid in this place?

@rljacob
Copy link
Member

rljacob commented Nov 13, 2023

@glemieux I've updated that and we'll keep ne4 but for an atmosphere only test. You should try ne4pg2 which is the finite volume version of the plain "ne4" and will be our standard low resolution grid.

@glemieux glemieux changed the title [WIP] Add testmods to fates regression test to increase coverage and include long-term exact restart fix Add testmods to fates regression test to increase coverage and include long-term exact restart fix Nov 15, 2023
@glemieux
Copy link
Contributor Author

Adding a note that given the lack of a supported single site, we're not including fates hydro tests yet.

@glemieux
Copy link
Contributor Author

glemieux commented Nov 22, 2023

@peterdschwartz @bishtgautam this PR is ready for review. In particular, @rgknox and I would like know what you think about the fates testmods that we added to the e3sm_land_developer test list.

As an aside, I'm wondering if there might be support for addressing #4718 such that we could add 1x1_brazil as a supported resolution eventually.

@glemieux
Copy link
Contributor Author

I'm noticing an interesting RMS diff with the fates_cold_allvars ERP testmod results. It fails COMPARE_base_rest, but only on a single output:

 SOIL2N_TNDNCY_VERT_TRANS   (lon,lat,levdcmp,time)  t_index =     10    10
          1    49680  (    30,    19,     2,     1) (    30,    19,     1,     1) (    11,    24,     1,     1) (    11,    24,     1,     1)
               21990   9.866014255566213E-12         -1.555327221181013E-11 1.7E-24 -1.576961273444979E-17 4.8E-12 -1.576961273444979E-17
               21990   9.866014255566213E-12         -1.555327221181013E-11         -1.576961438881101E-17         -1.576961438881101E-17
               49680  (    30,    19,     2,     1) (    30,    19,     1,     1)
          avg abs field values:    1.132267023196132E-13    rms diff: 0.0E+00   avg rel diff(npos):  4.8E-12
                                   1.132267023196132E-13                        avg decimal digits(ndif):  7.0 worst:  7.0
 RMS SOIL2N_TNDNCY_VERT_TRANS         0.0000E+00            NORMALIZED  0.0000E+00

Looking through the e3sm_land_developer test output, it appears that the only other testmod that outputs this variable is the SMS smallville test.

@rgknox is this something to be concerned about?

@glemieux
Copy link
Contributor Author

Waiting on perlmutter to come back up to check the status of the ERS version of this test to see if it is a general E3SM issue or specific to fates.

@rljacob
Copy link
Member

rljacob commented Dec 7, 2023

notes: perlmutter back up so can work on it again.

@glemieux
Copy link
Contributor Author

@ckoven and @rgknox the results from the elm-only tests that I've run don't show differences in the exact restart, so I'm currently thinking that this specific to fates. Inspecting the files, it looks like the difference is in one grid cell, for the first levdcmp dimension only (at least for this short nine day test).

@glemieux
Copy link
Contributor Author

@peterdschwartz I've worked around issue #6125 by changing the grid resolution for the fates_cold_allvars testmod where we were seeing the odd exact restart issue. I've rebased against master and I'm rerunning comparisons against the that baseline on perlmutter currently. Assuming this comes back clean, this should be ready to go.

@rljacob
Copy link
Member

rljacob commented Jan 4, 2024

@peterdschwartz this is ready for review and merge

Adds multiple fates testmods to provide a more comprehensive set of
cases to exercise a variety fates run modes.  The fates test list is
expanded to encompass all the new testmods.  The elm land developer
test list has been updated with a subset of these new fates tesmods to
provide more baseline coverage.  This also updates all fates tests to
run as double precision.

Finally, this commit provides a minor FATES API update to incorporate a
fates fix to an issue seen with long-duration exact restarts.
@glemieux
Copy link
Contributor Author

glemieux commented Jan 4, 2024

Final testing of e3sm_land_developer suite on perlmutter is underway. Should have results later today.

UPDATE: a bunch of tests still sitting in queue. Should have updates tomorrow morning.

@glemieux
Copy link
Contributor Author

glemieux commented Jan 5, 2024

Regression testing on perlmutter against the master baseline is returning b4b for all tests. The new fates testmods that have been added return an expected BFAIL during comparison since they do not have an official master baseline yet.

@peterdschwartz
Copy link
Contributor

Testing on chrysalis after merging and the ERP_Ld15.ne4pg2_ne4pg2.IELMFATES.chrysalis_intel.elm-fates_cold_allvars test has a runtime failure

 13:  decompInit_lnd(): Number of clumps exceeds number of land grid cells
 13:          256         211
 13:  ENDRUN:
 13:  ERROR in decompInitMod.F90 at line 183
 13:
 13:
 13:
 13:
 13:
 13:
 13:  ERROR: Unknown error submitted to shr_abort_abort.

@peterdschwartz
Copy link
Contributor

peterdschwartz commented Jan 17, 2024

@glemieux Looks like this test is the only fates test with nthreads=2.

@glemieux
Copy link
Contributor Author

glemieux commented Jan 17, 2024

@peterdschwartz Good catch. I missed that #4283 has been an outstanding issue. I wonder why this is passing on perlmutter but not on chrysalis.

@rgknox
Copy link
Contributor

rgknox commented Jan 17, 2024

The clump decomposition should be the same for fates and the base land model... Maybe something is getting bypassed or missed when the use_fates switch is on that would had otherwise ran for non-fates ELM. It might be work running this same test without FATES to see if the same error triggers.

@glemieux
Copy link
Contributor Author

@peterdschwartz I'll update the offending testmod to run with single threading for now. I should be able to update this today.

@peterdschwartz
Copy link
Contributor

@peterdschwartz I'll update the offending testmod to run with single threading for now. I should be able to update this today.

Great! also, it doesn't appear to have been using fates at all. Likely because the include_user_mod file was missing from the testdef

@glemieux
Copy link
Contributor Author

glemieux commented Jan 22, 2024

Great! also, it doesn't appear to have been using fates at all. Likely because the include_user_mod file was missing from the testdef

That was intentional since the fates testmod clears the history tapes, but we don't want that for fates_cold_allvars. The single threading update was made via d8951c0. I tested the fates_cold_allvars again on perlmutter and confirmed that it was single threaded.

@peterdschwartz is there any other testing from me that is necessary?

@peterdschwartz
Copy link
Contributor

Test passed on chrysalis as well so this should be ready to merge to next

peterdschwartz added a commit that referenced this pull request Jan 30, 2024
… next (PR #6018)

This pull requests updates the ed_update_site call in elmfates_interfacemod to pass a flag for when this procedure is called during restart.
This update should be coordinated with NGEET/fates#1098, which addresses the long duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression test coverage to include
more run mode options for fates at a variety of resolutions and runtimes.

[non-BFB] for FATES
Fixes #5548
@peterdschwartz
Copy link
Contributor

Merged to next.

@peterdschwartz
Copy link
Contributor

peterdschwartz commented Jan 31, 2024

@glemieux fates_cold_allvars is failing due to a NETCDF error on pm-cpu and mappy (chrysalis seems ok, based on manual inspection).

PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 3780) to file (./ERP_Ld15.ne4pg2_ne4pg2.IELMFATES.pm-cpu_intel.elm-fates_cold_allvars.C.JNextIntegration20240130_205133.elm.h0.0001-01-01-00000.nc, ncid=53) using PIO_IOTYPE_PNETCDF iotype failed. 
Non blocking write for variable (CMASS_BALANCE_ERROR, varid=76) failed (Number of subarray requests/regions=1, Size of data local to this process = 192). 
NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u2/e/e3smtest/jenkins/workspace/ACME_Perlmutter_next_intel/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 395)

Backtrace points to call ncd_pio_closefile(nfid(t))

@peterdschwartz
Copy link
Contributor

@glemieux Do you still have directory of a successfully running this case on pm-cpu? Might be helpful to compare the software environments to see if that is the issue.

@glemieux
Copy link
Contributor Author

@peterdschwartz here's the location on perlmutter where I retested fates_cold_allvars after updating to single threading.

perlmutter location: /pscratch/sd/g/glemieux/e3sm-tests/pr6018-singlethread.fates.pm-cpu.gnu.Cd8951c03c6-F44ba97a8

@rgknox
Copy link
Contributor

rgknox commented Jan 31, 2024

I wonder if the "numeric conversion not representable" is coming up because a nan or inf is in the array somewhere. We initialize some fates variables with nan.

@peterdschwartz
Copy link
Contributor

peterdschwartz commented Jan 31, 2024

The debug run didn't report anything different, but manually zeroing out grc_cs%errcb after error checking does allow the test to PASS. So, doing another run checking for NaNs, then infs I suppose

@peterdschwartz
Copy link
Contributor

peterdschwartz commented Jan 31, 2024

@rljacob @rgknox
We have a fix for this test, and will re-merge this PR tonight with it. The test only FAILed when merged to next and conflicts from PR #6058 were identified .

While debugging, I noticed the carbon balance checking algorithm in ELM is wrong. This is an error independent from this PR and fixing it is likely non-BFB for some cases, so I will make an Issue and work on a PR for that bug separately.

Moved from col_cf_setvalues to col_cf_zero_forfates_veg
peterdschwartz added a commit that referenced this pull request Jan 31, 2024
Re-merged to fix fates_cold_allvars test
@peterdschwartz
Copy link
Contributor

Re-merged to next

@peterdschwartz peterdschwartz merged commit 318d612 into E3SM-Project:master Feb 1, 2024
3 checks passed
@peterdschwartz
Copy link
Contributor

merged to master and bless requests submitted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expanding test coverage of ELM-FATES run modes
4 participants