Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for transient Smallville tests #1673 + testing all new datasets #2318

Merged
merged 56 commits into from
Feb 16, 2024

Conversation

slevis-lmwg
Copy link
Contributor

@slevis-lmwg slevis-lmwg commented Jan 12, 2024

alpha-ctsm5.2.mksrf.20_ctsm5.1.dev163

Script for generating landuse.timeseries and fsurdat files for Smallville dynLakes, dynUrban, and dynPft tests.

Specific notes

Contributors other than yourself, if any:
@ekluzek

CTSM Issues Fixed (include github issue #):
Partly addresses, does not fix #1673
Fixes #2218, though unrelated
Fixes #2319

Are answers expected to change (and if so in what way)?
Not relevant.

Any User Interface Changes (namelist or namelist defaults changes)?

Testing performed, if any:
Tests pointing to the new landuse.timeseries and fsurdat files passed (details posted in #1673)

This generates files for dynLakes, dynUrban, and dynPft tests.
@slevis-lmwg slevis-lmwg self-assigned this Jan 12, 2024
@slevis-lmwg slevis-lmwg added blocked: dependency Wait to work on this until dependency is resolved PR status: work in progress testing additions or changes to tests blocker another issue/PR depends on this one and removed blocked: dependency Wait to work on this until dependency is resolved labels Jan 12, 2024
@slevis-lmwg
Copy link
Contributor Author

I marked this PR as "blocker" to #1903.

@slevis-lmwg slevis-lmwg added the priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations label Jan 12, 2024
@slevis-lmwg slevis-lmwg added this to the ctsm5.2.0 milestone Jan 12, 2024
@@ -1314,7 +1314,7 @@ lnd/clm2/surfdata_esmf/ctsm5.2.0/surfdata_10x15_hist_78pfts_CMIP6_1850_c230517.n
<fsurdat hgrid="4x5" sim_year="1850" use_crop=".true." >
lnd/clm2/surfdata_esmf/ctsm5.2.0/surfdata_4x5_hist_78pfts_CMIP6_1850_c230517.nc</fsurdat>
<fsurdat hgrid="1x1_smallvilleIA" sim_year="1850" >
/glade/work/erik/ctsm_worktrees/ctsm5.2.mksurfdata/tools/mksurfdata_esmf/surfdata_1x1_smallvilleIA_hist_78pfts_CMIP6_1850_c230726.nc
/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/surfdata_esmf/ctsm5.2.0/surfdata_1x1_smallvilleIA_hist_78pfts_CMIP6_1850-2015_c240103.nc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this one. I left the equivalent 2000 file unchanged for now, because this PR does not touch it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and we will be renaming these files before they are final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From review with @ekluzek
Remove /glade/campaign/.../inputdata/ here and possibly elsewhere.

@@ -1426,7 +1426,7 @@ use_crop=".true.">lnd/clm2/surfdata_map/ctsm5.1.dev052/landuse.timeseries_mpasa1
<!-- Note that this transient file only goes up to year 1855.
This is sufficient for testing purposes, but could cause problems if you try to run beyond 1855. -->
<flanduse_timeseries hgrid="1x1_smallvilleIA" sim_year_range="1850-2000"
use_crop=".true." >lnd/clm2/surfdata_map/landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_c160127.nc</flanduse_timeseries>
use_crop=".true." >lnd/clm2/surfdata_esmf/ctsm5.2.0/landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-1855_dynPft_c240103.nc</flanduse_timeseries>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was the correct place to put the new dynPft file, though I'm not 100% sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the right place. We just want to put it alongside the other standard files. No need for it to be in some non-standard place.

Tested: modify_smallville.sh
Not tested: Makefile
@slevis-lmwg

This comment was marked as resolved.

if (.not. readvar) call endrun( msg=' ERROR: PCT_OCEAN NOT on surfdata file'//errMsg(sourcefile, __LINE__))
if (.not. readvar) call endrun( msg= &
' ERROR: PCT_OCEAN NOT on surfdata file but required when running ctsm5.2 or newer; ' // &
' you are advised to generate a new surfdata file using the mksurfdata_esmf tool ' // errMsg(sourcefile, __LINE__))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, this edit fixes #2218, which also needs to come in sooner rather than later.

Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Open new issue about fsurdat versioning and put in IF TIME ALLOWS and add "next"

@@ -46,6 +46,7 @@ SUBSETDATA_1X1_SMALL := --lat 40.6878 --lon 267.0228 --site 1x1_smallvilleIA \
--pctpft 6.5 1.5 1.6 1.7 1.8 1.9 1.5 1.6 1.7 1.8 1.9 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
# NOTE: The 1850 smallvilleIA site is constructed to start with 100% natural vegetation, so we can test transition to crops
SUBSETDATA_1X1_SMALL1850 := --lat 40.6878 --lon 267.0228 --site 1x1_smallvilleIA --dompft 13 --pctpft 100
SUBSETDATA_1X1_SMALLTRANSIENT := --lat 40.6878 --lon 267.0228 --site 1x1_smallvilleIA
Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ekluzek's would-be-nice suggestion:
Consolidate SUBSETDATA_1X1_SMALL
then separate 1850 and TRANSIENT later.

Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the Makefile working has likely resulted in additional clean-up issues, so I'm leaving this unresolved in favor of a more general clean-up in the Makefile.

@ekluzek agreed with me on this. I will remove the empty checkbox to avoid confusion.

# modify_smallville.sh to generate three modified landuse.timeseries files needed for testing.
crop-smallville-transient : FORCE
$(SUBSETDATA_POINT) --create-landuse $(SUBSETDATA_1X1_SMALLTRANSIENT)
../modify_input_files/modify_smallville.sh
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 214 needs tab(s) like the line above it.

@ekluzek
Copy link
Collaborator

ekluzek commented Jan 19, 2024

@slevis-lmwg is this ready to come into the ctsm5.2 branch? From our discussion you need to get the makefile working and test it all out, but it's mostly ready.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Jan 20, 2024

@slevis-lmwg is this ready to come into the ctsm5.2 branch? From our discussion you need to get the makefile working and test it all out, but it's mostly ready.

Yes, mostly ready. Two things:

  • I want to test the Makefile with this PR's changes. Run make all-subset for this.
  • I would like to update the Makefile with all the grids that we need to generate. I might as well do that now because this is the last PR before I will generate all the datasets. Let me know if you disagree.

When I have updated the Makefile, look at #2201 for guidance on building mksurdata_esmf.

@ekluzek
Copy link
Collaborator

ekluzek commented Jan 23, 2024

@slevis-lmwg I merged the dev163 update to the ctsm5.2 branch, so you should update to it when you are ready.

The tag was: alpha-ctsm5.2.mksrf.19_ctsm5.1.dev163

@slevis-lmwg
Copy link
Contributor Author

This post in #2327 will be helpful in updating the Makefile.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Feb 7, 2024

TODO Run the NEON tests from the test-suites to confirm that updates work from the last couple of commits.

izumi PASS
derecho: PASS

@slevis-lmwg
Copy link
Contributor Author

End of day tests on derecho (where things seems stuck in the queue right now):

  • submitted aux_clm
  • submitted ctsm_sci
  • submitted three standalone tests marked as PEND in the long earlier post above.

@@ -2346,20 +2346,6 @@
<option name="comment" >Add at least one test of a FATES NEON site with PRISM precipitation</option>
</options>
</test>
<test name="SMS_D_Lm1_Mmpi-serial" grid="CLM_USRDAT" compset="I1PtClm50SpRs" testmods="clm/USUMB_nuopc">
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slevis-lmwg I'd like you to reconsider this. The purpose of this one is to run a generic tower site and show that we can continue to do it. And that is a capability that we want to continue moving forward. NEON is highly curated so it doesn't provide a good example, and PLUMBER2 will be similar. But, we still want to show that someone can set something up for any random tower site.

Now, as I think about this, this site is setup to demonstrate PTCLM which we've removed. So it should be redone to use subset data and be based on it. That would be a good demonstration for this and for our current workflow.

I'll create an issue for this, and we can decide how this should happen. It wouldn't be hard to add to CTSM5.2, but it could also happen after CTSM5.2...

Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing how this should happen sounds good to me @ekluzek.

I'm encountering some other errors that I probably need your help with. I will start with these two:

  • The next two fail in MODEL_BUILD as "intel" but I expect them to work as "gnu" based on a third test that behaved the same way:
ERS_D_Ld5_Mmpi-serial.1x1_mexicocityMEX.I1PtClm51SpRs.derecho_intel.clm-CLM1PTStartDate
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm51FatesRs.derecho_intel.clm-FatesCold

The error says:

ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'

In the first test, I see "fi" in
/glade/work/slevis/git/mksurfdata_toolchain/cime_config/testdefs/testmods_dirs/clm/CLM1PTStartDate/shell_commands
Is that the "fi" that it's complaining about?

  • Then we have
FAIL SMS_Ld12_Mmpi-serial.1x1_urbanc_alpha.I1PtClm51SpRs.derecho_intel.clm-output_sp_highfreq RUN

failing with this error:
Abort with message NetCDF: Variable not found in file /glade/derecho/scratch/csgteam/temp/spack/derecho/23.09/builds/spack-stage-parallelio-2.6.2-bpi7h2bnkshocep4hl3drfik4ebj44iu/spack-src/src/clib/pio_nc.c at line 1164
The lnd.log ends with

(shr_ndep_readnl) Read in ndep_inparm namelist from: drv_flds_in
 LND: PIO numiotasks=           1
 LND: PIO stride=           1
 LND: PIO rearranger=           2
 LND: PIO root=           1

The atm.log ends with
(shr_strdata_readstrm) reading file lb: /glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-08.nc 23

UPDATE: I'm marking this as an EXPECTED FAILURE because datm.streams.xml lists

<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-08.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-09.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-10.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-11.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0001-12.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-01.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-02.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-03.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-04.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-05.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-06.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-07.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-08.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-09.nc</file>
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-10.nc</file
<file>/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/CLM1PT_data/urbanc_alpha.c080416/clm1pt-0002-11.nc</file>

while the corresponding directory (in /glade/campaign, as well as in /glade/p) contains only the first file and the last file and none of the rest.

Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • SMS_Ld5.f09_g17.ISSP460Clm50BgcCrop.derecho_intel.clm-ciso_dec2050Start
    This one complains ERROR: No stream_entry presaero.SSP4-6.0 found'

I did not come across a relevant issue, bug, or PR in https://github.com/ESCOMP/CDEPS
I'm removing ssp4 tests, as recommended in the next post.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slevis-lmwg the first one is due to: ESMCI/ccs_config_cesm#130 So you just need to change it non-DEBUG or to gnu.

The second one (2) we need to see what file it's failing on. That might be in one of the log files, or you might see at least what file was last opening in the log files. Then we need to figure out what variable it is that it's failing to find...

On the third one (3) this means there likely isn't Prescribed aerosol files for SSP4-6.0. So we probably just shoudn't do any SSP4-6.0 tests. I would look into CDEPS to make sure this is a bug there though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I'm coming to this kind of late, but regarding the single point test that @slevis-lmwg suggested removing the USUMB test created from PTCLM. This does seem like something we should remove?

To avoid mission creep, I suggest we postpone testing for particular single point features to a PR that involves the development of single point features (NEON, PLUMBER2, or others), but defer to @ekluzek and @slevis-lmwg, as you're more deeply involved here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, if it's experimental, yes we should remove it from our testing, and label it as expected failure. We shouldn't worry about testing it, until they can show us what to do. And I don't want to have a test that requires a 100 nodes of Derecho as just getting through the queue to run takes too long and is expensive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dying in malloc, so it's definitely a memory issue. So it's going to take some work to figure it out. It's at initialization so it might be initialization needs more memory than at run-time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slevis-lmwg we should record this as an issue, so that people working on this can reference. And we'll be reminded of it being a problem. Could you start an issue for this?

Copy link
Contributor Author

@slevis-lmwg slevis-lmwg Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Yes. TODO open issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...]
Next, I think found a bug in /ccs_config/component_grids_nuopc.xml. I had to change this line

-  <domain name="mpasa15-3conus">
+  <domain name="mpasa15-3">

for the case to find the grid's mesh file. I'm making the following TODO checkbox:

  • Open another issue in CDEPS or does this go somewhere else?

@ekluzek I will go ahead with this merge, as we discussed, but I wanted to bring attention to the above comment that has two aspects:

  1. Open an issue?
  2. Logistically what are the steps to resolve it?

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Feb 10, 2024

Troubleshooting the LWISO failure here:
/glade/work/slevis/git/mksurfdata_toolchain/tests_0204-153040de/LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.derecho_gnu.clm-coldStart.GC.0204-153040de_gnu/run/cesm.log...

UPDATE 1: I seem to have solved the error for ice1_grc. Now I get the same error with qflx_ice_dynbal_grc, which I later found is due to ice2_grc.
UPDATE 2: I was too optimistic in the previous update. I reached out to Bill Sacks and discussed preferred options:

  1. FAIL: Skip the endrun for ice1_grc and ice2_grc in the hopes of finding which subgrid term is the culprit. I tried that and found that the simulation actually completes successfully in that case. So on to the next ideas.
  2. FAIL: This second idea did not come up when I talked to Bill: Remove the "call truncate_small_values" for snocan in case that fixed a fluke error before, and we happen to not need that fix anymore. Bill is skeptical about the longevity of this outcome, even if it pans out.
  3. IN PROGRESS: Look at the terms making up ice_mass in TotalWaterAndHeatMod.F90's subroutines ComputeLiqIceMassNonLake and ComputeLiqIceMassLake to find the culprit. In doing so, I have found a source of negative ice_mass, likely in snocan. Depending my findings, I may advocate for eliminating negative snocan as the solution.

I found no evidence of this test having been run before when serching in
/glade/campaign/cgd/tss/ctsm_baselines/
The test's datm.streams.xml lists 16 files of which only two exist in
the corresponding datm directory (whether I look in /glade/campaign or
/glade/p), so this test possibly never worked.
@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Feb 14, 2024

LWISO failure UPDATE
Adding @billsacks because I discussed this failure with him yesterday.
The test passed when I made this one-line change in CanopyFluxesMod.F90:

@@ -1605,7 +1605,8 @@ bioms:   do f = 1, fn
          if (t_veg(p) > tfrz ) then ! above freezing, update accumulation in liqcan
             if ((qflx_evap_veg(p)-qflx_tran_veg(p))*dtime > liqcan(p)) then ! all liq evap
                ! In this case, all liqcan will evap. Take remainder from snocan
-               snocan(p)=snocan(p)+liqcan(p)+(qflx_tran_veg(p)-qflx_evap_veg(p))*dtime  
+               snocan(p) = max(0._r8, &
+                  snocan(p) + liqcan(p) + (qflx_tran_veg(p) - qflx_evap_veg(p)) * dtime)
             end if
             liqcan(p) = max(0._r8,liqcan(p)+(qflx_tran_veg(p)-qflx_evap_veg(p))*dtime)

Note:

  1. I used the line after the end if as a template for the change. Is there a reason not to use the max function for snocan as we use it for liqcan?
  2. The test passes; however, my troubleshooting reveals negative ice_mass (in the hundreds of kg/m2) in subr. ComputeLiqIceMassNonLake after subtracting dynbal_baseline_ice(c). I opened LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.derecho_gnu.clm-coldStart test failure in ctsm5.2 branch #2366 for this.

@wwieder
Copy link
Contributor

wwieder commented Feb 14, 2024

not sure where I'm supposed to reply, but if the crazy high resolution surface dataset doesn't work (and Brian can't make it work) I'd leave this as an expected failure and let someone who actually wants this dataset work on producing it. It doesn't need to be our problem right now. Was that you're question @slevis-lmwg?

@slevis-lmwg slevis-lmwg changed the title Workaround for transient Smallville tests #1673 Workaround for transient Smallville tests #1673 + testing all new datasets Feb 15, 2024
@billsacks
Copy link
Member

Thanks for your work on this @slevis-lmwg ! I'll comment in #2366 .

@slevis-lmwg slevis-lmwg marked this pull request as ready for review February 16, 2024 00:08
@slevis-lmwg slevis-lmwg merged commit cdc9cca into ESCOMP:ctsm5.2.mksurfdata Feb 16, 2024
1 of 2 checks passed
@slevis-lmwg slevis-lmwg deleted the smallville_iss1673 branch February 16, 2024 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker another issue/PR depends on this one priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations testing additions or changes to tests
Development

Successfully merging this pull request may close these issues.

4 participants