failure in ERS 3 month nocomp test #1051

Closed
mvertens opened this issue Jul 11, 2023 · 55 comments · Fixed by #1098

Comments

@mvertens

I am trying to create longer runs using the latest CTSM in NorESM configurations (with the latest CMEPS, CDEPS, etc). The following 2 tests fail restarts:
ERS_Lm3.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire

ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire

However what is interesting is that this test passes restart:
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire.20230707_114133_1m5tec/

Since the restart test fails, we cannot really start any longer runs until we resolve this problem.

@mvertens

@rosiealice @glemieux - I was not sure where to post this. The code is very close to ctsm5.1.dev129.

@rosiealice

I guess a higher-level question is whether we should push this NOCOMP compset back into CTSM to streamline the testing, given that nocomp is a pretty core use case of FATES at the moment. Should we discuss this on a FATES SE call one day (those are at 9 pm on Mondays in Norway, Mariana)?

I don't know what the difference is between these two test variants...

@glemieux

Hi @mvertens, I'll try replicating this test. As @rosiealice asked, what's the difference between the two ERS_Ld90 tests? Were they using different fates tags?

@mvertens

@glemieux - thanks so much for your quick response to this.
Sorry, I had written down the wrong test.
It should be
ERS_Ld120.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
that fails and the following passes.
ERS_Ld90.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
I also tried ERS_Ld30 and ERS_Ld60 and those passed as well.

@glemieux

glemieux commented Jul 11, 2023

I can confirm that I'm seeing similar COMPARE_base_rest failures using the following test case setup:

ERS_Lm3.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.clm-FatesColdNoComp

Here's what I checked varying the durations:

  • Lm3: fails
  • Ld90: passes
  • Ld120: fails
  • Lm6: fails

Here are other checks I did using Lm3:

  • CRUv7 versus GSWP3v1: both fail
  • FatesCold versus FatesColdNoComp: FatesCold passes
  • ctsm5.1dev126 versus ctsm5.1.dev129: both fail

This is making me wonder if #897 is actually related to this issue.

@mvertens

@glemieux - thanks for confirming this. The way I have always ironed out restart problems is to get close to where the problem is and write history files every time step, for an initial run and a restart run. I can try to narrow this down some more. Clearly, stopping somewhere between day 90 and day 120 is causing the problem. So say we know that stopping at day 100 and restarting will not work, and let's assume that just looking at the day boundary is good enough. Then, assuming that coupler history files and clm history files are written every time step, one way to debug this is the following:

  1. Do a run until day 99 and write a restart at the end.
  2. Restart from day 99, run for 50 time steps (assuming half-hour time steps), and write a restart at day 100.
  3. Save the history files.
  4. Now restart again and run for 3 time steps. Compare the history files for time steps 48-50 and see where the differences start.

I've usually been successful using this approach - I just did this to find a restart problem for a compset with ocean, ice, wave and datm. Anyway - I'm happy to help as well, and am also happy to clarify things if the explanation above has been confusing.
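For concreteness, here is a minimal sketch of the case setup this recipe implies, assuming half-hour timesteps and using the same xmlchange/user_nl_clm conventions that appear later in this thread; the values are illustrative, not a prescription:

# initial run: stop at day 99 and write a restart file there
./xmlchange STOP_OPTION=ndays,STOP_N=99
./xmlchange REST_OPTION=ndays,REST_N=99
# write coupler and CLM history every timestep
./xmlchange HIST_OPTION=nsteps,HIST_N=1
echo "hist_nhtfrq = 1" >> user_nl_clm   # positive value = frequency in timesteps
echo "hist_mfilt = 48" >> user_nl_clm
# continuation A: run 50 timesteps past day 99, writing a restart at day 100 (step 48)
./xmlchange CONTINUE_RUN=TRUE,STOP_OPTION=nsteps,STOP_N=50
./xmlchange REST_OPTION=nsteps,REST_N=48
# continuation B: restart from day 100 and run 3 timesteps, then compare the
# per-timestep history files for steps 48-50 of A against steps 1-3 of B
./xmlchange STOP_OPTION=nsteps,STOP_N=3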

@rgknox

rgknox commented Jul 13, 2023

@mvertens do you know what grid-cell is showing the diff?

Also, just to confirm, I see you have a FatesCold test that passes with Lm3. Have you found any non-nocomp tests failing, or are the failures only in nocomp configurations?

@mvertens

@rgknox - yes, that is correct. The only failures I have found are in the one nocomp configuration. I have only added 4 regression tests at this point to the NorESM CTSM fork, as a way of moving forward to run this with CAM60, CICE and BLOM.
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.betzy_intel.clm-FatesColdTreeDamage (Overall: PASS)
ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.betzy_intel.clm-FatesCold (Overall: PASS)
ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.betzy_intel.clm-FatesColdNoComp (Overall: FAIL)
FAIL ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.betzy_intel.clm-FatesColdNoComp COMPARE_base_rest
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.betzy_intel.clm-FatesColdSatPhen (Overall: PASS)

@glemieux

glemieux commented Sep 29, 2023

A bunch of us met today to brainstorm next steps on how to approach this issue. I reported that yesterday I ran a few ERS FatesColdNoComp test mods at different durations and narrowed the restart point down to somewhere between 52 and 58 days (Ld100 and Ld110, respectively); the failure shows up first in southern latitudes (New Zealand and southern Argentina). The issue presents itself right at the start-up time. There are some additional nearby gridcells that start showing issues towards the end of the run, suggesting that the issue might be more widespread in the boreal region but just swamped out by the plotting range. The latest test runs are with the latest ctsm and fates tags.

Some additional notes from the meeting:

  • Could we be missing a variable in the restart file? Is the restart using the cold start value that just happens to be extremely small?
  • Given the evidence that this is happening in the southern boreal region, could this have something to do with deciduous phenology?
  • If the restart issue appears using a fully spun-up case, we should probably accelerate addressing ESCOMP/CTSM#1277 (Create a non cold-start FATES test)

The following actions are going to be taken:

  1. @mvertens will run a restart test with a spun-up dataset from @rosiealice
  2. @rgknox will investigate running with evergreens only

We've agreed to meet in two weeks time to discuss progress.

@mvertens

mvertens commented Oct 3, 2023

I have an update on action 2.

  • I am using ctsm5.1dev130-noresm_v1 from https://github.com/NorESMhub/CTSM
  • I tried to use a spun-up CLM dataset that Rosie provided and the model crashed in FATES. I can provide that error if someone wants it.
  • I decided to simply do a 10-year spin up starting from a cold start using the following test:
    SMS.f45_f45_mg37.2000_DATM%GSWP3v1_CLM51%FATES-NOCOMP_SICE_SOCN_SROF_SGLC_SWAV_SESP.betzy_intel.clm-FatesColdNoCompNoFire
  • Starting from the restart file at year 10, I did 2 runs: a 4-timestep initial run, and a 2-timestep initial run followed by a 2-timestep restart run. I was able to show that both the restart files at timestep 3 and the history files at timestep 3 (clm2.h0.2000-01-01-05400.nc) were different in a small number of variables.
  • The history file comparison showed the following diffs:
RMS FATES_CBALANCE_ERROR             2.6008E-18            NORMALIZED  2.3165E+00
RMS FATES_DISTURBANCE_RATE_P2P       3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_DISTURBANCE_RATE_POTENTIAL 3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_DISTURBANCE_RATE_TREEFALL  3.3688E-03            NORMALIZED  2.0000E+00
RMS FATES_ERROR_EL                   2.6008E-18            NORMALIZED  2.3165E+00
RMS FATES_FRAGMENTATION_SCALER_SL    2.4246E-01            NORMALIZED  2.0358E+00
RMS FATES_PRIMARY_PATCHFUSION_ERR    8.1046E-14            NORMALIZED  2.0000E+00
  • The restart files showed the following diffs:
RMS tempavg_bgnpp                           NaN            NORMALIZED  0.0000E+00
RMS annsum_counter_ch4                      NaN            NORMALIZED  0.0000E+00
RMS tempavg_somhr                           NaN            NORMALIZED  0.0000E+00
RMS tempavg_finrw                           NaN            NORMALIZED  0.0000E+00
RMS fates_fcansno_pa                 1.6970E-03            NORMALIZED  1.7092E-07
RMS fates_solar_zenith_angle_pa      1.5994E-04            NORMALIZED  1.6109E-08
RMS fates_gnd_alb_dif                6.2440E-03            NORMALIZED  1.5925E+00
RMS fates_gnd_alb_dir                6.3061E-03            NORMALIZED  1.6566E+00
RMS fates_errfates_vec_001           9.5475E-14            NORMALIZED  5.6401E+00

I am really perplexed as to why the albedos are different. I really would need some advice as to how to proceed with this. I have not looked at FATES for several years. I am happy to add new variables to the restart files. Would another chat help?
Should we wait for Rosie to come back?

@mvdebolskiy

@rgknox can this be due to #428 and ESCOMP/CTSM#548? Do you remember what tests were performed when merging that PR?

@ckoven

ckoven commented Oct 3, 2023

@mvertens one question I have is on the time stepping that you did. Many of the FATES-specific processes only occur once per day, i.e. every 48th timestep. Was the daily timestep one of the two timesteps in your restart run? If not, that says the daily FATES code shouldn't even be triggered in the experiment you did, despite you showing differences in things like the history-reported disturbance rates. So if that is the case, it could considerably narrow the set of possible errors here.

@mvertens

mvertens commented Oct 4, 2023

@ckoven - the daily time step was not one of the two timesteps in my restart. I started at the beginning of the day and did 4 timesteps in one run and 2/2 in the restart run. I then compared the two runs at timestep 4.

@rgknox

rgknox commented Oct 4, 2023

@mvdebolskiy that could be a good lead. We had a test suite in 2018 similar to what we have now, but it may not have contained long enough tests.

@rgknox

rgknox commented Oct 4, 2023

@mvdebolskiy, I'm noticing that there is similar code doing the same thing in the PatchNorman radiation call during the restart and in the normal call sequence. Even if this is not the problem we are tackling in this thread, I think it would be better to have the zeroing and initializations contained in one routine that is called from both locations.

See: https://github.com/NGEET/fates/blob/main/main/FatesRestartInterfaceMod.F90#L3588-L3644
See: https://github.com/NGEET/fates/blob/main/biogeophys/EDSurfaceAlbedoMod.F90#L106-L164
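A minimal sketch of the consolidation being suggested here; all names below are placeholders for illustration, not the actual FATES routines or patch fields:

module patch_radiation_init_sketch
   ! Placeholder module sketching one shared zeroing routine that both the
   ! restart path (FatesRestartInterfaceMod) and the normal call sequence
   ! (EDSurfaceAlbedoMod) could call before running the Norman radiation.
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)

   type :: patch_rad_sketch
      real(r8), allocatable :: absorbed_dir(:)  ! stand-ins for the real per-patch arrays
      real(r8), allocatable :: absorbed_dif(:)
   end type patch_rad_sketch

contains

   subroutine zero_patch_radiation(cpatch)
      ! Single home for the zeroing/initialization that is currently
      ! duplicated at the two call sites linked above.
      type(patch_rad_sketch), intent(inout) :: cpatch
      cpatch%absorbed_dir(:) = 0._r8
      cpatch%absorbed_dif(:) = 0._r8
   end subroutine zero_patch_radiation

end module patch_radiation_init_sketch

Both call sites would then invoke the same helper immediately before the radiation call, so the two code paths cannot drift apart the way they appear to have here.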

@rgknox

rgknox commented Oct 4, 2023

Looks like this line is different in the restart process vs normal timestepping:
https://github.com/NGEET/fates/blob/sci.1.67.4_api.27.0.0/main/FatesRestartInterfaceMod.F90#L3630

@rgknox

rgknox commented Oct 4, 2023

I ran a test where I modified the line linked in my previous post to match that of the call in EDSurfaceAlbedoMod, but it did not remove the error. I think we are getting closer, though; the diffs that you uncovered in the restart file, @mvertens, do seem to point to something in the radiation restart.

@mvdebolskiy

@rgknox
In ED_Norman_Radiation(), the assignment starts from the first PFT:

if(currentpatch%nocomp_pft_label.ne.nocomp_bareground)then

In update_3dpatch_radiation() there is no such condition. Do I understand correctly that, when reading from the restart, the bare-ground PFT will get values from update_3dpatch_radiation() in nocomp mode, while in competition mode it will just get the CLM restart values?
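To make the asymmetry concrete, a self-contained toy of the filter in question (the type, the sentinel value and the loop are invented for the sketch; in FATES the label comes from nocomp_bareground and the loop walks the site's patch list):

program bareground_filter_sketch
   implicit none
   integer, parameter :: nocomp_bareground = 0   ! assumed sentinel value for the sketch
   type :: patch_sketch
      integer :: nocomp_pft_label
   end type patch_sketch
   type(patch_sketch) :: patches(3)
   integer :: i

   patches(:)%nocomp_pft_label = [nocomp_bareground, 1, 2]

   do i = 1, size(patches)
      ! ED_Norman_Radiation applies this condition; the question above is
      ! whether the restart-path call to update_3dpatch_radiation needs it too.
      if (patches(i)%nocomp_pft_label /= nocomp_bareground) then
         print *, 'would recompute radiation for patch', i
      else
         print *, 'skip bare-ground patch', i
      end if
   end do
end program bareground_filter_sketch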

@rosiealice

A thought. If this is a radiation code thing, does it fail the test with the new two stream scheme?

@rgknox

rgknox commented Oct 5, 2023

@rosiealice it's worth testing.
@mvdebolskiy I think you found a real bug there.
Update: I applied the filter @mvdebolskiy pointed out to the restart process, but I am still generating differences in an ERS test.

@rgknox

rgknox commented Oct 5, 2023

A couple of other things I tried:

  • Running with only evergreens: failure persists
  • Running in debug mode with gnu: failure persists

I ran the latter test because I was concerned the radiation scheme may have been pulling from indices outside of the allocation bounds on arrays, and thus mixing in uninitialized data. It was a long shot, but this was not the issue.

@ckoven

ckoven commented Oct 5, 2023

@mvertens this is probably a separate issue from the fundamental one(s) responsible for the problems here, but the history fields FATES_DISTURBANCE_RATE_P2P, FATES_DISTURBANCE_RATE_POTENTIAL, FATES_DISTURBANCE_RATE_TREEFALL that are showing up as non-bit-for-bit in your experiment despite being tied to the daily timestep are all either removed or reworked as part of #1040. I will put in code to make sure that the underlying variable for these is restarted correctly into that PR. [Edit: I have now put this in via commit 853efcf which is on the #1040 branch]

@mvdebolskiy

mvdebolskiy commented Oct 10, 2023

Small update: I've expanded on @mvertens' investigation.
I've run 2 runs from a restart file that was produced by a one-year run from cold start with compset 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV and no modifications.
The two runs have the following xml changes:

./xmlchange STOP_N=2,STOP_OPTION=ndays
./xmlchange REST_OPTION=nsteps,REST_N=2
./xmlchange RESUBMIT=0
./xmlchange RUN_REFDATE=2001-01-01
./xmlchange RUN_STARTDATE=2001-01-01

and

./xmlchange STOP_N=1,STOP_OPTION=ndays
./xmlchange REST_OPTION=nsteps,REST_N=2
./xmlchange RESUBMIT=1
./xmlchange RUN_REFDATE=2001-01-01
./xmlchange RUN_STARTDATE=2001-01-01

After looking at the difference between the 2 files, I've found that pfts1d_active is different in 1 value.

Looking further, I've found that active columns and pfts are set up by calling set_active(), which calls is_active_p() during initialization and restart. However, when FATES is running, patch%wtcol is assigned from patch%wt_ed here. The update for wtcol happens once a day.

The restart files for 2001-01-01-00000 are identical for both fates_nocomp = .true. and fates_nocomp = .false. However, subsequent restarts show differences.

For fates_nocomp = .true.:

RMS pfts1d_wtxy                      4.1023E-13            NORMALIZED  9.0097E-12
 RMS pfts1d_wtlnd                     4.2316E-13            NORMALIZED  4.0344E-12
 RMS pfts1d_wtcol                     4.2316E-13            NORMALIZED  1.5587E-12
 RMS DZSNO                            2.4262E-09            NORMALIZED  6.8734E-08
 RMS ZSNO                             2.8319E-09            NORMALIZED  1.8190E-08

And for fates_nocomp=.false.:

RMS pfts1d_active                    5.8456E-03            NORMALIZED  1.2956E-02
 RMS pfts1d_wtxy                      1.4848E-03            NORMALIZED  2.9640E-02
 RMS pfts1d_wtlnd                     1.5813E-03            NORMALIZED  1.3703E-02
 RMS pfts1d_wtcol                     1.5813E-03            NORMALIZED  5.2942E-03
 RMS DZSNO                            7.9897E-07            NORMALIZED  2.2642E-05

This is not a full list, since those are a bit long.

I will check if this happens with fates_spitfire_mode=0 tomorrow.

Forgot to mention. The values are different for around 2800 patches out of around 32k.

@mvdebolskiy

mvdebolskiy commented Oct 12, 2023

OK, reading through Fates calls during restart, I've found that there are cohorts being terminated during the reading of the restart file for the failing tests.

@rgknox

rgknox commented Oct 12, 2023

That could certainly explain the diffs. I see we call restart() in the clmfates_interfaceMod, and in that routine we call update_site(), which calls termination here:
https://github.com/NGEET/fates/blob/sci.1.67.5_api.27.0.0/main/EDMainMod.F90#L808-L809

Looking for more instances.

@rosiealice

Nice find @mvdebolskiy !

@mvdebolskiy

@rgknox More specifically: the termination happens in canopy_structure(), at call index 14, after the cohort fusion.

@rgknox

rgknox commented Oct 12, 2023

@ckoven and I were talking about the call to canopy_structure(), and in theory we shouldn't need that call during the restart read. There is no need to perform promotions/demotions during a restart, nor terminations. The information on canopy position and such could be retained in the restart file.
I am curious, though, why it would be terminating here; one would think that the cohort being terminated would have already been terminated prior to the restart write.
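A sketch of the pattern being discussed; the argument and routine names are assumed for illustration, and the stubs stand in for the real FATES routines:

module site_update_restart_sketch
   implicit none
contains

   subroutine ed_update_site_sketch(is_restart_read)
      ! Stand-in for ed_update_site: the real routine takes the site and
      ! boundary-condition structures; only the flag logic is sketched here.
      logical, intent(in) :: is_restart_read

      if (.not. is_restart_read) then
         ! Promotion/demotion, fusion and termination only during normal
         ! timestepping; on a restart read, canopy position would instead
         ! be carried in the restart file (e.g. the patch ncl_p variable).
         call canopy_structure_stub()
      end if

      ! Bookkeeping that is still needed when reading a restart.
      call site_diagnostics_stub()
   end subroutine ed_update_site_sketch

   subroutine canopy_structure_stub()
   end subroutine canopy_structure_stub

   subroutine site_diagnostics_stub()
   end subroutine site_diagnostics_stub

end module site_update_restart_sketch

This mirrors the approach reported later in the thread: bypassing canopy_structure() via update_site() on the restart and adding patch%ncl_p to the restart file.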

@mvdebolskiy

@rgknox
I've tried changing the condition in the clm_driver to is_end_curr_day(), but it did not help.

@rgknox

rgknox commented Oct 13, 2023

Note, I tried a test using two-stream instead of Norman radiation; it does not have an effect on the restart compare.

@rosiealice

It is indeed confusing why things should be terminating on restart, and why this is triggered after and not before the restart occurs. I also agree that we should not, in principle, need to call canopy_structure during the restarting process. Doing away with it might prevent this termination from occurring, but might actually mask the reason why the conditions for termination change. It would probably be useful to track down what is triggering the termination here.

@mvdebolskiy

mvdebolskiy commented Oct 13, 2023

fates branch I am working in
ctsm

Regarding failing pfts: it's not just 5 and 6 but also 1, 3, 9, and 12.

@mvdebolskiy

mvdebolskiy commented Oct 13, 2023

Things from today's meeting:

  • Check that init_cold is not called during the restart read
  • Make NCL_p a restart variable and switch off the canopy_structure call during the restart read
  • Try to fuse on the restart write to see if there are cohorts to fuse
  • Dump small-n cohorts into the log on restart write and read
  • Get rid of the variables FATES_DISTURBANCE_RATE_P2P, FATES_DISTURBANCE_RATE_POTENTIAL, FATES_DISTURBANCE_RATE_TREEFALL from the history files, so the tests don't fail because of them

@glemieux

During the meeting I ran a 110-day 1x1_brazil test at @rosiealice's suggestion to see if we could narrow down to a specific site and make troubleshooting a little easier. I wasn't able to get the issue to replicate in this configuration.

I also tried adding a check for hlm_is_restart in canopy_structure to avoid the do while(area_not_balanced) loop, hoping to avoid the termination calls here:

! Does any layer have excess area in it? Keep going until it does not...
patch_area_counter = 0
area_not_balanced = .true.
do while(area_not_balanced)

   ! ---------------------------------------------------------------------------
   ! Demotion Phase: Identify upper layers that are too full, and demote them to
   ! the layers below.
   ! ---------------------------------------------------------------------------

   ! Its possible that before we even enter this scheme
   ! some cohort numbers are very low. Terminate them.
   call terminate_cohorts(currentSite, currentPatch, 1, 12, bc_in)

   ! Calculate how many layers we have in this canopy
   ! This also checks the understory to see if its crown
   ! area is large enough to warrant a temporary sub-understory layer
   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)

   do i_lyr = 1,z ! Loop around the currently occupied canopy layers.
      call DemoteFromLayer(currentSite, currentPatch, i_lyr, bc_in)
   end do

   ! After demotions, we may then again have cohorts that are very very
   ! very sparse, remove them
   call terminate_cohorts(currentSite, currentPatch, 1,13,bc_in)
   call fuse_cohorts(currentSite, currentPatch, bc_in)
   ! Remove cohorts for various other reasons
   call terminate_cohorts(currentSite, currentPatch, 2,13,bc_in)

   ! ---------------------------------------------------------------------------------------
   ! Promotion Phase: Identify if any upper-layers are underful and layers below them
   ! have cohorts that can be split and promoted to the layer above.
   ! ---------------------------------------------------------------------------------------

   ! Re-calculate Number of layers without the false substory
   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)

   ! We only promote if we have at least two layers
   if (z>1) then
      do i_lyr=1,z-1
         call PromoteIntoLayer(currentSite, currentPatch, i_lyr)
      end do
      ! Remove cohorts that are incredibly sparse
      call terminate_cohorts(currentSite, currentPatch, 1,14,bc_in)
      call fuse_cohorts(currentSite, currentPatch, bc_in)
      ! Remove cohorts for various other reasons
      call terminate_cohorts(currentSite, currentPatch, 2,14,bc_in)
   end if

   ! ---------------------------------------------------------------------------------------
   ! Check on Layer Area (if the layer differences are not small
   ! Continue trying to demote/promote. Its possible on the first pass through,
   ! that cohort fusion has nudged the areas a little bit.
   ! ---------------------------------------------------------------------------------------

   z = NumPotentialCanopyLayers(currentPatch,currentSite%spread,include_substory=.false.)
   area_not_balanced = .false.
   do i_lyr = 1,z
      call CanopyLayerArea(currentPatch,currentSite%spread,i_lyr,arealayer(i_lyr))
      if( ((arealayer(i_lyr)-currentPatch%area)/currentPatch%area > area_check_rel_precision) .or. &
          ((arealayer(i_lyr)-currentPatch%area) > area_check_precision ) ) then
         area_not_balanced = .true.
      endif
   enddo

   ! ---------------------------------------------------------------------------------------
   ! Gracefully exit if too many iterations have gone by
   ! ---------------------------------------------------------------------------------------

   patch_area_counter = patch_area_counter + 1
   if(patch_area_counter > max_patch_iterations .and. area_not_balanced) then
      write(fates_log(),*) 'PATCH AREA CHECK NOT CLOSING'
      write(fates_log(),*) 'patch area:',currentpatch%area
      do i_lyr = 1,z
         write(fates_log(),*) 'layer: ',i_lyr,' area: ',arealayer(i_lyr)
         write(fates_log(),*) 'rel error: ',(arealayer(i_lyr)-currentPatch%area)/currentPatch%area
         write(fates_log(),*) 'abs error: ',arealayer(i_lyr)-currentPatch%area
      enddo
      write(fates_log(),*) 'lat:',currentSite%lat
      write(fates_log(),*) 'lon:',currentSite%lon
      write(fates_log(),*) 'spread:',currentSite%spread
      currentCohort => currentPatch%tallest
      do while (associated(currentCohort))
         write(fates_log(),*) 'coh ilayer:',currentCohort%canopy_layer
         write(fates_log(),*) 'coh dbh:',currentCohort%dbh
         write(fates_log(),*) 'coh pft:',currentCohort%pft
         write(fates_log(),*) 'coh n:',currentCohort%n
         write(fates_log(),*) 'coh carea:',currentCohort%c_area
         ipft=currentCohort%pft
         write(fates_log(),*) 'maxh:',prt_params%allom_dbh_maxheight(ipft)
         write(fates_log(),*) 'lmode: ',prt_params%allom_lmode(ipft)
         write(fates_log(),*) 'd2bl2: ',prt_params%allom_d2bl2(ipft)
         write(fates_log(),*) 'd2bl_ediff: ',prt_params%allom_blca_expnt_diff(ipft)
         write(fates_log(),*) 'd2ca_min: ',prt_params%allom_d2ca_coefficient_min(ipft)
         write(fates_log(),*) 'd2ca_max: ',prt_params%allom_d2ca_coefficient_max(ipft)
         currentCohort => currentCohort%shorter
      enddo
      call endrun(msg=errMsg(sourcefile, __LINE__))
   end if

enddo ! do while(area_not_balanced)

This hit an endrun during the restart here almost immediately after restart initialization:

if ( currentPatch%total_canopy_area>currentPatch%area ) then
   if ( currentPatch%total_canopy_area-currentPatch%area > 0.001_r8 ) then
      write(fates_log(),*) 'FATES: canopy area bigger than area', &
           currentPatch%total_canopy_area ,currentPatch%area, &
           currentPatch%total_canopy_area -currentPatch%area,&
           currentPatch%nocomp_pft_label
      call endrun(msg=errMsg(sourcefile, __LINE__))

cheyenne folder: /glade/u/home/glemieux/scratch/ctsm-tests/tests_issue1051_d110_f45_areabalavoid

@rgknox

rgknox commented Oct 13, 2023

@glemieux , are you saying that the model crashes if we don't call that promotion/demotion code during the restart sequence?

@rgknox

rgknox commented Oct 13, 2023

My tests are now passing when bypassing canopy_structure() via update_site() on the restart, as well as adding patch%ncl_p to the restart.

@rgknox

rgknox commented Oct 13, 2023

Can someone else test these branches to see if these fixes work for them? I'll try some more, longer tests to see if those work:

https://github.com/rgknox/ctsm/tree/fates-nocomp-fix

The fates branch was added to the externals of the clm branch btw.

https://github.com/rgknox/fates/tree/nocomp-fix-tests

fates-side diff for convenience: https://github.com/NGEET/fates/compare/main...rgknox:fates:nocomp-fix-tests?w=1

@glemieux

@glemieux , are you saying that the model crashes if we don't call that promotion/demotion code during the restart sequence?

@rgknox yes that's what I was seeing. This was with the latest ctsm tag using sci.1.67.2_api.27.0.0.

@glemieux

@rgknox I confirmed that the branch fix worked for me as well. This was a 110-day, f45 test on Cheyenne.
Cheyenne folder: /glade/u/home/glemieux/scratch/ctsm-tests/tests_issue1051_d110_f45_rgknox-nocompfix

@rosiealice

Amazing!

@mvdebolskiy

I checked that init_cold is not called on restart. I will try to test rgknox's fix with a longer test over the weekend.

@mvertens

mvertens commented Oct 14, 2023 via email

@mvdebolskiy

@rgknox I am a bit confused. You have added the variables in FatesRestartInterfaceMod as r8, but in FatesPatchMod they are integers.

@rgknox

rgknox commented Oct 16, 2023

@mvdebolskiy, I believe you are correct; the patch%ncl_p variable should be stored as a "cohort_int" and be accessed via this%rvars(***)%int1d. Thanks for catching that. The compiler was doing us a favor and converting from real to int for us. I'll make the change.

@mvdebolskiy

Quick question:
Do I understand correctly that canopy layers 1 ... nclmax are going in the order from tallest to shortest?

@rgknox

rgknox commented Oct 17, 2023

yes, layer 1 is the upper-most canopy layer, and then it works downwards.

@rgknox

rgknox commented Oct 27, 2023

While #1098 seems to enable some long restarts to pass, it does not seem to address the whole problem.
I do generate fails for longer ERS tests, for example 11-month runs with full fates on an f45.
I'm planning a few different tests:

  1. trying with fates_comp_excln = -1 (i.e. strict rank-based promotion/demotion) instead of probabilistic
  2. again, removing deciduous for evergreen only
  3. I may try with fixed biogeog/nocomp (my full-fates runs generate the restart fail), but I believe @mvdebolskiy might be generating restart compare fails with this setting already?
  4. fire/no fire

@rgknox

rgknox commented Oct 28, 2023

I believe that this call to canopy spread is also a problem for restarts: https://github.com/NGEET/fates/blob/main/main/EDMainMod.F90#L789

If you look at the spread calculation, it is an "incrementer", i.e. it will modify the spread a little more each call until the canopy goes all the way towards closed. This is problematic if it is called more times in the restart sequence than in the original sequence. I turned this call off on the restart, and the 11-month ERS test passed. I'll add this change to the testing branch and continue to see if we can find more problems.
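A self-contained toy of the "incrementer" point (this is not the canopy_spread code; the increment size and bounds are invented) showing why one extra call during the restart read breaks bit-for-bit agreement:

program spread_increment_demo
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   real(r8) :: spread_cont, spread_rest
   integer  :: i

   spread_cont = 0.5_r8   ! state in the continuous run
   spread_rest = 0.5_r8   ! state in the restarted run

   ! continuous run: one spread update per model day
   do i = 1, 10
      call update_spread(spread_cont)
   end do

   ! restarted run: the same 10 daily updates plus one extra call made
   ! while reading the restart file
   call update_spread(spread_rest)
   do i = 1, 10
      call update_spread(spread_rest)
   end do

   print *, 'continuous: ', spread_cont, ' restarted: ', spread_rest

contains

   subroutine update_spread(spread)
      ! Toy stand-in for canopy_spread: nudge spread toward a closed canopy,
      ! bounded at 1 (the real code uses canopy area to set the step).
      real(r8), intent(inout) :: spread
      spread = min(1.0_r8, spread + 0.01_r8)
   end subroutine update_spread

end program spread_increment_demo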

Other tests that also pass:

ERS_Lm15.f45_f45_mg37.I2000Clm51Fates.cheyenne_intel.clm-Fates
ERS_Lm25_P144x1.f10_f10_mg37.I2000Clm51Fates.cheyenne_intel.clm-FatesColdNoComp

@rosiealice

Hi Ryan,

Nice catch, this sounds like a very viable culprit to me...

@mvertens

mvertens commented Oct 30, 2023 via email

@rgknox

rgknox commented Oct 30, 2023

@mvertens I think that's a great idea. I was wondering how the queue wall-time is calculated for the tests, to make sure that we don't push up against the maximum. I was also playing around with maximizing run time while keeping the wall-time somewhat consistent with other tests. For instance, in that Lm25 test I tried, I doubled the core count by using the P144x1 setting, and it took around 1000 seconds (17 minutes), which is not too bad. I think that is a reasonable wall-time to include in the test suite. Any ideas/brainstorming are welcome.
I was also going to try creating a test that initializes the vegetation demographics from initialization files. I think this will go a long way towards making our tests more robust. If we do this, I'd like us to integrate the pull request that updates allometry (#1093) before we set the initialization files up.

@rgknox

rgknox commented Oct 30, 2023

I see that we have control over wallclock time in testlist_clm.xml, so this is good.

@rgknox

rgknox commented Oct 30, 2023

peterdschwartz added a commit to E3SM-Project/E3SM that referenced this issue Jan 30, 2024
… next (PR #6018)

This pull request updates the ed_update_site call in elmfates_interfacemod to pass a flag for when this procedure is called during restart.
This update should be coordinated with NGEET/fates#1098, which addresses the long duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression test coverage to include
more run mode options for fates at a variety of resolutions and runtimes.

[non-BFB] for FATES
Fixes #5548
peterdschwartz added a commit to E3SM-Project/E3SM that referenced this issue Feb 1, 2024
This pull request updates the ed_update_site call in elmfates_interfacemod to pass a flag
for when this procedure is called during restart. This update should be coordinated with NGEET/fates#1098,
which addresses the long duration exact restart issue NGEET/fates#1051.

Additionally this pull request resolves #5548 by expanding the fates regression
test coverage to include more run mode options for fates at a variety of resolutions and runtimes.

[non-BFB] for FATES
Fixes #5548