ERS tests at least one year long failing across multiple test mods #897

glemieux · 2022-08-30T22:36:56Z

In the course of debugging an issue that led to the creation of #894, a subtler, secondary issue was discovered with exact restarts for test cases that run for a year or longer (i.e. most ERS tests fail on COMPARE_base_rest). This currently appears somewhat different from the issue noted in ESCOMP/CTSM#667 (comment). The problem did not at first appear to be confined to any one variable or any particular testmod, although the following has been discovered so far:

All variations on FatesColdDefReducedComplexSatPhen testmod are b4b
The 1x1_brazil grid resolution is b4b
no comp + fixed biogeo is not b4b
FatesColdDef is not b4b

Through testing a subset of the run modes I've found that FatesColdDefReducedComplexNoComp will run b4b if I comment out the call to trim_canopy, turn fire off, set nclmax = 1, and set test_zero_mortality = .true.. Trying the same setup with FatesColdDef will result in a failure on COMPARE_base_rest.

The thread I'm currently following is assessing the DIFFs for the first of the two test setups above, but with trim_canopy on. I've found that both bc_in%h2o_liqvol_sl and tveg24 vary on the final pass through the call to phenology. This suggests that there might be a timing issue on the last model day of the year. This, plus a number of diagnostic outputs for the restart variables, lends some confidence that the issue isn't necessarily in the restart initialization.

Also note that these tests were run with #685.

glemieux · 2022-09-01T23:22:23Z

Following the thread of the restart differences around tveg24 points to the problem being located inside the filter loop (within the leaf temperature iterative loop) on the last day of the year in this section of the CanopyFluxes code:

https://github.com/ESCOMP/CTSM/blob/56878b6a77e167c1c875aa9cabdf6ea2e482d737/src/biogeophys/CanopyFluxesMod.F90#L1233-L1257

I've confirmed through diagnostic outputs that the sums of the t_veg_patch values at the start of the loop and at the end of the loop differ across the restarts. Next I'm going to try to isolate the patches to see if there is a specific subset that is problematic (I'm currently narrowing the output to only patches at a known problematic fates site). In this way I hope to better identify which of the multiple variables going into the dt_veg calculation may be causing the difference.
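A minimal sketch of the patch-isolation idea described above, written in Python rather than taken from CTSM or FATES. It compares two per-patch value lists bit for bit and reports which patch indices differ. The helper names (`bits`, `differing_patches`) and the temperature values are hypothetical and chosen only for illustration.

```python
import struct


def bits(x: float) -> int:
    """Return the raw 64-bit pattern of a double so comparison is bit-for-bit."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def differing_patches(base, rest):
    """Return the indices where two per-patch value lists are not b4b identical."""
    if len(base) != len(rest):
        raise ValueError("patch lists must have the same length")
    return [i for i, (b, r) in enumerate(zip(base, rest)) if bits(b) != bits(r)]


# Hypothetical per-patch leaf temperatures (K) from a base run and a restart run.
t_veg_base = [287.15, 290.40, 285.02, 291.73]
t_veg_rest = [287.15, 290.40, 285.02 + 1e-12, 291.73]

# An aggregate check only says that something differs ...
print("sums identical:", sum(t_veg_base) == sum(t_veg_rest))
# ... while the per-patch check says where.
print("differing patch indices:", differing_patches(t_veg_base, t_veg_rest))
```

Comparing bit patterns instead of using a tolerance mirrors what a b4b test requires: any difference, however small, counts as a failure.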

glemieux · 2022-09-12T18:19:47Z

Tracing the issue led me through the host land model and back to fates, to the trim_canopy check here:

fates/biogeochem/EDPhysiologyMod.F90

Lines 591 to 608 in 12ce31c

    
    if (currentCohort%year_net_uptake(z) < currentCohort%leaf_cost) then
       ! Make sure the cohort trim fraction is great than the pft trim limit
       if (currentCohort%canopy_trim > EDPftvarcon_inst%trim_limit(ipft)) then
          ! keep trimming until none of the canopy is in negative carbon balance.
          if (currentCohort%hite > EDPftvarcon_inst%hgt_min(ipft)) then
             currentCohort%canopy_trim = currentCohort%canopy_trim - &
                  EDPftvarcon_inst%trim_inc(ipft)
             if (prt_params%evergreen(ipft) /= 1)then
                currentCohort%leafmemory = currentCohort%leafmemory * &
                     (1.0_r8 - EDPftvarcon_inst%trim_inc(ipft))
             endif
             trimmed = .true.
          endif ! hite check
       endif ! trim limit check
    endif ! net uptake check

At least part of the ERS issue is that year_net_uptake is not being carried over in the restart. Thus, if the restart is kicked off mid-year, the yearly net uptake will be less than in the base run and some of the cohorts will avoid being trimmed in this check. Talking to @rgknox, ideally we would roll this fix into #769, but we agreed that focusing on the fix is a priority. I will test this fix by adding the yearly uptake to the restart with the full complement of every leaf layer for the time being.
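To illustrate why an accumulator that is missing from the restart file can break an exact restart, here is a toy Python model. It is not FATES code: the `Cohort` class, the constants, and the monthly uptake values are all hypothetical, and the end-of-year check is reduced to the single comparison of uptake against leaf cost.

```python
from dataclasses import dataclass

LEAF_COST = 6.0   # hypothetical annual cost of keeping one leaf layer
TRIM_INC = 0.01   # hypothetical trim increment


@dataclass
class Cohort:
    year_net_uptake: float = 0.0  # running annual sum for one leaf layer
    canopy_trim: float = 1.0


def run_months(cohort, monthly_uptake, first, last):
    """Accumulate net uptake for months first..last (1-based, inclusive)."""
    for month in range(first, last + 1):
        cohort.year_net_uptake += monthly_uptake[month - 1]


def end_of_year_trim(cohort):
    """Simplified stand-in for the net uptake check: trim if uptake < cost."""
    if cohort.year_net_uptake < LEAF_COST:
        cohort.canopy_trim -= TRIM_INC
    cohort.year_net_uptake = 0.0  # start a new annual sum


def simulate(restart_month=None, save_accumulator=True):
    """Run one year, optionally stopping and restarting after restart_month."""
    monthly_uptake = [1.0] * 12  # hypothetical constant uptake
    cohort = Cohort()
    if restart_month is None:
        run_months(cohort, monthly_uptake, 1, 12)
    else:
        run_months(cohort, monthly_uptake, 1, restart_month)
        # "Write" and "read" the restart: canopy_trim always survives, the
        # accumulator only survives if it is part of the restart state.
        saved = cohort.year_net_uptake if save_accumulator else 0.0
        cohort = Cohort(year_net_uptake=saved, canopy_trim=cohort.canopy_trim)
        run_months(cohort, monthly_uptake, restart_month + 1, 12)
    end_of_year_trim(cohort)
    return cohort.canopy_trim


print("base run:                  ", simulate())
print("restart, accumulator saved:", simulate(8, save_accumulator=True))
print("restart, accumulator lost: ", simulate(8, save_accumulator=False))
```

In this toy setup the restarted run without the saved accumulator reaches a different trim decision than the base run, while the run that saves it matches exactly. Which way the decision flips in FATES depends on how the accumulator is reinitialized after a restart, which this sketch does not try to model.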

glemieux · 2022-09-13T00:18:15Z

Adding year_net_uptake to the restart interface results in b4b restart runs for the few test mod and grid combinations that I have exercised so far, but only for tests that start on December 1 and run through the end of the year. Extending the total run time to 2 months (i.e. starting in November) or more (e.g. starting on Jan 1 for a one-year run) results in a COMPARE_base_rest failure.

This suggests that the year_net_uptake is certainly part of the issue, but that there are likely multiple problems to contend with.

glemieux · 2022-09-19T18:47:42Z

I realized I made an error in my initial fix to add year_net_uptake to the restart file. Fixing the very simple, but memory intensive, implementation results in b4b restarts. However, attempting a more complex restart using the RegisterCohortVector subroutine did not result in b4b restarts. Currently investigating whether it's my implementation or something else.

glemieux · 2022-09-20T19:29:35Z

The issue is not with the restart implementation method. I simply missed that I had taken out the nclmax change that I had been using during my investigations. So the current standing is that with nclmax = 1, a 13-month f10 nocomp test with restart will return b4b results. Fully dynamic fates will not restart with b4b results, however. If I reset nclmax to the default (2), then nocomp tests go back to failing the restart.

glemieux · 2022-11-22T17:53:09Z

For future reference, since this has moved down the priority list, the branch for adding year_net_uptake to the restart is https://github.com/glemieux/fates/commits/restart-nlevleaf

glemieux · 2023-11-16T23:51:37Z

This appears to have been fixed by #1098.

glemieux self-assigned this Aug 30, 2022

glemieux added the bug - software engineering label Aug 30, 2022

glemieux mentioned this issue Aug 30, 2022

Add more FATES tests that are longer than one year ESCOMP/CTSM#1839

Closed

glemieux mentioned this issue Sep 12, 2022

Add long term exact restart test and fixed biogeog + no competition tests to fates suite ESCOMP/CTSM#1849

Merged

glemieux mentioned this issue Jul 11, 2023

failure in ERS 3 month nocomp test #1051

Closed

This was referenced Nov 16, 2023

Define new fates_param_reader_type type to abstract HLM-side param I/O #1096

Merged

Long run restart fix #1098

Merged

glemieux closed this as completed Nov 16, 2023

glemieux linked a pull request Nov 16, 2023 that will close this issue

Long run restart fix #1098

Merged
