
Restart failed with error "longwave down is 0 or negative" #935

Closed
XiulinGao opened this issue Nov 9, 2022 · 44 comments

@XiulinGao
Contributor

Hi FATES team, I'm running a regional simulation at 9 km spatial resolution with masked land units (grasslands only) on Cheyenne using the NUOPC driver. It's a 40-year simulation with resubmits, but the model failed right away after restart with an error pointing to negative longwave radiation. I took the following steps to identify the problem:

  1. I tested the forcing data for the specific year where the model failed and found no negative longwave radiation, so I conclude the forcing data are not at fault.
  2. I restarted using another restart file and it failed right away with the same error, so I conclude it is likely a restart issue.

Some background info for this regional case can be found here: ESCOMP/CTSM#1773

I have posted this on the CESM forum, but thought I would also post it here in case anyone has run into a similar issue before and already solved it.

Thanks!

@rgknox
Contributor

rgknox commented Nov 9, 2022

@XiulinGao could you link that post on the CESM forum you mentioned? thanks

@ekluzek
Collaborator

ekluzek commented Nov 9, 2022

How are you specifying the longwave (LW) down from your forcing data? Is LW down one of the provided fields, or are you using other fields to calculate LW down? What's the list of fields in the forcing data?

@XiulinGao
Contributor Author

@rgknox Ryan: here is the link to the post https://bb.cgd.ucar.edu/cesm/threads/longwave-down-is-zero-or-negative-error-replaced-default-gswp3-forcing-with-wrf-forcing-at-hourly-scale.7784/

@ekluzek Erik: LW down is specified in the forcing by the variable FLDS. I processed the forcing data according to the requirements of the CLM1PT forcing format, so the data are not the same as GSWP3. I forgot to mention that I'm running this under the GSWP3v1 datm mode but replaced all of the default forcing with the WRF data. I have suspected that the datm stream settings might need to be changed to reflect the higher time resolution of the WRF forcing, which is at a one-hour interval.
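(By "datm stream settings" I mean overrides of the kind that can go in user_nl_datm_streams. A rough sketch of what I have in mind, where the stream names are placeholders that would need to be checked against the datm.streams.xml generated in the run directory:

CLMGSWP3v1.Precip:tintalgo = nearest
CLMGSWP3v1.Precip:dtlimit = 1.5

I have not confirmed that any of these actually need changing for hourly data.)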

@jkshuman
Contributor

Can you try running this with a different set of years to see if you get the same failure?
Another user had a similar error for a CTSM single-point case.

@XiulinGao
Contributor Author

@jkshuman I re-configured my PE layout so that a single run finishes within 2 hours, to get past this restart error for now. Just to let you know: the run finishes without problems, but the results are totally wrong. I checked the output forcing variables and see that both rain and longwave radiation lose their spatial variation during the run (see attached figures). I'm not sure whether this is the root of the issue, but it is wrong no matter what. I also attached the precipitation plot for 1981-01 from the forcing I'm using, which definitely shows a strong spatial pattern.
(attached figures: rain-simu, flds, fsds, temp-2m, 1981-01-rain)

@ashehad

ashehad commented Nov 14, 2022

Maybe you can try doing some spatial simulations using the original NCEP or CRUNCEP data, but let CLM downscale it to 9 km by prescribing it in the user_nl_datm.

@XiulinGao
Contributor Author

Maybe you can try doing some spatial simulations using the original NCEP or CRUNCEP data, but let CLM downscale it to 9 km by prescribing it in the user_nl_datm.

To do that, should I set mapalgo to bilinear, add a mesh file (for my domain of interest?), and leave the default forcing file unchanged? I'm not sure how to do that.

@ashehad

ashehad commented Nov 14, 2022

Yes, set mapalgo to bilinear. For the domain of interest - we normally obtain it when we construct the surface file. And, keep the default forcing file as it is.

@XiulinGao
Contributor Author

XiulinGao commented Nov 14, 2022

Update: I followed Ashehad's suggestion to run the simulation (however, still with the WRF forcing ;)). Instead of leaving mapalgo and the mesh file set to 'none', I set mapalgo to 'bilinear' and added my mesh file. Judging by the first few output files, the RAIN variable is now back to normal, with values varying across cells. I'll follow up after I run with a restart and see whether that fails again. Also, would nearest neighbor make more sense? Since I'm using 9 km forcing for a 9 km simulation, nearest-neighbor spatial interpolation seems more appropriate.
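(For reference, a sketch of the kind of override I mean, with placeholder stream names and a placeholder mesh path; the exact stream names are in the datm.streams.xml written to the run directory. These lines go in user_nl_datm_streams:

CLMGSWP3v1.Precip:mapalgo = bilinear
CLMGSWP3v1.Precip:meshfile = /path/to/wrf_9km_forcing_ESMFmesh.nc

and similarly for the other streams.)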

@XiulinGao
Contributor Author

The longwave radiation error still persists after I changed the forcing settings, so that's not the issue. I'll try the default GSWP3 forcing to see whether the restart also fails.

@slevis-lmwg
Contributor

slevis-lmwg commented Dec 7, 2022

In case this helps...
I encountered the "longwave down is <= 0" error while dismantling CTSM code for SLIM in this PR: ESCOMP/SimpleLand#46

In my case, the error is triggered at startup (not restart) when I remove this line
lnd2atm_inst%t_rad_grc(g) = sqrt(sqrt(lnd2atm_inst%eflx_lwrad_out_grc(g)/sb))
from lnd2atmMod.F90, which means that lnd2atm_inst%t_rad_grc(g) = 0 from initialization.
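(For context, that line is just the Stefan-Boltzmann law inverted,

T_rad = ( LW_out / sigma )^(1/4)

i.e. the outgoing longwave flux converted to a radiative temperature, so removing it leaves t_rad_grc at its initialized value of zero.)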

@XiulinGao
Contributor Author

In case this helps... I encountered the "longwave down is <= 0" error while dismantling CTSM code for SLIM in this PR: ESCOMP/SimpleLand#46

In my case, the error is triggered at startup (not restart) when I remove this line lnd2atm_inst%t_rad_grc(g) = sqrt(sqrt(lnd2atm_inst%eflx_lwrad_out_grc(g)/sb)) from lnd2atmMod.F90, which means that lnd2atm_inst%t_rad_grc(g) = 0 from initialization.

Thanks Sam. I tried taking this line of code out and it still gives the same error. I did a run with debug mode on; here is something I see in the cesm log file. Any thoughts?

(screenshot of the cesm log excerpt, 2023-03-03)

@XiulinGao
Contributor Author

I looked at the file referenced by the datm rpointer; it turns out that in the Date variable there are a lot of negative values, which have no meaning to me.
(screenshot of the Date variable values, 2023-03-03)
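(The values can be dumped with ncdump, e.g. something along the lines of

ncdump -v Date <case>.datm.r.<yyyy-mm-dd-sssss>.nc

with the actual file name taken from the datm rpointer in the run directory.)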

@slevis-lmwg
Contributor

In case this helps... I encountered the "longwave down is <= 0" error while dismantling CTSM code for SLIM in this PR: ESCOMP/SimpleLand#46
In my case, the error is triggered at startup (not restart) when I remove this line lnd2atm_inst%t_rad_grc(g) = sqrt(sqrt(lnd2atm_inst%eflx_lwrad_out_grc(g)/sb)) from lnd2atmMod.F90, which means that lnd2atm_inst%t_rad_grc(g) = 0 from initialization.

thanks Sam. I tried to take this line of code out and it still gives the same error. I did a run with debug mode 1, here is something I see in the cesm log file, any thoughts?

Sorry for the confusion. I didn't mean to suggest removing that line of code. In my case, removing that line of code triggered the error at startup, which means (in my case) that lnd2atm_inst%t_rad_grc(g) = 0 from initialization. If my case suggests anything about your case, and if I still remember my reasoning from my Dec 7th post, then I was thinking that you may also be ending up with lnd2atm_inst%t_rad_grc(g) = 0 but in your case maybe due to lnd2atm_inst%eflx_lwrad_out_grc(g) = 0 at restart...

Beyond this speculation, I would have to try to reproduce your error and try to debug it. Please let me know @XiulinGao if you would like me to do that, and I will follow up with you.

@rgknox
Contributor

rgknox commented Mar 6, 2023

@XiulinGao could you provide the full logs? In particular, I'm curious about the cesm, lnd and datm logs

@XiulinGao
Contributor Author

In case this helps... I encountered the "longwave down is <= 0" error while dismantling CTSM code for SLIM in this PR: ESCOMP/SimpleLand#46
In my case, the error is triggered at startup (not restart) when I remove this line lnd2atm_inst%t_rad_grc(g) = sqrt(sqrt(lnd2atm_inst%eflx_lwrad_out_grc(g)/sb)) from lnd2atmMod.F90, which means that lnd2atm_inst%t_rad_grc(g) = 0 from initialization.

thanks Sam. I tried to take this line of code out and it still gives the same error. I did a run with debug mode 1, here is something I see in the cesm log file, any thoughts?

Sorry for the confusion. I didn't mean to suggest removing that line of code. In my case, removing that line of code triggered the error at startup, which means (in my case) that lnd2atm_inst%t_rad_grc(g) = 0 from initialization. If my case suggests anything about your case, and if I still remember my reasoning from my Dec 7th post, then I was thinking that you may also be ending up with lnd2atm_inst%t_rad_grc(g) = 0 but in your case maybe due to lnd2atm_inst%eflx_lwrad_out_grc(g) = 0 at restart...

Beyond this speculation, I would have to try to reproduce your error and try to debug it. Please let me know @XiulinGao if you would like me to do that, and I will follow up with you.

Yes please! Here is the case directory: /glade/work/xiugao/Regional-WRF/Simulations/oak-grass_restart_CLM_FATES. If you look in the run directory /glade/scratch/xiugao/Regional-WRF/Simulations/oak-grass_restart_CLM_FATES/run at the datm restart file, you can see that the first nt (nstreams, nfiles) variable has a lot of zeros, which is different from an interval restart file saved during the run (see /glade/scratch/xiugao/Regional-WRF/Simulations/regional-avba-N08T27_CLM_FATES/run/regional-avba-N08T27_CLM_FATES.datm.r.2021-01-01-00000.nc for example). This makes me suspect it might be the cause of the error message.

@XiulinGao
Contributor Author

Longwave radiation error still persists after I changed the forcing setting, so that's not the issue. I'll try to use default GSWP3 forcing to see if restart also fails.

I did a simulation with the GSWP3 forcing and the restart was successful, confirming that this restart issue might be specific to the WRF forcing and domain we are using.

@XiulinGao
Contributor Author

@XiulinGao could you provide the full logs? In particular, I'm curious about the cesm, lnd and datm logs

Ryan, I'll get back to you later once I have the detailed logs with debug turned on; those logs were overwritten when I switched back to non-debug. When running without debug, the error message is only 'longwave radiation sent from atm model is negative or zero', without details.

@XiulinGao
Contributor Author

@XiulinGao could you provide the full logs? In particular, I'm curious about the cesm, lnd and datm logs

Here are the atm, cesm and lnd logs: https://drive.google.com/drive/folders/1Zhc4SkJeYaMwHlKsi4WmS_Ec29F61rvs?usp=sharing

@slevis-lmwg
Contributor

slevis-lmwg commented Mar 11, 2023

From @XiulinGao's copy of the CTSM I replicated her restart failure as follows:
./create_clone --clone /glade/work/xiugao/Regional-WRF/Cases/oak-grass_restart_CLM_FATES/ --case ~/cases_FATES/oak-grass_CLM_FATES_clone --cime-output-root /glade/scratch/slevis
...and used the /SourceMods directory in the cloned case to change the code. This way I added these write statements before the failure:

g, forc_lwrad_not_downsc... =         1  0.000000000000000E+000
g, latdeg, londeg =   32.5897750854492       -116.365264892578
 ERROR:
 (cpl:utils:check_for_errors) ERROR: Longwave down sent from the atmosphere model is negative or zero
  • I have confirmed that the WRF forcing does not contain zeros (see /glade/work/xiugao/fates-input/ca-wrf-grassland/CLM1PT_data/9km_California_v1-1_c20230301/1981-01.nc).
  • Xiulin tried the simulation with GSWP3 forcing and the restart was successful (as she stated in an earlier post).
  • I tried a simulation with the same grid but not sparse (ie, mask = land everywhere) and got the same error at restart.
  • I did an initial start with finidat equal to the restart file instead of doing a restart. This one works, which tells me that datm or some other non-land code has trouble handling the curvilinear grid at restart. I would be happy to look at the datm with somebody more experienced than I to try to address this.

Meanwhile, though, I believe an initial run can give the same answers as a restart as long as one updates the start dates and finidat correctly. Xiulin, I will confirm, and then we can discuss how you can do this, because I see it as your only immediate option while restarts do not work for curvilinear grids.

@ekluzek do you know of others using curvilinear grids that we could ask for guidance, or is Xiulin the first?

@slevis-lmwg
Contributor

@XiulinGao I am almost there, and I hope @ekluzek may have feedback:
I compare yr-2 history from a two-year startup run versus an initial run that only repeats year 2. In the first month of yr 2 I see diffs in the forcing variables TBOT, QBOT, FSDS, etc. which cause diffs throughout. In subsequent months I see NO diffs in the forcing variables and all other variables become less different. I think this may be because the initial run (ie, the second one) runs an extra timestep at the beginning.
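(One way to do this kind of history comparison is cime's cprnc tool, e.g., with placeholder file names:

cprnc startup_case.clm2.h0.1982-01.nc initial_case.clm2.h0.1982-01.nc

which reports field-by-field differences; I mention it only in case it is useful for reproducing the comparison.)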

@ekluzek I think a branch simulation will get past this problem. I also vaguely recall that there's a way to start a branch without pointing to the datm restart file, right? I will look into it.

@XiulinGao if you wanted to try using initial runs (manually) for now, you could see how I did it in this case (otherwise you could wait until next week when I hope to figure out the "branch" solution):
/glade/u/home/slevis/cases_FATES/oak-grass_CLM_FATES_clone
For the 2-yr startup simulation I did:

cp env_run.xml_first env_run.xml
cp user_nl_clm.first user_nl_clm

For the initial run that repeated year 2, I did:

cp env_run.xml_continue env_run.xml
cp user_nl_clm.continue user_nl_clm

So each time that you need to restart, you need to update user_nl_clm and env_run.xml similar to how I did for my "continue" simulation.
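(Schematically, each "continue" segment amounts to something like the following, where the date and the finidat path are placeholders; the exact settings are in the env_run.xml_continue and user_nl_clm.continue files mentioned above:

./xmlchange RUN_TYPE=startup,RUN_STARTDATE=1902-01-01
# and in user_nl_clm, point finidat at the restart file written by the previous segment:
# finidat = '/glade/scratch/<user>/oak-grass_CLM_FATES_clone/run/oak-grass_CLM_FATES_clone.clm2.r.1902-01-01-00000.nc'

)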

@slevis-lmwg
Contributor

...I may have been wrong about branch runs:

  • branch failed
  • hybrid worked but gave same results as the earlier attempt, so it probably starts one timestep off

@slevis-lmwg
Contributor

Update:
I had a quick meeting with @mvertens where she noticed that the "atm" dimensions seem wrong in the cpl.r file:

netcdf oak-grass_restart_CLM.cpl.r.1901-01-01-05400 {
dimensions:
        time = UNLIMITED ; // (1 currently)
        ntb = 2 ;
        atmImp_nx = 1 ;
        atmImp_ny = 1 ;
        atmFrac_nx = 1 ;
        atmFrac_ny = 1 ;
        lndImp_nx = 147 ;
        lndImp_ny = 151 ;
        lndExp_nx = 147 ;
        lndExp_ny = 151 ;
        lndFrac_nx = 147 ;
        lndFrac_ny = 151 ;

@mvertens suggested a short run to check whether the same is true in a cpl.h file generated after restart. The answer is yes:

netcdf oak-grass_restart_CLM.cpl.hi.1901-01-01-07200 {
dimensions:
        time = UNLIMITED ; // (1 currently)
        ntb = 2 ;
        atmImp_nx = 1 ;
        atmImp_ny = 1 ;
        Med_frac_atm_nx = 1 ;
        Med_frac_atm_ny = 1 ;
        MED_atm_nx = 1 ;
        MED_atm_ny = 1 ;
        lndImp_nx = 147 ;
        lndImp_ny = 151 ;
        lndExp_nx = 147 ;
        lndExp_ny = 151 ;
        Med_frac_lnd_nx = 147 ;
        Med_frac_lnd_ny = 151 ;
        MED_lnd_nx = 147 ;
        MED_lnd_ny = 151 ;

@mvertens

mvertens commented Mar 16, 2023

(editing my handle to @slevisconsulting because there is a person out there, unknown to me, who has responded in the past to the handle that you used for me :-))

@slevisconsulting - in looking at your $CASEROOT - I see the following:

$ ./xmlquery -p NX
ATM_NX: 1
GLC_NX: 0
ICE_NX: 0
LND_NX: 1
OCN_NX: 0
ROF_NX: 0
WAV_NX: 0
$ ./xmlquery -p NY
ATM_NY: 1
GLC_NY: 0
ICE_NY: 0
LND_NY: 1
OCN_NY: 0
ROF_NY: 0
WAV_NY: 0

CTSM and DATM treat these variables differently.
In lnd_comp_nuopc.F90, CTSM sets the nx and ny values used in the mediator in the following:

    ! Set scalars in export state
    call State_SetScalar(dble(ldomain%ni), flds_scalar_index_nx, exportState, &
         flds_scalar_name, flds_scalar_num, rc)
    call State_SetScalar(dble(ldomain%nj), flds_scalar_index_ny, exportState, &
         flds_scalar_name, flds_scalar_num, rc)

So even though LND_NX and LND_NY are 1, those values are ignored and the domain values are sent instead.

DATM, on the other hand, uses the ATM_NX and ATM_NY values that are in nuopc.runconfig (obtained from the xml variables ATM_NX and ATM_NY), which are both 1.

The corresponding nx and ny values are used by the mediator to write out 2d history and restart output, and they determine the coordinate axes of what would otherwise be a 1d unstructured list.

My suggestion would be to set the xml variables
ATM_NX and LND_NX to 147
ATM_NY and LND_NY to 151

and see if this resolves the restart problem.

We need to understand why the xml variables are 1 and then find a way to set them correctly.
Let me know if this works.
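(Concretely, from the case root that would be something like:

./xmlquery ATM_NX,ATM_NY,LND_NX,LND_NY
./xmlchange ATM_NX=147,ATM_NY=151
./xmlchange LND_NX=147,LND_NY=151

)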

@slevis-lmwg
Contributor

My suggestion would be to set the xml variables ATM_NX and LND_NX to 147 ATM_NY and LND_NY to 151
and see if this resolves the restart problem.

We need to understand why the xml variables are 1 and then find a way to set them correctly. Let me know if this works.

Good news @mvertens @XiulinGao
I changed the four xml variables in env_build.xml as Mariana suggested. I ran a 1-day cold start and a 1-day restart, and everything worked. Thank you!

@mvertens please let me know if you have a recommendation for a way to correct this problem in the scripts. Should we open a GitHub issue and, if so, under CTSM or elsewhere?

@mvertens

@slevis - great news that it worked!
I think you need to look in component_grids_nuopc.xml and determine whether nx and ny are set correctly for the sparse grid.
I think that would be the place to fix things. Happy to talk again if that would help.

@slevis-lmwg
Contributor

@mvertens
Xiulin and I have been starting these curvilinear-sparse-regional cases with
./create_newcase --res=CLM_USRDAT ...
If I'm understanding your recommendation, one could add a definition for this 147x151 grid in component_grids_nuopc.xml by replicating the pattern that appears there for other grids.

@slevis-lmwg
Contributor

@XiulinGao to try the above suggestion, edit the file
~xiugao/CTSM/ccs_config/component_grids_nuopc.xml

@slevis-lmwg
Contributor

...then create a new case with --res=147x151 (or whatever you happen to name the domain in component_grids_nuopc.xml)

@XiulinGao
Contributor Author

Awesome! Thank you both for the insightful comments and solutions! I tried to define a grid resolution (WRF-SPARSE) in component_grids_nuopc.xml, but it failed with an error message saying:

"Compset specification file is /glade/u/home/xiugao/CTSM/cime_config/config_compsets.xml
Automatically adding SESP to compset
ERROR: no alias WRF-SPARSE defined"

I wonder whether defining a new grid resolution involves not only modifications to component_grids_nuopc.xml but also changes in other places.
But I did successfully run 2 days of simulation with a restart by changing ATM_NX=147, ATM_NY=151, LND_NX=147, and LND_NY=151 when setting up the case. I'll check whether the outcome makes sense by running longer simulations. I'm not sure whether running a regional simulation using a grid resolution predefined for point simulations will mess up the model outcomes.

@slevis-lmwg
Contributor

@XiulinGao I'm glad that you restarted successfully by changing env_build.xml

Regarding the suggestion to modify component_grids_nuopc.xml, you may be right that additional scripts need modification. Let's put that on hold and, instead, I will follow up here with this question for @jedwards4b:

Jim,

  • quick summary: This was a failing restart because the user created the case with --res=CLM_USRDAT and datm defaulted to using ATM_NX=1 ATM_NY=1 from env_build.xml rather than the values in her mesh file (as posted here).
  • the question: Would it make sense to modify the datm code to always pick up these values from the mesh file (as already done for the corresponding nx ny for lnd)? If so, who handles datm code updates? Are you the right person to contact?

Thanks!

@jedwards4b

@slevisconsulting Yes, I am the correct contact. Can you open an issue in CDEPS and provide a test along with the desired outcome?

@jedwards4b

@XiulinGao the information you have provided here is incomplete. It would help me if you could show the modification that you made to component_grids_nuopc.xml and the command that generated the error.

@slevis-lmwg
Contributor

@jedwards4b I posted a new issue in CDEPS so that you may replicate the error.

@XiulinGao
Contributor Author

XiulinGao commented Mar 21, 2023

@XiulinGao the information you have provided here is incomplete. It would help me if you could show the modification that you made to component_grids_nuopc.xml and the command that generated the error.

I think I actually figured out how to do it without manually editing ATM_NX, ATM_NY, LND_NX, and LND_NY.
There are two steps involved:

(screenshot of the two steps, 2023-03-21)

When building the case, specify --res=147x151_california. There is no need to xmlchange the mesh file for lnd and atm, but you still have to point to the mask file for a sparse-grid run.
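(So case creation looks roughly like this; the case name and compset here are placeholders, and --run-unsupported may be needed for a user-defined grid:

./create_newcase --case my_wrf_sparse_case --res 147x151_california --compset <your FATES compset> --run-unsupported
# then point the case at the mask mesh file for the sparse grid, as before

)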

But I agree with Sam's suggestion that automatically reading the dimensions from the mesh file would make things easier.

@jedwards4b

I guess I don't see how you are proposing to read the dimensions from the mesh file?
If you provide the change in component_grids_nuopc.xml then you have the mesh file, but you don't need it because the dimensions are already there. If you don't have that change then how do you propose the mesh file be provided?

@XiulinGao
Contributor Author

You can specify the mesh file when building the case via ./xmlchange ATM_DOMAIN_MESH and LND_DOMAIN_MESH.
That's why we can actually run a regional case under a predefined single-point grid resolution... but the restart gets confused and thinks it's a 1x1 resolution.
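(i.e., roughly, with a placeholder path:

./xmlchange ATM_DOMAIN_MESH=/path/to/wrf_9km_ESMFmesh.nc
./xmlchange LND_DOMAIN_MESH=/path/to/wrf_9km_ESMFmesh.nc

)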

@jedwards4b

I understand now, thanks.

@jedwards4b

So the method used here is what you should do. I think that CLM_USRDAT should not be used in this case.

@XiulinGao
Contributor Author

Thank you! I'm curious how that would be different from running CLM using CLM_USRDAT but manually changing nx, ny, and the mesh file. To me, the predefined grid resolutions serve as a shortcut for setting up the domain and mapping files, which seems unnecessary for nuopc since it does the mapping on the fly and we can easily assign the mesh file and define nx and ny?

@jedwards4b

Sure you can do that - but having to make a bunch of changes after you define a case makes it hard to test that case in any automated testing.

@XiulinGao
Contributor Author

Right, that makes sense. Thanks for all the insightful discussion here; I learned a lot.

@slevis-lmwg
Contributor

@XiulinGao I added step G7 to the list of instructions in discussion #1919, so as to include the above solutions to the restart issue. Feel free to make corrections or additions if you find it necessary.

Also, if this issue (#935) is resolved, then you may wish to close it.

@XiulinGao
Contributor Author

Sounds good. Closing the issue now.
