irreproducible results in variable resolution #631
Comments
This is not a supported grid (that I know of). What is this grid?
It's not a supported grid. It's an experimental grid / cutting-edge science.
Do you have a log file (atm) from one of the runs we can look at? (Are there no changes to the namelist?) Thanks
/glade/scratch/jedwards/testRR_jul2022.001/run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311
Unsupported variable-resolution setups are not stable out of the box. You can see that if you search for "dt" in the atm.log file, where theoretical estimates for stable time-steps are given. Hence we need to set the se_*split variables.
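For reference, a rough sketch of what acting on that advice might look like, run from the case directory. It uses only the se_rsplit variable discussed later in this thread; the path and the value are purely illustrative, and the grid developers below ultimately recommend leaving the time-stepping alone for this grid.
# check the stable time-step estimates CAM-SE prints at startup
grep -i dt /glade/scratch/jedwards/testRR_jul2022.001/run/atm.log.*
# per the comment above, increasing se_rsplit shortens the dynamics and tracer
# time-steps; set it in user_nl_cam (the value 4 is illustrative only)
cat >> user_nl_cam << 'EOF'
 se_rsplit = 4
EOF
./case.submit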
I would expect that in this case it would fail every time. But it doesn't. In that same directory you will see a run
@PeterHjortLauritzen I don't think this is a stability issue because for some tasks it runs (in fact, Isla has this case running over the last few days). Yes, this is an unsupported grid. But I think this issue needs to be looked at because it may be a system issue that impacts all variable-resolution configurations.
In case it's useful, this is my case which is currently running... casedir: /glade/work/islas/cesm2_3_beta08/runs/f.e23.FAMIPfosi.ne0np4.NATL.ne30x8_t13.001
Oh OK ... (I would still recommend decreasing the dynamics and tracer time-steps by increasing se_rsplit; you are the experts here, but I would expect a model that may be unstable and somehow manages to keep running to do weird things)
se_rsplit is currently set to 3. What would you recommend we go to? I assume decreasing the dynamics time-step is going to make the model run a lot slower? Robb has been running experiments with this grid for a while and I don't think anything too peculiar happened.
(Peter - w/ var-res I try to run w/ the same dt's as in an equivalent global uniform-res run. The atm.log dt metrics are never happy with my approach, but so far this has yielded stable runs for everyone I've advised on var-res time-steps.) Let's not get distracted from the main issue!
OK. Apologies for derailing the detective work ...
So, just to clarify, there is no need to change the se_rsplit? I'm restarting anyway because there was an output issue...
I would not recommend changing se_rsplit, or any of the time-stepping. Robb and I have tested these settings extensively.
Ok, sounds good.
Here is one of my cases that has failed: /glade/work/islas/cesm2_3_beta08/runs/testRR_jul2022.001 although I think this is identical to the one that Jim posted above.
I have seen an issue that might be the same. I've been using the same tag as @islasimpson, but with a different grid (refined tropical belt). The run was crashing on SHR_REPROSUM_CALC just like above. The "solution" seemed to be to start from analytic initial conditions, which allowed the run to get started and to complete my 1-day test. Here is the case directory: /glade/work/brianpm/my_cases/test_cases/c2p3b8.f2000climo.trbelta.001 In the current state, this case is using the analytic IC. This is the same grid that @jtruesdal has been testing, and he might have seen the same issue.
@brianpm - was your failure without the analytic initial condition repeatable or intermittent?
I don't know. With analytic initial conditions the run successfully started. With initial conditions derived from regridding with Patrick's VR tools, I was seeing a failure, but I don't know if it was actually repeatable. I saw it on several attempts, as I was trying to work through the case and get it running (with input from @adamrher).
I'm fairly confident that Brian's issue with the TRBELT grid is repeatable. I think it was an unstable initial condition that was resolved by running w/ analytic initial conditions. So my guess is it is not related to this issue, which is characterized by intermittent failures for an identical set of settings. @jtruesdal mentioned that he may have gotten intermittent errors with the TRBELT grid, though. But so far only Isla's NATL grid can reproduce this result. @patcal suggested he may have had a similar issue with various var-res configurations.
My TRBELT case is /glade/p/cgd/amp/jet/cases/F2000climo.ne0np4.trbelta.ne30x8_g17.intel.1080pes.chey.nuopc.cesm23alpha09d.001.dbg
My errors look to be a bad read of the initial conditions file. Right after calling read_inidat and doing a boundary exchange, the prognostic fields contain some bad values. For SE the min/max of the initial state is printed and shows the bad values. This is from the atm.log:
STATE DIAGNOSTICS
U -0.932687431112143+170 0.125020681873722E+03
I have debug print in the cesm.log file showing the locations of the bad values.
@jtruesdal this looks an awful lot like the errors @renerwijn was getting in our new dual-polar var-res grid. The first printout of these stats, during the initialization phase at nstep=0:
When I first saw this I was like, what could possibly be causing the state to go berserk? These don't resemble the values in the ncdata file. However, Rene can correct me, but the ncdata file turned out to be the problem ... or at least, it motivated us to run the US standard atmosphere analytic inic for 4 weeks and spit out a new cam.i file. That cam.i file ended up being stable and not giving the egregious winds at nstep=0. So that anecdote makes me wonder whether it's just an unstable inic that yields this crazy state at nstep=0? The dycore had to have done something to the state at this point, because the ncdata file is on the dynamics grid, right? Is it doing more than just reading in the data at nstep=0?
@adamrher This is printed out after the initial file is read and before dynamics runs. There is some initialization of derived quantities and mucking with edge buffers, but I don't think the state is modified before the print. I will try the analytic init as suggested and create a new initial condition. I guess there could still be some corruption or incompatibility in the NetCDF initial file I'm using.
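For reference, a rough sketch of that recipe (start from an analytic state, then write out a fresh cam.i file), assuming the standard CAM namelist hooks analytic_ic_type and inithist; the 'us_standard_atmosphere' value, the 4-week run length, and the xmlchange settings are illustrative and not verified against this tag.
cat >> user_nl_cam << 'EOF'
 analytic_ic_type = 'us_standard_atmosphere'  ! start from an analytic state instead of ncdata
 inithist         = 'ENDOFRUN'                ! write a cam.i initial file at the end of the run
EOF
./xmlchange STOP_OPTION=ndays,STOP_N=28
./case.submit
# the cam.i file written at the end can then be used as ncdata for subsequent runs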
@adamrher The analytic IC worked, as did a restart from that run. Unfortunately, using the IC produced by the analytic run exhibited the same behavior as before: sometimes working, but most of the time reading an assortment of bad values. The failures show up under the STATE DIAGNOSTICS print in the atm log file and are garbage values, not NaNs or INFs. Jim's test also has a bad state. The fields are read via infld, and the errors seem to be confined to the 3d fields. The garbage values are intermingled with reads of good values on numerous processors. Maybe @gold2718 was right when thinking that the variable-resolution data is exposing an issue in infld.
Indeed, from Jim's intermittent failure run at the top of this thread:
So we have three separate var-res configurations that are able to reproduce this error. At least we're converging on the issue...
I'll test my case this afternoon and report back.
@adamrher - Also, if you can give us the details on the three tests you want to add to the regression tests, we can work on including them. We can work out the details offline if needed.
@cacraigucar regarding the three tests, these seem reasonable to me:
I would set the walltime to 30 minutes, to make sure it will still run when we double our vertical resolution in FHIST runs to L58. I defer to the SEs on whether ERP is the best test (if we only get to choose one) ... I'm just more familiar with it. The only var-res tests we have now are FW (WACCM) tests, so I think it will be good to have these less complex, but arguably more common, compsets working for all three grids.
I will note that currently the CONUS grid will not run out of the box w/ FHIST because at least one emission file does not have year 1979 data in it (I suspect this is because the ACOM folks like to run CONUS with short nudged runs in a more recent year, and didn't bother to make the emissions work for 1979). Note that for the Arctic and Greenland var-res grids, the emissions files are not on the native grids, which means they are interpolated on the fly from (probably) f09 files. ACOM likes to have emissions on the native grids for hi-res (I'm less picky). So I think to resolve this issue we should just ask ACOM to extend their CONUS emissions files to include year 1979.
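For concreteness, a rough sketch of what one of these tests could look like for the NATL grid. The ERP type and 30-minute walltime follow the comment above, but the step count, the test root, and the grid alias are illustrative: CIME test names use dots as field separators, so the long ne0np4.NATL.ne30x8_t13 name would need a dot-free alias, shown here as a placeholder.
cd cime/scripts
# <natl_alias> stands in for whatever dot-free alias is defined for ne0np4.NATL.ne30x8_t13
./create_test ERP_Ln9.<natl_alias>.FHIST.cheyenne_intel --walltime 00:30:00 \
    --test-root /glade/scratch/$USER/cam_var_res_tests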
[People are asking for me to explicitly state the pio and ccs_config versions needed for this fix: update externals to pio2_5_9 and the current head of ccs_config (ccs_config_cesm0.0.44, I believe).]
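For reference, a rough sketch of how those externals might be updated in a CAM checkout, assuming the usual Externals.cfg layout; the [parallelio] section name appears in a diff later in this thread, while [ccs_config] is an assumption.
# edit Externals.cfg so the two sections point at the tags named above:
#   [parallelio]  ->  tag = pio2_5_9
#   [ccs_config]  ->  tag = ccs_config_cesm0.0.44
# then refresh the checkouts and confirm everything is clean
./manage_externals/checkout_externals
./manage_externals/checkout_externals --status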
The TRBELT tests worked. I updated pio and ccs_config manually and have finished a few runs to completion. I also verified a restart run using the global integrals from the log. Everything completes and matches. This looks good from my end.
Regression tests on cheyenne indicate baseline answer changes (which is not expected). @jedwards4b has the following summary: "I can confirm that there is an answer change when I use the new tags. I'm still looking for something in between."
Updating to esmf-8.3.0-ncdfio-mpt-O also causes an answer change.
Updating to esmf-8.3.0b13-ncdfio-mpt-O also fails baseline compare.
Using esmf-8.3.0b07 passes baseline (also using pio2.5.9).
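For context, a rough sketch of how a case might be pointed at a different ESMF installation while bisecting like this. The mechanism assumed here is the standard ESMFMKFILE hook (ESMF builds are located via their esmf.mk file); the path is illustrative, and on cheyenne this is normally handled by the module settings in config_machines.xml rather than by hand.
# expose the esmf.mk of the desired install, then rebuild the case
export ESMFMKFILE=/path/to/esmf-8.3.0b07-ncdfio-mpt-O/lib/esmf.mk
./case.build --clean-all
./case.build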
Should we reach out to the ESMF team to ask if there are any expected answer changes? If I recall, your test failures were all either cam-se or mpas -- all unstructured grids. I recall Bob Oehmke maybe saying that a fix was made to the mapping algorithm for unstructured grids a while back, and to switch to a more recent library. Or maybe it was something else ...
@adamrher Yes, I am working with the ESMF team.
To document this here, answer changes were seen in the following CAM regression tests:
The ESMF team is working to identify the cause of the answer changes.
@jedwards4b - I have a test which is also flat-out failing. I thought it might be a cheyenne hiccup, but it keeps failing in the exact same way. The bottom of the cesm log file is:
The latest job can be seen at:
Note that the ONLY changes are the ccs_config and pio external updates, to ccs_config_cesm0.0.45 and pio2_5_9. @fischer-ncar - have you encountered this as well?
Nope, I haven't seen this error. I'll try to reproduce your error with my latest alpha10a sandbox.
@cacraigucar I'm pretty sure that the problem here is your pelayout of 384x3, since 384 is not an even multiple of
The 384x3 pelayout places 12 MPI tasks on each cheyenne node. This evenly spreads across 32 nodes.
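For reference, the arithmetic behind that claim, as a rough sketch; it assumes cheyenne's 36 cores per node, and the xmlchange/xmlquery commands are the usual CIME case tools, shown only for illustration.
#   384 MPI tasks x 3 OpenMP threads = 1152 cores
#   1152 cores / 36 cores per node   = 32 nodes
#   384 tasks  / 32 nodes            = 12 MPI tasks per node (12 x 3 threads fills a node)
./xmlchange NTASKS=384,NTHRDS=3
./xmlquery NTASKS,NTHRDS    # confirm the layout before case.setup/case.build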
I was able to get this test to pass using the current alpha10a sandbox. This is also using 384x3. Compared to what you're using, the alpha10a sandbox has updates to cdeps, cmeps, cice6, ctsm, cime, cpl7, and share.
I also had no problem running this test with the original pe-layout.
I checked out a fresh copy of the branch as it is currently stored (to make sure it wasn't corrupted somehow) and ran create_test on it. I still get the same results, so there must be something different between @jedwards4b's setup and mine. My code base is at:
The failed test is at:
It is also worth reiterating that I am only changing the pio and ccs_config externals. This test worked fine in all previous CAM tags.
I tried again and it passed again. I see this difference in our cases:
Looking at the git log confirms that you are testing an older version of CAM.
When I went to SRCROOT, I got the following:
cheyenne3$ git diff cam6_3_078 | less
[parallelio]
Also, manage_externals/checkout_externals --status indicated that it was all clean. Which git log is the one saying I'm using an older version of CAM? (i.e. what directory were you in when you executed the command?)
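For reference, a quick sketch of commands that would settle which CAM tag a sandbox is actually on; run from SRCROOT, these are all standard git / checkout_externals usage, with the cam6_3_078 tag taken from the diff above.
git describe --tags                              # nearest tag, e.g. cam6_3_0xx-<n>-g<hash>
git log --oneline -1                             # the exact commit checked out
git diff --stat cam6_3_078                       # what differs from the cam6_3_078 tag
./manage_externals/checkout_externals --status   # whether externals are clean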
The answer changes that we are seeing with these changes are due to ESMF. Here is Mariana's explanation, from a separate email exchange, of what is causing the differences:
Based on this information, @adamrher, @cacraigucar, and Robert Oehmke have all signed off on the differences.
What happened?
We are seeing intermittent failures of the compset FHIST at resolution ne0np4.NATL.ne30x8_t13.
I tried twice at NTASKS=3600: one run failed on startup and one ran successfully.
Isla tried at NTASKS=3600 and had two successful runs and one failure.
Isla tried at NTASKS=5400 and had a similar failure. I tried at that task count and had a successful run.
All this to say that I suspect there may be a race-condition-type problem here, and it seems that this compset should be tested more.
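A rough sketch of one way to quantify how intermittent the startup failure is: clone the case several times and submit each copy. create_clone and its --case/--clone/--keepexe arguments are standard CIME tools; the loop count and case names are illustrative.
# run from the same cime scripts directory used for create_newcase below
for i in 1 2 3; do
    ./create_clone --case ${CASENAME}.try${i} --clone ${CASENAME} --keepexe
    (cd ${CASENAME}.try${i} && ./case.setup && ./case.submit)
done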
What are the steps to reproduce the bug?
./create_newcase --compset FHIST --res ne0np4.NATL.ne30x8_t13 --case $CASENAME --mach cheyenne --run-unsupported
cd $CASENAME
./xmlchange NTASKS=3600
./case.setup
./case.build
./case.submit
(maybe that'll work, maybe it won't)
What CAM tag were you using?
cam6_3_052 (cesm2_3_beta08)
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
/glade/scratch/jedwards/testRR_jul2022.001
Will you be addressing this bug yourself?
No
Extra info
No response