irreproducible results in variable resolution #631
Comments
This is not a supported grid (that I know of). What is this grid?
It's not a supported grid. It's an experimental grid / cutting-edge science.
Do you have a log file (atm) from one of the runs we can look at? (Are there no changes to the namelist?) Thanks
/glade/scratch/jedwards/testRR_jul2022.001/run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311
Unsupported variable-resolution setups are not stable out of the box. You can see that if you search for "dt" in the atm.log file, where theoretical estimates for stable time-steps are given. Hence we need to set the se_*split variables.
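For reference, a rough sketch of what acting on that advice might look like, run from the case directory. It uses only the se_rsplit variable discussed later in this thread; the path and the value are purely illustrative, and the grid developers below ultimately recommend leaving the time-stepping alone for this grid.
# check the stable time-step estimates CAM-SE prints at startup
grep -i dt /glade/scratch/jedwards/testRR_jul2022.001/run/atm.log.*
# per the comment above, increasing se_rsplit shortens the dynamics and tracer
# time-steps; set it in user_nl_cam (the value 4 is illustrative only)
cat >> user_nl_cam << 'EOF'
 se_rsplit = 4
EOF
./case.submit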
I would expect that in this case it would fail every time. But it doesn't. In that same directory you will see a run
@PeterHjortLauritzen I don't think this is a stability issue because for some tasks it runs (in fact, Isla has this case running over the last few days). Yes, this is an unsupported grid. But I think this issue needs to be looked at because it may be a system issue that impacts all variable-resolution configurations.
In case it's useful, this is my case which is currently running... casedir: /glade/work/islas/cesm2_3_beta08/runs/f.e23.FAMIPfosi.ne0np4.NATL.ne30x8_t13.001
Oh OK ... (I would still recommend decreasing the dynamics and tracer time-steps by increasing se_rsplit; you are the experts here, but I would expect a model that may be unstable and somehow manages to keep running to do weird things)
se_rsplit is currently set to 3. What would you recommend we go to? I assume decreasing the dynamics time-step is going to make the model run a lot slower? Robb has been running experiments with this grid for a while and I don't think anything too peculiar happened.
(Peter - w/ var-res I try to run w/ the same dt's as in an equivalent global uniform-res run. The atm.log dt metrics are never happy with my approach, but so far this has yielded stable runs for everyone I've advised on var-res time-steps.) Let's not get distracted from the main issue!
OK. Apologies for derailing the detective work ...
So, just to clarify, there is no need to change the se_rsplit? I'm restarting anyway because there was an output issue...
I would not recommend changing se_rsplit, or any of the time-stepping. Robb and I have tested these settings extensively.
Ok, sounds good.
Here is one of my cases that has failed: /glade/work/islas/cesm2_3_beta08/runs/testRR_jul2022.001 although I think this is identical to the one that Jim posted above.
I have seen an issue that might be the same. I've been using the same tag as @islasimpson, but with a different grid (refined tropical belt). The run was crashing on SHR_REPROSUM_CALC just like above. The "solution" seemed to be to start from analytic initial conditions, which allowed the run to get started and to complete my 1-day test. Here is the case directory: /glade/work/brianpm/my_cases/test_cases/c2p3b8.f2000climo.trbelta.001 In the current state, this case is using the analytic IC. This is the same grid that @jtruesdal has been testing, and he might have seen the same issue.
@brianpm - was your failure without the analytic initial condition repeatable or intermittent?
I don't know. With analytic initial conditions the run successfully started. With initial conditions derived from regridding with Patrick's VR tools, I was seeing a failure, but I don't know if it was actually repeatable. I saw it on several attempts, as I was trying to work through the case and get it running (with input from @adamrher).
I'm fairly confident that Brian's issue with the TRBELT grid is repeatable. I think it was an unstable initial condition that was resolved by running w/ analytic initial conditions. So my guess is it is not related to this issue, which is characterized by intermittent failures for an identical set of settings. @jtruesdal mentioned that he may have gotten intermittent errors with the TRBELT grid, though. But so far only Isla's NATL grid can reproduce this result. @patcal suggested he may have had a similar issue with various var-res configurations.
My TRBELT case is /glade/p/cgd/amp/jet/cases/F2000climo.ne0np4.trbelta.ne30x8_g17.intel.1080pes.chey.nuopc.cesm23alpha09d.001.dbg
My errors look to be a bad read of the initial conditions file. Right after calling read_inidat and doing a boundary exchange, the prognostic fields contain some bad values. For SE the min/max of the initial state is printed and shows the bad values. This is from the atm.log:
STATE DIAGNOSTICS
U -0.932687431112143+170 0.125020681873722E+03
I have debug print in the cesm.log file showing the locations of the bad values.
@jtruesdal this looks an awful lot like the errors @renerwijn was getting in our new dual-polar var-res grid. The first printout of these stats, during the initialization phase at nstep=0:
When I first saw this I was like, what could possibly be causing the state to go berserk? These don't resemble the values in the ncdata file. However, Rene can correct me, but the ncdata file turned out to be the problem ... or at least, it motivated us to run the US standard atmosphere analytic inic for 4 weeks and spit out a new cam.i file. That cam.i file ended up being stable and not giving the egregious winds at nstep=0. So that anecdote makes me wonder whether it's just an unstable inic that yields this crazy state at nstep=0? The dycore had to have done something to the state at this point, because the ncdata file is on the dynamics grid, right? Is it doing more than just reading in the data at nstep=0?
@adamrher This is printed out after the initial file is read and before dynamics runs. There is some initialization of derived quantities and mucking with edge buffers, but I don't think the state is modified before the print. I will try the analytic init as suggested and create a new initial condition. I guess there could still be some corruption or incompatibility in the NetCDF initial file I'm using.
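For reference, a rough sketch of that recipe (start from an analytic state, then write out a fresh cam.i file), assuming the standard CAM namelist hooks analytic_ic_type and inithist; the 'us_standard_atmosphere' value, the 4-week run length, and the xmlchange settings are illustrative and not verified against this tag.
cat >> user_nl_cam << 'EOF'
 analytic_ic_type = 'us_standard_atmosphere'  ! start from an analytic state instead of ncdata
 inithist         = 'ENDOFRUN'                ! write a cam.i initial file at the end of the run
EOF
./xmlchange STOP_OPTION=ndays,STOP_N=28
./case.submit
# the cam.i file written at the end can then be used as ncdata for subsequent runs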
@adamrher The analytic IC worked, as did a restart from that run. Unfortunately, using the IC produced by the analytic run exhibited the same behavior as before: sometimes working, but most of the time reading an assortment of bad values. The failures show up under the STATE DIAGNOSTICS print in the atm log file and are garbage values, not NaNs or INFs. Jim's test also has a bad state. The fields are read via infld, and the errors seem to be confined to the 3d fields. The garbage values are intermingled with reads of good values on numerous processors. Maybe @gold2718 was right when thinking that the variable-resolution data is exposing an issue in infld.
Indeed, from Jim's intermittent failure run at the top of this thread:
So we have three separate var-res configurations that are able to reproduce this error. At least we're converging on the issue...
I'll test my case this afternoon and report back.
@adamrher - Also, if you can give us the details on the three tests you want to add to the regression tests, we can work on including them. We can work out the details offline if needed.
@cacraigucar regarding the three tests, these seem reasonable to me:
I would set the walltime to 30 minutes, to make sure it will still run when we double our vertical resolution in FHIST runs to L58. I defer to the SEs on whether ERP is the best test (if we only get to choose one) ... I'm just more familiar with it. The only var-res tests we have now are FW (WACCM) tests, so I think it will be good to have these less complex, but arguably more common, compsets working for all three grids.
I will note that currently the CONUS grid will not run out of the box w/ FHIST because at least one emission file does not have year 1979 data in it (I suspect this is because the ACOM folks like to run CONUS with short nudged runs in a more recent year, and didn't bother to make the emissions work for 1979). Note that for the Arctic and Greenland var-res grids, the emissions files are not on the native grids, which means they are interpolated on the fly from (probably) f09 files. ACOM likes to have emissions on the native grids for hi-res (I'm less picky). So I think to resolve this issue we should just ask ACOM to extend their CONUS emissions files to include year 1979.
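For concreteness, a rough sketch of what one of these tests could look like for the NATL grid. The ERP type and 30-minute walltime follow the comment above, but the step count, the test root, and the grid alias are illustrative: CIME test names use dots as field separators, so the long ne0np4.NATL.ne30x8_t13 name would need a dot-free alias, shown here as a placeholder.
cd cime/scripts
# <natl_alias> stands in for whatever dot-free alias is defined for ne0np4.NATL.ne30x8_t13
./create_test ERP_Ln9.<natl_alias>.FHIST.cheyenne_intel --walltime 00:30:00 \
    --test-root /glade/scratch/$USER/cam_var_res_tests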
[People are asking for me to explicitly state the pio and ccs_config versions needed for this fix: update externals to pio2_5_9 and the current head of ccs_config (ccs_config_cesm0.0.44, I believe).]
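For reference, a rough sketch of how those externals might be updated in a CAM checkout, assuming the usual Externals.cfg layout; the [parallelio] section name appears in a diff later in this thread, while [ccs_config] is an assumption.
# edit Externals.cfg so the two sections point at the tags named above:
#   [parallelio]  ->  tag = pio2_5_9
#   [ccs_config]  ->  tag = ccs_config_cesm0.0.44
# then refresh the checkouts and confirm everything is clean
./manage_externals/checkout_externals
./manage_externals/checkout_externals --status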
The TRBELT tests worked. I updated pio and ccs_config manually and have finished a few runs to completion. I also verified a restart run using the global integrals from the log. Everything completes and matches. This looks good from my end.
Regression tests on cheyenne indicate baseline answer changes (which is not expected). @jedwards4b has the following summary: "I can confirm that there is an answer change when I use the new tags. I'm still looking for something in between."
Updating to esmf-8.3.0-ncdfio-mpt-O also causes an answer change.
Updating to esmf-8.3.0b13-ncdfio-mpt-O also fails baseline compare.
Using esmf-8.3.0b07 passes baseline (also using pio2.5.9).
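For context, a rough sketch of how a case might be pointed at a different ESMF installation while bisecting like this. The mechanism assumed here is the standard ESMFMKFILE hook (ESMF builds are located via their esmf.mk file); the path is illustrative, and on cheyenne this is normally handled by the module settings in config_machines.xml rather than by hand.
# expose the esmf.mk of the desired install, then rebuild the case
export ESMFMKFILE=/path/to/esmf-8.3.0b07-ncdfio-mpt-O/lib/esmf.mk
./case.build --clean-all
./case.build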
Should we reach out to the ESMF team to ask if there are any expected answer changes? If I recall, your test failures were all either cam-se or mpas -- all unstructured grids. I recall Bob Oehmke maybe saying that a fix was made to the mapping algorithm for unstructured grids a while back, and to switch to a more recent library. Or maybe it was something else ...
@adamrher Yes, I am working with the ESMF team.
To document this here, answer changes were seen in the following CAM regression tests:
The ESMF team is working to identify the cause of the answer changes.
@jedwards4b - I have a test which is also flat-out failing. I thought it might be a cheyenne hiccup, but it keeps failing in the exact same way. The bottom of the cesm log file is:
The latest job can be seen at:
Note that the ONLY changes are the ccs_config and pio external updates, to ccs_config_cesm0.0.45 and pio2_5_9. @fischer-ncar - have you encountered this as well?
Nope, I haven't seen this error. I'll try to reproduce your error with my latest alpha10a sandbox.
@cacraigucar I'm pretty sure that the problem here is your pelayout of 384x3, since 384 is not an even multiple of
The 384x3 pelayout places 12 MPI tasks on each cheyenne node. This evenly spreads across 32 nodes.
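For reference, the arithmetic behind that claim, as a rough sketch; it assumes cheyenne's 36 cores per node, and the xmlchange/xmlquery commands are the usual CIME case tools, shown only for illustration.
#   384 MPI tasks x 3 OpenMP threads = 1152 cores
#   1152 cores / 36 cores per node   = 32 nodes
#   384 tasks  / 32 nodes            = 12 MPI tasks per node (12 x 3 threads fills a node)
./xmlchange NTASKS=384,NTHRDS=3
./xmlquery NTASKS,NTHRDS    # confirm the layout before case.setup/case.build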
I was able to get this test to pass using the current alpha10a sandbox. This is also using 384x3. Compared to what you're using, the alpha10a sandbox has updates to cdeps, cmeps, cice6, ctsm, cime, cpl7, and share.
I also had no problem running this test with the original pe-layout.
I checked out a fresh copy of the branch as it is currently stored (to make sure it wasn't corrupted somehow) and ran create_test on it. I still get the same results, so there must be something different between @jedwards4b's setup and mine. My code base is at:
The failed test is at:
It is also worth reiterating that I am only changing the pio and ccs_config externals. This test worked fine in all previous CAM tags.
I tried again and it passed again. I see this difference in our cases:
Looking at the git log confirms that you are testing an older version of CAM.
When I went to SRCROOT, I got the following:
cheyenne3$ git diff cam6_3_078 | less
[parallelio]
Also, manage_externals/checkout_externals --status indicated that it was all clean. Which git log is the one saying I'm using an older version of CAM? (i.e. what directory were you in when you executed the command?)
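For reference, a quick sketch of commands that would settle which CAM tag a sandbox is actually on; run from SRCROOT, these are all standard git / checkout_externals usage, with the cam6_3_078 tag taken from the diff above.
git describe --tags                              # nearest tag, e.g. cam6_3_0xx-<n>-g<hash>
git log --oneline -1                             # the exact commit checked out
git diff --stat cam6_3_078                       # what differs from the cam6_3_078 tag
./manage_externals/checkout_externals --status   # whether externals are clean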
The answer changes that we are seeing with these changes are due to ESMF. Here is Mariana's explanation, from a separate email exchange, of what is causing the differences:
Based on this information, @adamrher, @cacraigucar, and Robert Oehmke have all signed off on the differences.
What happened?
We are seeing intermittent failures of the compset FHIST at resolution ne0np4.NATL.ne30x8_t13.
I tried twice at NTASKS=3600: one run failed on startup and one ran successfully.
Isla tried at NTASKS=3600 and had two successful runs and one failure.
Isla tried at NTASKS=5400 and had a similar failure. I tried at that task count and had a successful run.
All this to say that I suspect there may be a race-condition-type problem here, and it seems that this compset should be tested more.
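A rough sketch of one way to quantify how intermittent the startup failure is: clone the case several times and submit each copy. create_clone and its --case/--clone/--keepexe arguments are standard CIME tools; the loop count and case names are illustrative.
# run from the same cime scripts directory used for create_newcase below
for i in 1 2 3; do
    ./create_clone --case ${CASENAME}.try${i} --clone ${CASENAME} --keepexe
    (cd ${CASENAME}.try${i} && ./case.setup && ./case.submit)
done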
What are the steps to reproduce the bug?
./create_newcase --compset FHIST --res ne0np4.NATL.ne30x8_t13 --case $CASENAME --mach cheyenne --run-unsupported
cd $CASENAME
./xmlchange NTASKS=3600
./case.setup
./case.build
./case.submit
(maybe that'll work, maybe it won't)
What CAM tag were you using?
cam6_3_052 (cesm2_3_beta08)
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
/glade/scratch/jedwards/testRR_jul2022.001
Will you be addressing this bug yourself?
No
Extra info
No response