Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.derecho_gnu.clm-coldStart test failure in ctsm5.2 branch #2366

Closed
1 task done
slevis-lmwg opened this issue Feb 15, 2024 · 5 comments · Fixed by #2371
Closed
1 task done
Assignees
Labels
bug something is working incorrectly

Comments

@slevis-lmwg
Copy link
Contributor

slevis-lmwg commented Feb 15, 2024

Brief summary of bug

The aforementioned test fails in the ctsm5.2 branch when pointing the aux_clm test-suite to the newly generated fsurdat files:

ERROR in CompareBulkToTracer: tracer does not agree with bulk water
Called from: after downscale_forcings
Variable: ice1_grc
First difference at index: 2
Bulk  :  -0.37548989083934737-321
Tracer:  -0.37598395648518862-320
ratio:   0.10000000000000000E+02
Bulk*ratio:  -0.37548989083934737-320

At first glance, this seems like a repeat of #2041. In particular, it's the ctsm5.2 branch pointing to a new fsurdat file and it's the same error message but for variable ice1_grc this time. Further investigation has revealed a different problem this time. Details below.

General bug information

CTSM version you are using:
alpha-ctsm5.2.mksrf.19_ctsm5.1.dev163-55-g1f4611fb7

Does this bug cause significantly incorrect results in the model's science?
I hope not "significantly" though see more info in the details below.

Configurations affected:
I think all configurations.

Details of bug

The test passes with a one-line change. Note that the change follows liqcan as a template by including the max function in snocan instead of omitting it:

@@ -1605,7 +1605,8 @@ bioms:   do f = 1, fn
          if (t_veg(p) > tfrz ) then ! above freezing, update accumulation in liqcan
             if ((qflx_evap_veg(p)-qflx_tran_veg(p))*dtime > liqcan(p)) then ! all liq evap
                ! In this case, all liqcan will evap. Take remainder from snocan
-               snocan(p)=snocan(p)+liqcan(p)+(qflx_tran_veg(p)-qflx_evap_veg(p))*dtime  
+               snocan(p) = max(0._r8, &
+                  snocan(p) + liqcan(p) + (qflx_tran_veg(p) - qflx_evap_veg(p)) * dtime)
             end if
             liqcan(p) = max(0._r8,liqcan(p)+(qflx_tran_veg(p)-qflx_evap_veg(p))*dtime)

I converged on this line of code as the culprit by realizing that the original error message did not just indicate tiny values as in #2041, but tiny negative values. So I pursued all sources of negative ice1_grc. I found that even after the above change, the code continues to generate negative ice1_grc in subroutine ComputeLiqIceMassNonLake after subtracting dynbal_baseline_ice(c). By the time the LWISO test completes, I see ice1_grc values < -800 kg/m2.

So this issue has two parts:

  • Get the test passing with the above fix

Discussion below suggests that other negative ice1_grc (as described above) may be ok, so this issue will close when #2371 gets merged.

@slevis-lmwg slevis-lmwg self-assigned this Feb 15, 2024
@slevis-lmwg slevis-lmwg added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Feb 15, 2024
@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Feb 15, 2024

I added this to the 5.2 project board , and I see the following options:

  1. As with issue LWISO_Ld10.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_gnu.clm-coldStart FAIL in ctsm5.2 #2041 and corresponding PR Change small snocan to zero #2053, I resolve this and merge it with ctsm5.1 before making ctsm5.2, because the correction is answer-changing, while 5.2 only makes dataset-related changes.
  2. We leave the test as EXPECTED FAILURE and resolve in ctsm5.3.

Thoughts/feedback?

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 15, 2024

Ahh, good question @slevis-lmwg. Either way is doable. By the way I made the two options numbered so we can easily reference them by number. From reading the above issue I see that you have a simple one line fix that would be easy to bring in anytime. But, you also suggest that more work might need to be done to fix negative numbers here, and would be something we'd likely want to do in ctsm5.3 as it's a potential longer term thing that requires painful debugging and impossible to give a timeline on.

However, does the simple one line fix get the case working for CTSM5.2? Don't check the change in, just see if it allows the model to work. If so I'd say do number 1. If it doesn't then we should plan on number 2 and it being an expected fail and work on it later.

@ekluzek ekluzek removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Feb 15, 2024
@ekluzek
Copy link
Collaborator

ekluzek commented Feb 15, 2024

In meeting we decided that @slevis-lmwg will do a simple answer changing tag to master that gets this working. Then update the ctsm5.2 branch to that, and show that this test works, so ctsm5.2 can come in without this being an expected fail.

@slevis-lmwg
Copy link
Contributor Author

In the meeting we also decided to consider moving the liqcan(p) = line up to prevent code duplication.

Also I will bring #2355 into main with this answer-changing tag.

@billsacks
Copy link
Member

Thanks a lot for your work on this @slevis-lmwg and for tracking down the issue!

I think your solution – applying max(0, ...) to the snocan update – seems reasonable for now. I'm nervous that both this and the pre-existing liqcan update could just as easily generate tiny positive values that should actually be 0, which will lead to similar numerical issues. So a more robust solution would be to change both the liqcan and snocan updates to use truncate_small_values. But I'm okay kicking that can down the road if it isn't causing problems for now.

Regarding the negative ice1 values after subtracting dynbal_baseline_ice: I would need to spend some time getting my head back into this to say for certain, but my initial reaction is that this might be okay. There are some arbitrary state variables in the model, such as the amount of ice in glacier columns and the amount of water in lakes, and these baselines adjust for these to help avoid unnecessarily large adjustment fluxes when we have dynamic landunits. (This is partly explained in the ChangeLog entry for ctsm1.0.dev031.) I think that a negative ice1 value then just means that there is less ice now than there was initially, which would be okay. That said, the negative value you give does seem awfully large, so this may be a true issue.... But even so, I'd say that there's a reasonable chance that you'll find that this is working as intended, so it probably isn't worth the time to dig into right now.

slevis-lmwg added a commit to slevis-lmwg/ctsm that referenced this issue Feb 15, 2024
slevis-lmwg added a commit that referenced this issue Feb 16, 2024
Do not allow negative snocan in a part of the code that triggers the error reported in issue #2366.
slevis-lmwg added a commit that referenced this issue Feb 17, 2024
Remove a case of negative snocan in CanopyFluxesMod to resolve #2366

One-line change that now includes "max(0._r8," to prevent negative
snocan in that part of the code.
wwieder added a commit to olyson/ctsm that referenced this issue Feb 21, 2024
Remove a case of negative snocan in CanopyFluxesMod to resolve ESCOMP#2366

One-line change that now includes "max(0._r8," to prevent negative
snocan in that part of the code.
samsrabin added a commit to samsrabin/CTSM that referenced this issue Feb 21, 2024
Remove a case of negative snocan in CanopyFluxesMod to resolve ESCOMP#2366

One-line change that now includes "max(0._r8," to prevent negative
snocan in that part of the code.
samsrabin added a commit to samsrabin/CTSM that referenced this issue Feb 22, 2024
Remove a case of negative snocan in CanopyFluxesMod to resolve ESCOMP#2366

One-line change that now includes "max(0._r8," to prevent negative
snocan in that part of the code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants