Add decomp test case to the drying slope test group #695

Merged: 1 commit into MPAS-Dev:main on Oct 16, 2023

Conversation

@cbegeman (Collaborator) commented on Sep 5, 2023

This PR adds a decomp test case to the drying slope test group.

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite

This test case helps evaluate #686.

@cbegeman self-assigned this on Sep 5, 2023
@cbegeman requested review from @xylar and @gcapodag on Sep 5, 2023
@cbegeman (Collaborator, Author) commented on Sep 5, 2023

Testing

I ran both ocean/drying_slope/1km/sigma/decomp and ocean/drying_slope/1km/single_layer/decomp on chrysalis with intel and openmpi. I also ran the wetdry suite, which now includes these cases as well. They fail as expected due to the issue pointed out by @gcapodag in E3SM-Project/E3SM#5902.

@cbegeman (Collaborator, Author) commented on Sep 5, 2023

@gcapodag This is the branch I used to replicate your issue. I figured I'd add it to compass. It would be great if you had a chance to check it out and run it (maybe both with master and with your fix).

@cbegeman (Collaborator, Author) commented on Sep 5, 2023

@xylar I realize that a threads test will also be handy. I plan to add this with other extensions of the drying_slope case.

@cbegeman force-pushed the ocn-add-drying-slope-decomp-test branch from cfa846c to e0ba5f0 on Sep 5, 2023
@gcapodag (Collaborator) commented on Sep 6, 2023

Thank you @cbegeman. I tested on Perlmutter with gnu. I ran ocean/drying_slope/1km/sigma/decomp and ocean/drying_slope/1km/single_layer/decomp with master and the tests fail. With the fix I came up with in E3SM-Project/E3SM#5902, they both pass. It is sufficient to just add this halo update call

call mpas_dmpar_field_halo_exch(domain, 'layerThicknessEdgeFlux')

right before this code in RK4:

        block => domain % blocklist
        do while (associated(block))
           call ocn_time_integrator_rk4_compute_vel_tends(domain, block, dt, rk_substep_weights(rk_step), domain % dminfo, err )

           call ocn_time_integrator_rk4_compute_thick_tends( block, dt, rk_substep_weights(rk_step), err )
           block => block % next
        end do

I wonder if the reason why we need this update should be investigated further. As I mentioned to @xylar, the layerThickEdgeFlux variable is incorrect at the beginning of the RK4 routine, when that diagnostic variable is computed with the old solution, so it does not seem to be an issue with RK4 itself. I understand that with split_explicit, problems that might be related to this have not shown up, but we should probably still look into the computation of layerThickEdgeFlux, at least to understand why problems appear in the present case but not with split_explicit.
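
For clarity, here is the proposed change sketched in context; both pieces are the snippets quoted above, and the placement follows the description in this comment (not independently verified here):

        ! Sketch of the proposed fix: update the layerThicknessEdgeFlux halos
        ! right before the RK4 tendency loop over blocks.
        call mpas_dmpar_field_halo_exch(domain, 'layerThicknessEdgeFlux')

        block => domain % blocklist
        do while (associated(block))
           call ocn_time_integrator_rk4_compute_vel_tends(domain, block, dt, &
                    rk_substep_weights(rk_step), domain % dminfo, err)
           call ocn_time_integrator_rk4_compute_thick_tends(block, dt, &
                    rk_substep_weights(rk_step), err)
           block => block % next
        end do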

@gcapodag (Collaborator) commented on Sep 6, 2023

Looking at the computation of layerThickEdgeFlux right now, considering the upwind case, which seems to be what is currently set in the namelist for the tests mentioned above:

         do iEdge = 1, nEdgesAll
            kmin = minLevelEdgeBot(iEdge)
            kmax = maxLevelEdgeTop(iEdge)
            cell1 = cellsOnEdge(1,iEdge)
            cell2 = cellsOnEdge(2,iEdge)
            do k=1,nVertLevels
               ! initialize layerThicknessEdgeFlux to avoid divide by
               ! zero and NaN problems.
               layerThickEdgeFlux(k,iEdge) = -1.0e34_RKIND
            end do
            do k = kmin,kmax
               if (normalVelocity(k,iEdge) > 0.0_RKIND) then
                  layerThickEdgeFlux(k,iEdge) = layerThickness(k,cell1)
               elseif (normalVelocity(k,iEdge) < 0.0_RKIND) then
                  layerThickEdgeFlux(k,iEdge) = layerThickness(k,cell2)
               else
                  layerThickEdgeFlux(k,iEdge) = &
                                      max(layerThickness(k,cell1), &
                                          layerThickness(k,cell2))
               end if
            end do
         end do

I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?

@cbegeman (Collaborator, Author) commented on Sep 6, 2023

@gcapodag Thanks for doing that testing. I'm happy to take a look at the RK4 routine to try to understand why the line is needed.

Upwinding with split-explicit does pass a 1 proc vs 12 proc decomp test when applied to the baroclinic channel test case. It cannot be tested with the drying slope test case because I have not yet completed the W&D implementation for split-explicit.

@cbegeman (Collaborator, Author) commented on Sep 7, 2023

> I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?

@gcapodag With respect to this question, I don't think we need to worry about looping over nEdgesAll, because any cell that is outside the domain will have a maxLevelEdgeTop of 0, so layerThickEdgeFlux will keep its initialized value of -1e34.
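
To illustrate that point (a sketch only, not MPAS source; names follow the upwind loop quoted above):

        ! For an edge adjacent to a cell outside the domain, maxLevelEdgeTop(iEdge)
        ! is 0 (per the comment above), so the active-layer loop never runs and
        ! layerThickEdgeFlux(:,iEdge) keeps its initialized value of -1.0e34.
        kmin = minLevelEdgeBot(iEdge)
        kmax = maxLevelEdgeTop(iEdge)   ! 0 for such an edge
        do k = kmin, kmax               ! zero-trip loop when kmax < kmin
           ! never reached for such an edge
        end do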

@gcapodag (Collaborator) commented on Sep 7, 2023

> > I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?
>
> @gcapodag With respect to this question, I don't think we need to worry about looping over nEdgesAll, because any cell that is outside the domain will have a maxLevelEdgeTop of 0, so layerThickEdgeFlux will keep its initialized value of -1e34.

Hi @cbegeman, by outside the domain do you mean outside the actual global domain, or outside the partition handled by a given process?

@cbegeman (Collaborator, Author) commented on Sep 7, 2023

> Hi @cbegeman, by outside the domain do you mean outside the actual global domain, or outside the partition handled by a given process?

@gcapodag Oh, I misunderstood. I was talking about the whole domain; you're concerned with the partition. I believe I was motivated by trying to define layerThickEdgeFlux so that it would be defined for loops like this one, where config_num_halos is 3 by default:

https://github.com/E3SM-Project/E3SM/blob/82c581e987a397ffd601971b108e788af9a15615/components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_split.F#L802-L817

        !$omp parallel
        !$omp do schedule(runtime) &
        !$omp private(cell1, cell2, k, thicknessSum)
        do iEdge = 1, nEdgesHalo(config_num_halos+1)
           cell1 = cellsOnEdge(1,iEdge)
           cell2 = cellsOnEdge(2,iEdge)
           thicknessSum = layerThickEdgeFlux(minLevelEdgeBot(iEdge),iEdge)
           do k = minLevelEdgeBot(iEdge)+1, maxLevelEdgeTop(iEdge)
              thicknessSum = thicknessSum + &
                             layerThickEdgeFlux(k,iEdge)
           enddo
           bottomDepthEdge(iEdge) = thicknessSum &
              - 0.5_RKIND*(sshNew(cell1) + sshNew(cell2))
        enddo ! iEdge
        !$omp end do
        !$omp end parallel

I suppose you could try replacing nEdgesAll with nEdgesHalo(config_num_halos+1) in the loop where we compute layerThickEdgeFlux and see if the solution is identical.
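
A minimal sketch of that experiment, assuming the same loop structure as the upwind computation quoted earlier (only the loop bound changes; untested):

        ! Sketch: restrict the layerThickEdgeFlux computation to owned + halo
        ! edges instead of all edges visible to the process (untested suggestion).
        do iEdge = 1, nEdgesHalo(config_num_halos+1)   ! previously: 1, nEdgesAll
           ! ... same body as the upwind computation quoted above ...
        end do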

Am I getting warmer in terms of addressing your concerns? I'm not the most knowledgeable about halos in MPAS-O.

@xylar (Collaborator) left a review

Sorry I lost track of the review request here. I haven't run the test myself but the code looks great! I'm happy to have you merge when you're ready.

@xylar added the enhancement and ocean labels on Oct 4, 2023
@cbegeman (Collaborator, Author) commented on Oct 4, 2023

@xylar Thank you for reviewing! Would you rather I merge with the 1 vs. 12 proc comparison (as written here), which will fail because we don't yet have @gcapodag's bugfix, or replace it with the 4 vs. 8 proc comparison, which will pass?

@xylar (Collaborator) commented on Oct 4, 2023

I think it's better if it fails, to remind us to merge the fix. I guess we could wait and merge the test after the fix is in, but I think it's fine if the test fails as long as it's not in the PR suite.

@cbegeman (Collaborator, Author) commented on Oct 4, 2023

@gcapodag Did you get a chance to check out this branch and try running it? If so, could you approve the PR? You're welcome to provide other feedback if you have any. If you don't have time to review, let me know.

@gcapodag (Collaborator) commented on Oct 5, 2023

Hi @xylar and @cbegeman , I'll do it on Monday if that's OK. Thanks!

@cbegeman (Collaborator, Author) commented on Oct 5, 2023

@gcapodag That works for me. Thanks!

@gcapodag (Collaborator) left a review

Hi @cbegeman, I tested again on Perlmutter and I think we can merge. I took some time to look further into why the tests are failing in the current setup, and found out that the problem is with layerThickness and not with layerThickEdgeFlux. There is a freaky situation happening in which, for some not yet identified reason, layerThickness is no longer BFB at some point during the RK4 time stepping, and that affects the computation of layerThickEdgeFlux. If I add a halo update in RK4 right before the final computation of the diagnostic variables, the BFB issue is gone (so we do not need a halo update on layerThickEdgeFlux).

The interesting/weird/troubling thing is that the issue with layerThickEdgeFlux is actually also present with 8 procs, for instance, but due to the process configuration, I suppose, the "bad" value is never actually used in the computation of the hadv_coriolis tendency contribution. This is a print from inside the loop over edges on edge (eoe) in hadv_coriolis for 8 procs (focusing on the eoe such that indexToEdgeID(eoe)==14):

 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18

The first number is layerThickEdgeFlux(k,eoe) and the other two are the two cells, in local numbering, that share that edge. Note that the edge eoe that is picked up is always the same, and it happens to be the one with the correct value of layerThickEdgeFlux. For 10 procs, the loop at one point picks up the edge with the wrong (i.e. not halo-updated) value:

 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    **10.045894683899487**               31          45
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14

So now, there are two options in my opinion:

1. Add a halo update on the layer thickness before the final computation of the diagnostics in RK4.
2. Spend time to understand why the layer thickness is not BFB in the first place.

Obviously, one route is less painful than the other.
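
For reference, option 1 as a minimal sketch; the call form follows the halo exchange quoted earlier in this thread, and the exact placement before the final computation of the diagnostics in RK4 is an assumption:

        ! Option 1, sketched: update the layerThickness halos right before the
        ! final computation of the diagnostic variables in RK4 (placement assumed).
        call mpas_dmpar_field_halo_exch(domain, 'layerThickness')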

@cbegeman (Collaborator, Author) commented

@gcapodag Thanks for your testing and this additional context! I'm moving away from using RK4 configurations in my projects, and since RK4 is also not used in E3SM configurations, this may really be a decision for ICoM. I think that, given ICoM's priorities, the former option would be sufficient. However, given the shifting priorities, you could raise this with Rob Hetland and see what he thinks.

@cbegeman merged commit ddbe352 into MPAS-Dev:main on Oct 16, 2023