Add decomp test case to the drying slope test group #695

Merged: 1 commit into MPAS-Dev:main on Oct 16, 2023

Conversation

@cbegeman (Collaborator) commented on Sep 5, 2023

This PR adds a decomp test case to the drying slope test group.

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite

This test case helps evaluate #686.

@cbegeman self-assigned this on Sep 5, 2023
@cbegeman requested review from @xylar and @gcapodag on Sep 5, 2023
@cbegeman (Collaborator, Author) commented on Sep 5, 2023

Testing

I ran both ocean/drying_slope/1km/sigma/decomp and ocean/drying_slope/1km/single_layer/decomp on chrysalis with intel and openmpi. I also ran the wetdry suite, which now includes these cases as well. They fail as expected due to the issue pointed out by @gcapodag in E3SM-Project/E3SM#5902.

@cbegeman (Collaborator, Author) commented on Sep 5, 2023

@gcapodag This is the branch I used to replicate your issue. I figured I'd add it to compass. It would be great if you had a chance to check it out and run it (maybe both with master and with your fix).

@cbegeman (Collaborator, Author) commented on Sep 5, 2023

@xylar I realize that a threads test will also be handy. I plan to add this with other extensions of the drying_slope case.

@cbegeman force-pushed the ocn-add-drying-slope-decomp-test branch from cfa846c to e0ba5f0 on Sep 5, 2023
@gcapodag (Collaborator) commented on Sep 6, 2023

Thank you @cbegeman. I tested on Perlmutter with gnu. I ran ocean/drying_slope/1km/sigma/decomp and ocean/drying_slope/1km/single_layer/decomp with master and the tests fail. With the fix I came up with in E3SM-Project/E3SM#5902, they both pass. It is sufficient to just add this halo update call

call mpas_dmpar_field_halo_exch(domain, 'layerThicknessEdgeFlux')

right before this code in RK4:

        block => domain % blocklist
        do while (associated(block))
           call ocn_time_integrator_rk4_compute_vel_tends(domain, block, dt, rk_substep_weights(rk_step), domain % dminfo, err )

           call ocn_time_integrator_rk4_compute_thick_tends( block, dt, rk_substep_weights(rk_step), err )
           block => block % next
        end do

I wonder if the reason why we need this update should be investigated further. As I mentioned to @xylar, the layerThickEdgeFlux variable is incorrect at the beginning of the RK4 routine, when that diagnostic variable is computed with the old solution, so it does not seem to be an issue with RK4 itself. I understand that with split_explicit, problems that might be related to this have not shown up, but we should probably still look into the computation of layerThickEdgeFlux, at least to understand why problems appear in the present case but not with split_explicit.
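
For clarity, here is the proposed change sketched in context; both pieces are the snippets quoted above, and the placement follows the description in this comment (not independently verified here):

        ! Sketch of the proposed fix: update the layerThicknessEdgeFlux halos
        ! right before the RK4 tendency loop over blocks.
        call mpas_dmpar_field_halo_exch(domain, 'layerThicknessEdgeFlux')

        block => domain % blocklist
        do while (associated(block))
           call ocn_time_integrator_rk4_compute_vel_tends(domain, block, dt, &
                    rk_substep_weights(rk_step), domain % dminfo, err)
           call ocn_time_integrator_rk4_compute_thick_tends(block, dt, &
                    rk_substep_weights(rk_step), err)
           block => block % next
        end do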

@gcapodag (Collaborator) commented on Sep 6, 2023

Looking at the computation of layerThickEdgeFlux right now, considering the upwind case, which seems to be what is currently set in the namelist for the tests mentioned above:

         do iEdge = 1, nEdgesAll
            kmin = minLevelEdgeBot(iEdge)
            kmax = maxLevelEdgeTop(iEdge)
            cell1 = cellsOnEdge(1,iEdge)
            cell2 = cellsOnEdge(2,iEdge)
            do k=1,nVertLevels
               ! initialize layerThicknessEdgeFlux to avoid divide by
               ! zero and NaN problems.
               layerThickEdgeFlux(k,iEdge) = -1.0e34_RKIND
            end do
            do k = kmin,kmax
               if (normalVelocity(k,iEdge) > 0.0_RKIND) then
                  layerThickEdgeFlux(k,iEdge) = layerThickness(k,cell1)
               elseif (normalVelocity(k,iEdge) < 0.0_RKIND) then
                  layerThickEdgeFlux(k,iEdge) = layerThickness(k,cell2)
               else
                  layerThickEdgeFlux(k,iEdge) = &
                                      max(layerThickness(k,cell1), &
                                          layerThickness(k,cell2))
               end if
            end do
         end do

I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?

@cbegeman (Collaborator, Author) commented on Sep 6, 2023

@gcapodag Thanks for doing that testing. I'm happy to take a look at the RK4 routine to try to understand why the line is needed.

Upwinding with split-explicit does pass a 1 proc vs 12 proc decomp test when applied to the baroclinic channel test case. It cannot be tested with the drying slope test case because I have not yet completed the W&D implementation for split-explicit.

@cbegeman (Collaborator, Author) commented on Sep 7, 2023

> I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?

@gcapodag With respect to this question, I don't think we need to worry about looping over nEdgesAll, because any cell that is outside the domain will have a maxLevelEdgeTop of 0, so layerThickEdgeFlux will keep its initialized value of -1e34.
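
To illustrate that point (a sketch only, not MPAS source; names follow the upwind loop quoted above):

        ! For an edge adjacent to a cell outside the domain, maxLevelEdgeTop(iEdge)
        ! is 0 (per the comment above), so the active-layer loop never runs and
        ! layerThickEdgeFlux(:,iEdge) keeps its initialized value of -1.0e34.
        kmin = minLevelEdgeBot(iEdge)
        kmax = maxLevelEdgeTop(iEdge)   ! 0 for such an edge
        do k = kmin, kmax               ! zero-trip loop when kmax < kmin
           ! never reached for such an edge
        end do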

@gcapodag (Collaborator) commented on Sep 7, 2023

> > I am a little confused by the above computation: why is a halo update not needed at present? If we loop over nEdgesAll, that includes the outermost edges the process has access to, but then we use cellsOnEdge, so wouldn't that return a cell that lies outside even the halo cells of the process? Even if layerThickness has been updated before this computation, doesn't that update only apply to the owned+halo cells? If we try to access a cell outside even the owned+halo region, would layerThickness be reliable?
>
> @gcapodag With respect to this question, I don't think we need to worry about looping over nEdgesAll, because any cell that is outside the domain will have a maxLevelEdgeTop of 0, so layerThickEdgeFlux will keep its initialized value of -1e34.

Hi @cbegeman, by outside the domain do you mean outside the actual global domain, or outside the partition handled by a given process?

@cbegeman (Collaborator, Author) commented on Sep 7, 2023

> Hi @cbegeman, by outside the domain do you mean outside the actual global domain, or outside the partition handled by a given process?

@gcapodag Oh, I misunderstood. I was talking about the whole domain; you're concerned with the partition. I believe I was motivated by trying to define layerThickEdgeFlux so that it would be defined for loops like this one, where config_num_halos is 3 by default:

https://github.com/E3SM-Project/E3SM/blob/82c581e987a397ffd601971b108e788af9a15615/components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_split.F#L802-L817

        !$omp parallel
        !$omp do schedule(runtime) &
        !$omp private(cell1, cell2, k, thicknessSum)
        do iEdge = 1, nEdgesHalo(config_num_halos+1)
           cell1 = cellsOnEdge(1,iEdge)
           cell2 = cellsOnEdge(2,iEdge)
           thicknessSum = layerThickEdgeFlux(minLevelEdgeBot(iEdge),iEdge)
           do k = minLevelEdgeBot(iEdge)+1, maxLevelEdgeTop(iEdge)
              thicknessSum = thicknessSum + &
                             layerThickEdgeFlux(k,iEdge)
           enddo
           bottomDepthEdge(iEdge) = thicknessSum &
              - 0.5_RKIND*(sshNew(cell1) + sshNew(cell2))
        enddo ! iEdge
        !$omp end do
        !$omp end parallel

I suppose you could try replacing nEdgesAll with nEdgesHalo(config_num_halos+1) in the loop where we compute layerThickEdgeFlux and see if the solution is identical.
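
A minimal sketch of that experiment, assuming the same loop structure as the upwind computation quoted earlier (only the loop bound changes; untested):

        ! Sketch: restrict the layerThickEdgeFlux computation to owned + halo
        ! edges instead of all edges visible to the process (untested suggestion).
        do iEdge = 1, nEdgesHalo(config_num_halos+1)   ! previously: 1, nEdgesAll
           ! ... same body as the upwind computation quoted above ...
        end do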

Am I getting warmer in terms of addressing your concerns? I'm not the most knowledgeable about halos in MPAS-O.

@xylar (Collaborator) left a review

Sorry I lost track of the review request here. I haven't run the test myself but the code looks great! I'm happy to have you merge when you're ready.

@xylar added the enhancement and ocean labels on Oct 4, 2023
@cbegeman (Collaborator, Author) commented on Oct 4, 2023

@xylar Thank you for reviewing! Would you rather I merge with the 1 vs. 12 proc comparison (as written here), which will fail because we don't yet have @gcapodag's bugfix, or replace it with the 4 vs. 8 proc comparison, which will pass?

@xylar (Collaborator) commented on Oct 4, 2023

I think it's better if it fails, to remind us to merge the fix. I guess we could wait and merge the test after the fix is in, but I think it's fine if the test fails as long as it's not in the PR suite.

@cbegeman (Collaborator, Author) commented on Oct 4, 2023

@gcapodag Did you get a chance to check out this branch and try running it? If so, could you approve the PR? You're welcome to provide other feedback if you have any. If you don't have time to review, let me know.

@gcapodag (Collaborator) commented on Oct 5, 2023

Hi @xylar and @cbegeman , I'll do it on Monday if that's OK. Thanks!

@cbegeman (Collaborator, Author) commented on Oct 5, 2023

@gcapodag That works for me. Thanks!

@gcapodag (Collaborator) left a review

Hi @cbegeman, I tested again on Perlmutter and I think we can merge. I took some time to look further into why the tests are failing in the current setup, and found out that the problem is with layerThickness and not with layerThickEdgeFlux. There is a freaky situation happening in which, for some not yet identified reason, layerThickness is no longer BFB at some point during the RK4 time stepping, and that affects the computation of layerThickEdgeFlux. If I add a halo update in RK4 right before the final computation of the diagnostic variables, the BFB issue is gone (so we do not need a halo update on layerThickEdgeFlux).

The interesting/weird/troubling thing is that the issue with layerThickEdgeFlux is actually also present with 8 procs, for instance, but due to the process configuration, I suppose, the "bad" value is never actually used in the computation of the hadv_coriolis tendency contribution. This is a print from inside the loop over edges on edge (eoe) in hadv_coriolis for 8 procs (focusing on the eoe such that indexToEdgeID(eoe)==14):

 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18
 from tendency:    1.3944039866367177               11          18

The first number is layerThickEdgeFlux(k,eoe) and the other two are the two cells, in local numbering, that share that edge. Note that the edge eoe that is picked up is always the same, and it happens to be the one with the correct value of layerThickEdgeFlux. For 10 procs, the loop at one point picks up the edge with the wrong (i.e. not halo-updated) value:

 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    **10.045894683899487**               31          45
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14
 from tendency:    1.3944039866367177                7          14

So now, there are two options in my opinion:

1. Add a halo update on the layer thickness before the final computation of the diagnostics in RK4.
2. Spend time to understand why the layer thickness is not BFB in the first place.

Obviously, one route is less painful than the other.
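
For reference, option 1 as a minimal sketch; the call form follows the halo exchange quoted earlier in this thread, and the exact placement before the final computation of the diagnostics in RK4 is an assumption:

        ! Option 1, sketched: update the layerThickness halos right before the
        ! final computation of the diagnostic variables in RK4 (placement assumed).
        call mpas_dmpar_field_halo_exch(domain, 'layerThickness')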

@cbegeman (Collaborator, Author) commented

@gcapodag Thanks for your testing and this additional context! I'm moving away from using RK4 configurations in my projects, and since RK4 is also not used in E3SM configurations, this may really be a decision for ICoM. I think that, given ICoM's priorities, the former option would be sufficient. However, given the shifting priorities, you could raise this with Rob Hetland and see what he thinks.

@cbegeman merged commit ddbe352 into MPAS-Dev:main on Oct 16, 2023