Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seabed stress - remove if statements #673

Merged
merged 9 commits into from
Feb 23, 2022

Conversation

TillRasmussen
Copy link
Contributor

@TillRasmussen TillRasmussen commented Dec 12, 2021

For detailed information about submitting Pull Requests (PRs) to the CICE-Consortium,
please refer to: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index#information-for-developers

PR checklist

  • Short (1 sentence) summary of your PR:
  • For better optimization. If statements are removed from stress bed calculations, stressa and stepu.
  • Developer(s): @TillRasmussen
  • Suggest PR reviewers from list in the column to the right. @JFLemieux73, @eclare108213, @apcraig, @phil-blain
  • Please copy the PR test results link or provide a summary of testing completed below.
    [x] Bit for bit
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • [x ] No
  • Does this PR add any new test cases?
    • Yes
    • [ x] No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • Yes
    • [ x] No, does the documentation need to be updated at a later time?
      • Yes
      • [x ] No
  • Please provide any additional information or relevant details below:
    Timings show that some test are faster others are slow compared to current main. Results are bit for bit.
    Moved if statements outside of loops within stepu and calculation of seabed stress. the removal within stepu results in taubx and tauby are calculated at all time steps and not just the final.
    EDIT: Moved deformation out out stress calculations. Separated last evp iterations in order to only calculate when needed and to avoid if statement

…r taubx and tauby is updated on all iterations instead of just the last one. Not used within iteration
Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this is faster! Thanks for refactoring it.

@phil-blain
Copy link
Member

 Timings show that some test are faster others are slow compared to current main.

I think it would be good if we could get a sense of which tests are faster, and which are slower, and by what kind of amount. I'm guessing OpenMP cases will be faster.

@TillRasmussen
Copy link
Contributor Author

Yes it is faster on some of the openmpi test. This is primarily the change in stepu as this has been run as a "standard test". The result is that the change in the seabed stress is only called in 2 test with intel and 2 test with gnu. I have uploaded the results.log
results.log

@apcraig
Copy link
Contributor

apcraig commented Dec 13, 2021

It looks like cases with alt01, alt03, gx1prod, and seabed all have seabed_stress on. Looking at @TillRasmussen's results, I'm not sure anything conclusive can be said about performance changes. If anything, this change looks a little slower in general. Are these change being done to improve performance? If so, I think we need more evidence that it's useful. If these changes are to improve code readability, flexibility, or some other reason, then that's fine too. Lets make sure we know why we're making the change and agree it makes sense.

@eclare108213
Copy link
Contributor

It's hard to tell -- could be a toss-up. A lot of the differences, both + and -, are just in the noise of machine load from one run to the next. Running the entire suite again (including the baseline) could help narrow that uncertainty down. There are a few runs, though, that are more than 20% different and seem to me to be mostly indicating that the new version is faster, but maybe not. These would show up in @apcraig 's green/yellow/red timing colors on the testing page. @TillRasmussen can you upload the test results?

A notable exception - what's going on here?

PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin run 34.68 1.07 1.66
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin test
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin bfbcomp freya_intel_decomp_gx3_4x2x25x29x5_squarepop
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin compare neweapbase 3.97 0.83 1.58

Let's make sure that I'm interpreting the output correctly... the top line here shows the new version timings, and the last line shows the baseline timings, for the timeloop, dynamics and column timers?

@TillRasmussen
Copy link
Contributor Author

TillRasmussen commented Dec 14, 2021

Performance was the motivation. I think that theoretically it should be faster as If statements within do loops should be avoided, especially when they do not depend on i,j etc. There are still if statements within the seabed_stress functions.
For the change in stepu the calculation will add additional computations, however not many.
One reason why it has not changed that much is probably that there are still memory issues and other things larger than this.
Another is the relatively small areas we run as test.

The variability can be the DMI HPC system, we see op to 50 % differences in the run time in the operational setup primarily due to high load on the hard disk (90%+) and wrong usage of a system like that. If you see the example that @eclare108213 then it is not the dynamics nor the thermodynamics that makes the difference

@TillRasmussen
Copy link
Contributor Author

For the reference I reran the example that Elizabeth extracted with the following result
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin run 4.69 1.28 1.66
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin test
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin bfbcomp freya_intel_decomp_gx3_4x2x25x29x5_squarepop
PASS freya_intel_decomp_gx3_4x2x25x29x5_sectrobin compare neweapbase 3.97 0.83 1.58

@TillRasmussen
Copy link
Contributor Author

TillRasmussen commented Dec 20, 2021

A few questions remains:
Deformations calculates 4 arrays at the last iteration of the time stepping. The separation of the last sub cycling iteration and the rest could be eliminated if calculations of deformations is moved to after the last call to stepu. This will cause a small difference in the velocities used for calculation of the deformations but will not influence other results.
Is this acceptable?
Similar change could be done with sea_bed stress .
Can anyone suggest a way to copy the halo_info to halo_info_mask. This would remove the last if statement within the iteration.
Timings can be found in this attachements., The results are still a bit blury.
results.log
A motivation for doing this is to move it to be as similar to the 1d solver without the limiation of no MPI

@TillRasmussen
Copy link
Contributor Author

Timings show that some test are faster others are slow compared to current main.

I think it would be good if we could get a sense of which tests are faster, and which are slower, and by what kind of amount. I'm guessing OpenMP cases will be faster.

Note OpenMP is turned off in the evp subcycling

…b in stepu, remove if from seabed_stress_LKD
@eclare108213
Copy link
Contributor

Is this one ready to be merged? Still waiting for reviews from @JFLemieux73 @phil-blain ?

@phil-blain
Copy link
Member

I don't think it's ready for merging, I'd like to take a closer look and maybe @JFLemieux73 would too. It's not clear to me either if it's still bit for bit ? was the base_suite run with the lastest commit on this branch against main ?

@TillRasmussen
Copy link
Contributor Author

At the time of testing (December) it was bit for bit with the main trunk.
It ran all the following suites.
first_suite,base_suite,travis_suite,decomp_suite,reprosum_suite,quick_suite

@apcraig has created a pull request for OMP. Maybe this should be accepted then I can integrate these changes aftwerwards and resubmit the test.

Comment on lines +678 to 679
real (kind=dbl_kind), dimension (nx_block,ny_block), intent(in) :: &
Cw ! ocean-ice neutral drag coefficient
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice cleanup

Copy link
Member

@phil-blain phil-blain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work @TillRasmussen, I think we are going in the right direction with these changes. I left a few comments.

Also, thanks for crafting separate commits that are easy to review separately, this is very much apprciated.

Commit 3/4 (changed capping from logical to numeric...) could still be splitted even more I think, as it seems to do several distinct things:

  • remove ksub from stepu's interface and adjust callers
  • remove ksub, tarear, shear, divu, rdg_conv and rdg_shear from stress's interface and adjust callers, and moving the call to deformations on frame up.
  • decrease the do ksub loop ending index by 1 in evp and add additional separate calls to stress, stepu , etc after the loop
  • change capping to real in viscous_coeffs_and_rep_pressure and calc_zeta_dPr

Comment on lines +742 to +745
! calculate seabed stress component for outputs
! only needed on last iteration.
taubx(i,j) = -uvel(i,j)*Cb
tauby(i,j) = -vvel(i,j)*Cb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's faster to compute them each time than having an if here, then this indeed makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If statements should in general be avoided and the extra calculations are limited.. Timings are not conclusive. I would like to close this issue as it is above. Next step would be to vectorize uvel_init,, vvel_init and Tbu. These are alle "local" to ice_dyn_shared and only used in stepu. With these changes I would like to see if it makes sense to convert cb to a vector and calculate tauby and taubx after the last call to stepu just as well as deformations. An addiotonal benefit is that this align 1d solver and the 2d solver.

Comment on lines 908 to 909
! if (hwu < threshold_hw) then
docalc_tbu = max(sign(c1,threshold_hw-hwu),c0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this would be easier to read and understand with the merge intrinsic (Intel doc, GFortran doc), which is like a ternary operator:

docalc_tbu = merge(c1, c0, hwu < threshold_hw)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will be in a commit a bit later.

Comment on lines -789 to -792
strintx , & ! divergence of internal ice stress, x (N/m^2)
strinty , & ! divergence of internal ice stress, y (N/m^2)
strairx , & ! stress on ice by air, x-direction
strairy ! stress on ice by air, y-direction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about those deletions. The computations for Hibler-Bryan stresses are still commented in the subroutine, so if we ever add them back and test them, we would have to add these four arguments back in.

So either we also remove the commented code, or we keep as-is. I do not think that having extra arguments impact performance, as Fortran uses pass-by-reference for arrays, So it should have no impact to have extra arguments passed...

Copy link
Contributor Author

@TillRasmussen TillRasmussen Feb 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they should be removed or reimplemented, however I dont have an opininion on which is the right to do, however the Hibler-Bryan stresses has been commented out as long as I remember. I would remove the commented code. @eclare108213 any opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave them there, but commented out.

Comment on lines +744 to +745
taubx(i,j) = -uvel(i,j)*Cb
tauby(i,j) = -vvel(i,j)*Cb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I thought we had made this change already somewhere...

@@ -1401,42 +1399,33 @@ subroutine viscous_coeffs_and_rep_pressure (strength, tinyarea, &

! NOTE: for comp. efficiency 2 x zeta and 2 x eta are used in the code

! if (trim(yield_curve) == 'ellipse') then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we keep this commented if in, @JFLemieux73 plans to eventually add more yield curves, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reinserted it

Comment on lines 482 to 495
!-----------------------------------------------------------------
! on last subcycle, save quantities for mechanical redistribution
!-----------------------------------------------------------------
call deformations (nx_block , ny_block , &
icellt(iblk) , &
indxti(:,iblk) , indxtj(:,iblk) , &
uvel(:,:,iblk) , vvel(:,:,iblk) , &
dxt(:,:,iblk) , dyt(:,:,iblk) , &
cxp(:,:,iblk) , cyp(:,:,iblk) , &
cxm(:,:,iblk) , cym(:,:,iblk) , &
tarear(:,:,iblk) , &
shear(:,:,iblk) , divu(:,:,iblk) , &
rdg_conv(:,:,iblk) , rdg_shear(:,:,iblk) )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's way cleaner to compute the deformations here, good idea. However, can we keep the & and the commas aligned ?

Comment on lines 396 to 397
if (ndte > 1) then
do ksub = 1,ndte-1 ! subcycling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this change is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ndte=1 then some compilers will create warnings and maybe even stop due to a do loop that runs from 1 to 0. The same should be included in the 1d .Now it aborts if ndte < 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be true that the code will stop? Compiler warnings maybe, but setting loop values so a loop won't be executed is done all the time and should be allowed in general. I'm not opposed to the if check, it's not going to affect performance, but I think it's cleaner not being there. Where is the check for ndte<1. It's not in this part of the code, so the if check is needed even less.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a strong motivator is to get rid of if checks in loops. I assume this is for performance. My question continues to be whether we have measured a performance benefit with these changes. If not, then I prefer that we not replicate the calls for the last subcycle step in order to add the deformations call outside an if statement. My suspicion is that the if statement in the subcycling loop is not going to affect performance. Getting rid of if statements inside an ij loop, ok maybe, but not at this high a level in the computations. In my opinion, it's easier to maintain if the last iteration is not separated/duplicated. Maybe there are other reasons for separating that last subcycle. But if not, then I still prefer we not do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in principle with getting rid of if checks in loops. They can make a big difference in performance, but it depends on the loop size and things like how long the pipeline is for alternative commands, which probably varies a good bit across hardware/compiler configurations. Maybe it's gotten better with newer generations. The other considerations here are not having duplicative code (which I also agree with) and making close-but-not-quite-duplicated code for various dynamics options as similar as possible. Since the dynamics is typically the most expensive part of the sea ice simulation, and centers seem to be moving toward higher numbers of subcycles for better convergence, then optimizing these loops seems wise. It's hard to judge, though, if the existing timing tests are inconclusive. Question: in the original code, the deformations routine is called at the end, rather than during the loop. Is the argument for moving it up into the loop (or interrupting the loop) that the computed deformations need to be consistent with the velocities just prior to the last iteration? Or is it not better that the deformations be computed from the final velocities? (Maybe I'm not understanding something here?)

@phil-blain
Copy link
Member

phil-blain commented Feb 18, 2022

Two more things:

  • It would be nice to wrap commits messages at 76 chars or so. Vim/Emacs should do this automatically, if you let Git open them (git commit and not git commit -m "message")

  • Once feedback is adressed and the merge conflicts resolved, I think it would be great to see updated tests (check that everything is bit 4 bit) and also timings.

@TillRasmussen
Copy link
Contributor Author

I will check the error

Copy link
Contributor

@apcraig apcraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try to converge quickly whether additional changes are needed so we can get this merged. My main concern continues to be the splitting out of the last subcycle in the loop for "potential" performance reasons. There is little evidence this helps performance, I wouldn't expect it to help performance as the if is at the level of the subcycling (not deeper), and I think it makes the code more difficult to maintain.

Comment on lines 396 to 397
if (ndte > 1) then
do ksub = 1,ndte-1 ! subcycling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be true that the code will stop? Compiler warnings maybe, but setting loop values so a loop won't be executed is done all the time and should be allowed in general. I'm not opposed to the if check, it's not going to affect performance, but I think it's cleaner not being there. Where is the check for ndte<1. It's not in this part of the code, so the if check is needed even less.

Comment on lines 396 to 397
if (ndte > 1) then
do ksub = 1,ndte-1 ! subcycling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a strong motivator is to get rid of if checks in loops. I assume this is for performance. My question continues to be whether we have measured a performance benefit with these changes. If not, then I prefer that we not replicate the calls for the last subcycle step in order to add the deformations call outside an if statement. My suspicion is that the if statement in the subcycling loop is not going to affect performance. Getting rid of if statements inside an ij loop, ok maybe, but not at this high a level in the computations. In my opinion, it's easier to maintain if the last iteration is not separated/duplicated. Maybe there are other reasons for separating that last subcycle. But if not, then I still prefer we not do it.

@TillRasmussen
Copy link
Contributor Author

TillRasmussen commented Feb 21, 2022

The loop splitting can be avoided if the call to deformations is delayed to after the last stepu.
This will update u, which I assume result in a minor update.
This will cause minor differences to output (divu and shear) and to ridging in icepack through rdg_conv and rdg_shear. Whether this is critical or not I cannot say, but I would think that it is limited
the if (ndte > 1) then is on line 393.
This was put into the 1d solver and I cannot remember if it was in order to avoid warnings or in order to avoid a crash.

@TillRasmussen
Copy link
Contributor Author

TillRasmussen commented Feb 21, 2022

In terms of perfomance then that is the reason for moving if statements away form loops, however this has not helped as much as desired. This indicates other issues that were solved in the 1d solver (memory and communication). These modifications are also steps towards reducing the numnber of 2d arrays within the loop as deformations has been moved out.
This will also improve the maintainance of the 1d solver as the stress_last can be excluded and the convertions between 2d and 1d can be limited.
The test I made on the DMI computer varies in both directions in terms of performance, however especially IO varies on this machine.

@TillRasmussen
Copy link
Contributor Author

For info
results_base.log
For the last commit I changed a line in seabedLKD but nothing in the probabilistic. I compared timings for these just to illustrate the changes runnning more or less the same. Therefore please compare column 2.
freyja-5 CICE/testsuite.neweap190222> diff results.log results_base.log | grep run
< PASS freya_intel_restart_gx1_16x2_debug_gx1apr_seabedLKD_short run 725.68 447.32 144.01

PASS freya_intel_restart_gx1_16x2_debug_gx1apr_seabedLKD_short run 712.58 420.29 147.39
< PASS freya_intel_restart_gx1_15x2_seabedprob run 173.11 31.79 21.84
PASS freya_intel_restart_gx1_15x2_seabedprob run 90.58 17.91 21.45
< PASS freya_intel_smoke_gx1_15x2_reprosum_run10day_seabedprob build
< PASS freya_intel_smoke_gx1_15x2_reprosum_run10day_seabedprob run 260.26 48.42 43.47
COPY freya_intel_smoke_gx1_15x2_reprosum_run10day_seabedprob build
PASS freya_intel_smoke_gx1_15x2_reprosum_run10day_seabedprob run 200.16 35.83 42.05
< PASS freya_intel_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread build
< PASS freya_intel_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread run 118.61 8.66 20.47
COPY freya_intel_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread build
PASS freya_intel_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread run 96.35 8.58 20.57
< PASS freya_gnu_restart_gx1_16x2_debug_gx1apr_seabedLKD_short run 165.58 46.26 41.10
PASS freya_gnu_restart_gx1_16x2_debug_gx1apr_seabedLKD_short run 132.29 46.11 41.19
< PASS freya_gnu_restart_gx1_15x2_seabedprob run 96.85 6.87 17.43
PASS freya_gnu_restart_gx1_15x2_seabedprob run 67.80 6.87 17.41
< PASS freya_gnu_smoke_gx1_15x2_reprosum_run10day_seabedprob build
< PASS freya_gnu_smoke_gx1_15x2_reprosum_run10day_seabedprob run 178.83 13.79 34.80
COPY freya_gnu_smoke_gx1_15x2_reprosum_run10day_seabedprob build
PASS freya_gnu_smoke_gx1_15x2_reprosum_run10day_seabedprob run 88.72 13.84 34.75
< PASS freya_gnu_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread build
< PASS freya_gnu_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread run 143.34 12.84 28.25
COPY freya_gnu_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread build
PASS freya_gnu_smoke_gx1_28x1_cmplogrest_reprosum_run10day_seabedprob_thread run 79.88 12.89 28.31

@@ -176,7 +176,7 @@ subroutine init_dyn (dt)
if (trim(coriolis) == 'constant') then
fcor_blk(i,j,iblk) = 1.46e-4_dbl_kind ! Hibler 1979, N. Hem; 1/s
else if (trim(coriolis) == 'zero') then
fcor_blk(i,j,iblk) = 0.0
fcor_blk(i,j,iblk) = c0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change in this line the entire reason the code is not BFB for cor=0?

Copy link
Contributor Author

@TillRasmussen TillRasmussen Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that is why I put in the comment. I dont think that I reran a test with coriolis=0.0 after the last commit but just a warning if the standard test complains about bit for bit differences.
All other test have been bit for bit with main after Tony's OMP updates.
.

Comment on lines -789 to -792
strintx , & ! divergence of internal ice stress, x (N/m^2)
strinty , & ! divergence of internal ice stress, y (N/m^2)
strairx , & ! stress on ice by air, x-direction
strairy ! stress on ice by air, y-direction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave them there, but commented out.

Comment on lines 396 to 397
if (ndte > 1) then
do ksub = 1,ndte-1 ! subcycling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in principle with getting rid of if checks in loops. They can make a big difference in performance, but it depends on the loop size and things like how long the pipeline is for alternative commands, which probably varies a good bit across hardware/compiler configurations. Maybe it's gotten better with newer generations. The other considerations here are not having duplicative code (which I also agree with) and making close-but-not-quite-duplicated code for various dynamics options as similar as possible. Since the dynamics is typically the most expensive part of the sea ice simulation, and centers seem to be moving toward higher numbers of subcycles for better convergence, then optimizing these loops seems wise. It's hard to judge, though, if the existing timing tests are inconclusive. Question: in the original code, the deformations routine is called at the end, rather than during the loop. Is the argument for moving it up into the loop (or interrupting the loop) that the computed deformations need to be consistent with the velocities just prior to the last iteration? Or is it not better that the deformations be computed from the final velocities? (Maybe I'm not understanding something here?)

@TillRasmussen
Copy link
Contributor Author

Deformations: It was a part of the stress routine. Therefore the loop would be
ksub=1,ndte
call stress; call deformation; call stepu
end
This version keep the order of calls but do not calculate deformation until last iteration, thus
ksub=1,ndte-1
call stress; call stepu
end
call stress;call deformation;call stepu;
I am not sure if it is important to calculate deformations before or after the last call to stepu. I assume that it is placed where it is as it is calculated on the tcells.
A potential rework would be
ksub=1,ndte
call stress; call stepu
end
call deformation

@apcraig
Copy link
Contributor

apcraig commented Feb 22, 2022

Deformations: It was a part of the stress routine. Therefore the loop would be ksub=1,ndte call stress; call deformation; call stepu end This version keep the order of calls but do not calculate deformation until last iteration, thus ksub=1,ndte-1 call stress; call stepu end call stress;call deformation;call stepu; I am not sure if it is important to calculate deformations before or after the last call to stepu. I assume that it is placed where it is as it is calculated on the tcells. A potential rework would be ksub=1,ndte call stress; call stepu end call deformation

Or it could have an if in the loop. The cost of stress and stepu are significantly greater than the if check in the subcycling loop. I'm in favor of getting rid of if checks inside deep, expensive loops, but I still suspect this "if" doesn't meet that criteria and never will. If ultimately we plan to refactor this so it's more efficient with memory and 1d/2d arrays and this change is needed for that, maybe this change should be implemented then.

@apcraig
Copy link
Contributor

apcraig commented Feb 22, 2022

I'm willing to relent on my point about the duplication of calls and cost of the if statement and approve this PR. Are there other outstanding issues? @TillRasmussen @phil-blain @eclare108213 ? I need to know when it's ready to merge.

@TillRasmussen
Copy link
Contributor Author

I dont have more changes.

@eclare108213
Copy link
Contributor

approved (again)

Copy link
Member

@phil-blain phil-blain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave it up to you, @apcraig and @eclare108213, to decide. I really do not like the duplicated calls in the EVP loop. I'm not convinced they impact the performance since it's a very "outer" loop.

@@ -906,7 +912,9 @@ subroutine seabed_stress_factor_LKD (nx_block, ny_block, &
hwu = min(hwater(i,j),hwater(i+1,j),hwater(i,j+1),hwater(i+1,j+1))

! if (hwu < threshold_hw) then
docalc_tbu = max(sign(c1,threshold_hw-hwu),c0)
! docalc_tbu = max(sign(c1,threshold_hw-hwu),c0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this commented line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. I also squashed the last three commits that were review related

… bit for bit if cor=0.0

added legacy comment in ice_dyn_finish
@eclare108213
Copy link
Contributor

It seems that the prevailing sentiment is to keep the calls compact with an if statement in the loop, at least for now. I am leaning that way myself, but am somewhat ambivalent, as noted above. @TillRasmussen one of your main reasons for making this change in the loop is to make it more similar to the 1D evp case - is that right? If so, how much does that case degrade if you do it this way instead?

@TillRasmussen
Copy link
Contributor Author

I assume that you refer to the placement of the deformation call.

Yes converging towards the 1d solver was part of it. The other part was performance but as the work with the 1d solver suggest then I think that the real bottleneck here is the communication calls. I browsed through a number of the test runs and it is often 1/3 of the total run time. Memory issues may be hidden for now in these small test cases with gx3 and gx1.
Moving as much of the diagnostics (deformations) away from the subcyling has the additional benefit that the conversion from 2d to 1d and vice versa would be simpler as eg. divergence is only calculated once outside the loop, thus one could probably skip this transformation for this and most other variables in the 1d loop. This is done in copy_in2d_1d.

At the same time many of the variables within the evp solver are really local to ice_dyn_evp and ice_dyn_shared and only used in either loops for ucells or tcells. These could with benefit be converted to 1d (or 2d if block structure is maintained).

Performance of 1d and multiple calls to deformation
I have not tested it but I am sure that Jacob did so when writing it a few years back. This was why he made the interface structure that essentially separated a version of stress with deformations and one without. It was less obvious what could be moved from stress as the functions strain_rates and deformations did not exist.
I would be very hesitant to reintroduce deformations into the loop of the 1d, however one can always try.

My preferred option would be to verify whether deformations could be moved to after the last call to stepu.

My second option would be to leave it as is but until I get around to clean-up the 1d solver to include deformation and strain rates then it is not that important and it may very well be true that the difference in performance is not that big.

@eclare108213
Copy link
Contributor

My second option would be to leave it as is but until I get around to clean-up the 1d solver to include deformation and strain rates then it is not that important and it may very well be true that the difference in performance is not that big.

Does as-is mean the original coding order, or the way it is in this PR?

@TillRasmussen
Copy link
Contributor Author

As in the PR.
The original code had deformations embedded into stress. I would not like to go back to that.
The 3rd alternative in my opinion is call deformations between stress and stepu.
That is just my preference :)

@eclare108213
Copy link
Contributor

Okay, I see. I'm going to make an executive decision, to keep this moving forward: please call deformations within the loop between stress and stepu, with an if statement. That seems to be the closest to the original code, and we can change it later once we all agree on the best way to do it. I do agree that the preferred option would be to verify that deformations could simply be moved outside of the loop, without calling stress and stepu again. I suspect that it can, but let's put that in an issue for a future PR. Maybe it could be tested as part of the 1D cleanup and fixed appropriately in both approaches.

@eclare108213
Copy link
Contributor

Thank you @TillRasmussen ! Let us know when you're ready for this to be merged (tested, etc).

@TillRasmussen
Copy link
Contributor Author

I think that it is fine now. I ran through one test and it was bit for bit. I can also run it through all ~1500 test that are currently setup, however this will be ready in 3-4 hours, thus that would be tomorrow

@apcraig
Copy link
Contributor

apcraig commented Feb 22, 2022

Thank you @TillRasmussen for the latest changes. I think it's great to get the deformations call out of stress and to have the current implementation. I was just looking at vp and eap and they seem to handle the deformations call differently (eap doesn't seem to call it at all), so it shouldn't affect the other methods. I also had a quick peak at the 1d solver again. It's quite different structurally. I don't know how important it is to try to make the 1d and 2d more alike. I guess one thing that might be nice would be to find a way to use the same code to compute the underlying science, but I don't think that's going to be possible given the very different memory structure of the two versions. To me, that's the bigger risk, that someone updates the 2d implementation and not the 1d version, so the solutions then diverge. Should we be considering how to address that issue?

As a sanity check, I will run a test suite with the latest code on cheyenne to check bit-for-bit. It it passes, I will approve the PR and then merge it (unless anyone else has any concerns and expresses them here). Thanks.

@TillRasmussen
Copy link
Contributor Author

@apcraig They are not that different as they appear.
eg. Function stress
Geometrical parameters can be calculated as scalars as it is the case in the 1d solver
Function strain_rate could easily be rewritten to accept scalars as input.
viscous_coeffs_and_rep_pressure already do this
stress_iter and stress_last can be merged to one now that deformations are out of stress.

The issue is with
stressp and stressm are 1d/2d
str is split in 8 and it is in 1d/2d
where there is a 1d/2d issue.

@apcraig
Copy link
Contributor

apcraig commented Feb 23, 2022

@apcraig apcraig merged commit a59fa47 into CICE-Consortium:main Feb 23, 2022
@TillRasmussen TillRasmussen deleted the neweap branch February 23, 2022 21:50
apcraig added a commit to apcraig/CICE that referenced this pull request Mar 8, 2022
* update icepack, rename snwITDrdg to snwitdrdg (CICE-Consortium#658)

* Change max_blocks for rake tests on izumi (nothread). (CICE-Consortium#665)

* Fix some raketests for izumi

* fix some rake tests

* Makefile: make Fortran object files depend on their dependency files (CICE-Consortium#667)

When 'make' is invoked on the CICE Makefile, the first thing it does is
to try to make the included dependency files (*.d) (which are in fact
Makefiles themselves) [1], in alphabetical order.

The rule to make the dep files have the dependency generator, 'makdep',
as a prerequisite, so when processing the first dep file, make notices
'makdep' does not exist and proceeds to build it. If for whatever reason
this compilation fails, make will then proceed to the second dep file,
notice that it recently tried and failed to build its dependency
'makdep', give up on the second dep file, proceed to the third, and so
on.

In the end, no dep file is produced. Make then restarts itself and
proceeds to build the code, which of course fails catastrophically
because the Fortran source files are not compiled in the right order
because the dependency files are missing.

To avoid that, add a dependency on the dep file to the rules that make
the object file out of the Fortran source files. Since old-fashioned
suffix rules cannot have their own prerequisites [2], migrate the rules
for the Fortran source files to use pattern rules [3] instead. While at
it, also migrate the rule for the C source files.

With this new dependency, the builds abort early, before trying to
compile the Fortran sources, making it easier to understand what
has gone wrong.

Since we do not use suffix rules anymore, remove the '.SUFFIXES' line
that indicates which extension to use suffix rules for (but keep the
line that eliminates all default suffix rules).

[1] https://www.gnu.org/software/make/manual/html_node/Remaking-Makefiles.html
[2] https://www.gnu.org/software/make/manual/html_node/Suffix-Rules.html
[3] https://www.gnu.org/software/make/manual/html_node/Pattern-Rules.html#Pattern-Rules

* Fix multi-pe advection=none bug (CICE-Consortium#664)

* update parsing scripts to improve robustness, fix multi-pe advection=none

* Update cice script to improve performance including
minor refactoring of parse_namelist and parse_settings
to reduce cost and ability to use already setup ice_in file
from a prior case in the suite.

Added commented out timing ability in cice.setup.

Change test default to PEND from FAIL.

* fix cice.setup for case

* add sedbak implementation to support Mac sed

* s/spend/spent

* nuopc/cmeps driver updates (CICE-Consortium#668)


* add debug_model feature
* add required variables and calls for tr_snow

* Main namelist debug (CICE-Consortium#671)

* Adding method to write erroneous namelist options

* Remove erroneous comma in abort_ice for namelist check

* Added check for zbgc_nml. I missed that namelist in this file.

* Added space and colons for namelist error output

* Added space and colons for namelist error output

Co-authored-by: David A. Hebert <[email protected]>

* NUOPC/CMEPS cap updates (CICE-Consortium#670)

* updated orbital calculations needed for cesm

* fixed problems in updated orbital calculations needed for cesm

* update CICE6 to support coupling with UFS

* put in changes so that both ufsatm and cesm requirements for potential temperature and density are satisfied

* update icepack submodule

* Revert "update icepack submodule"

This reverts commit e70d1ab.

* update comp_ice.backend with temporary ice_timers fix

* Fix threading problem in init_bgc

* Fix additional OMP problems

* changes for coldstart running

* Move the forapps directory

* remove cesmcoupled ifdefs

* Fix logging issues for NUOPC

* removal of many cpp-ifdefs

* fix compile errors

* fixes to get cesm working

* fixed white space issue

* Add restart_coszen namelist option

* Update NUOPC cap to work with latest CICE6 master

* nuopc,cmeps or s2s build updates

* fixes for dbug_flag

* Update nuopc2 to latest CICE master

* Fix some merge problems

* Fix dbug variable

* Manual merge of UFS changes

* fixes to get CESM B1850 compset working

* refactored ice_prescribed_mod.F90 to work with cdeps rather than the mct data models

* Fix use_restart_time

* changes for creating masks at runtime

* added ice_mesh_mod

* implemented area correction factors as option

* more cleanup

* Fix dragio

* Fix mushy bug

* updates to nuopc cap to resolve inconsistency between driver inputs and cice namelists

* changed error message

* added icepack_warnings_flush

* updates to get ice categories working

* updates to have F compset almost working with cice6 - still problems in polar regions - need to resolve 253K/cice6 versus 273K/cice5 differences

* changed tolerance of mesh/grid comparison

* added issues raised in PR

* Update CESM-CICE sync with new time manager

* Add back in latlongrid

* Add new advanced snow physics to driver

* Fix restart issue with land blocks

* Update mesh check in cap

* fix scam problems

* reintroduced imesh_eps check

* Put dragio in the namelist instead

* Remove redundant code

* Fix some indents

Co-authored-by: Mariana Vertenstein <[email protected]>
Co-authored-by: apcraig <[email protected]>
Co-authored-by: Denise Worthen <[email protected]>

* Add CESM1_PIO for fill value check (CICE-Consortium#675)

* Add CESM1_PIO for fill value check

* Revert PIO_FILL_DOUBLE change for now

* - Update the namelist read to make the group order flexible. (CICE-Consortium#677)

- Remove recent update to namelist read that traps bad lines, it conflicts
  with flexibility to read groups in random order picked up by NAG.
- change print* statements to write(nu_diag,*)

* Port to Narwhal and add perf_suite (CICE-Consortium#678)

* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite

* Update OMP (CICE-Consortium#680)

* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite

* Review and update OMP implementation

- Fix call to timers in block loops
- Fix some OMP Private variables
- Test OMP Scheduling, add SCHEDULE(runtime) to some OMP loops
- Review column and advection OMP implementation
- ADD OMP_TIMERS CPP option (temporary) to time threaded sections
- Add timer_tmp timers (temporary)
- Add omp_suite.ts test suite
- Add ability to set OMP_SCHEDULE via options (ompscheds, ompscheds1, ompschedd1)

* - Review diagnostics OMP implementation
- Add timer_stats namelist to turn on extra timer output information
- Add ICE_BFBTYPE and update bit-for-bit comparison logic in scripts
- Update qc and logbfb testing
- Remove logbfb and qchkf tests, add cmplog, cmplogrest, cmprest set_env files to set ICE_BFBTYPE
- Update documentation

* Update EVP OMP implementation

* - Refactor puny/pi scalars in eap dynamics to improve performance
- Update OMP in evp and eap

* Clean up

* Comment out temporary timers

* Update OMP env variables on Narwhal

* Update gaffney OMP_STACKSIZE

* update OMP_STACKSIZE on cori

* Update Onyx OMP_STACKSIZE
Update documentation

* Update OMP_STACKSIZE on mustang

* - Update Tsfc values on land in various places in the code, was affecting testing.  Specifically fix upwind advection.
- Comment out OMP in ice_dyn_evp_1d_kernel, was producing non bit-for-bit results with different thread counts

* updating LICENSE.pdf for 2022

* seabed stress - remove if statements (CICE-Consortium#673)

* refactor seabed_stress. Bit for bit

* Removed if statement from stepu. Results are binary identical, however taubx and tauby is updated on all iterations instead of just the last one. Not used within iteration

* changed capping from logical to numeric in order to remove if statement. Moved call to deformation out of loop

* clean dyn_finish, correct intent(inout) to intent(in) for Cw, resse Cb in stepu, remove if from seabed_stress_LKD

* Reolve conflicts after updating main

* modified environment for Freya to accomodate for additional OMP commands

* Requested changes after review. Only changed in seabed stress and not bit for bit if cor=0.0

added legacy comment in ice_dyn_finish

* move deformation to subcycling

* - Update version and copyright. (CICE-Consortium#691)

- Remove gordon and conrad machines.
- Add setenv OMP_STACKSIZE commented out in env files
- Update Icepack to fc4b809

* add OMP_STACKSIZE for koehr (CICE-Consortium#693)

* Update C/CD deformations calls to be consistent with main B changes
Update tauxbx, tauxby calculations on C/CD to be consistent with main B changes

* Update OpenMP in C/CD implementation
Extend omp_suite to include C/CD tests

* reconcile recent merge problem

* set default value of capping to 0. in vp cases for backwards compatibility

* Set capping to 1.0 in vp consistent with evp, changes answers for vp configurations

Co-authored-by: David A. Bailey <[email protected]>
Co-authored-by: Philippe Blain <[email protected]>
Co-authored-by: Denise Worthen <[email protected]>
Co-authored-by: daveh150 <[email protected]>
Co-authored-by: David A. Hebert <[email protected]>
Co-authored-by: Mariana Vertenstein <[email protected]>
Co-authored-by: Elizabeth Hunke <[email protected]>
Co-authored-by: TRasmussen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants