
cam6_4_050: clubb_intr GPUization #1175

Open: wants to merge 15 commits into base: cam_development
Conversation

huebleruwm

This only modifies clubb_intr.F90 and doesn't require a new version of clubb. The purpose of this is the addition of acc directives, added in order to offload computations to GPUs. Besides the directives, this mainly consists of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (nzm_clubb and nzt_clubb) that improves readability and will make it easier to merge in future versions of clubb. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, which this version of clubb_intr is also compatible with.

This is BFB on CPUs (tested with intel), runs with intel+debugging, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (do_clubb_mf, do_rainturb, do_cldcool, clubb_do_icesuper, single_column), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.

If something ends up being wrong with these changes, then it would be nice to at least get in this version, an earlier commit that contains only a new OpenACC data statement and some timer additions.

@Katetc Katetc self-requested a review October 21, 2024 21:43
Collaborator

@nusbaume nusbaume left a comment

Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

! to zero.
fcor(:) = 0._r8
! Set the ztodt timestep in pbuf for SILHS
ztodtptr(:) = 1.0_r8*hdtime
Collaborator

Maybe this is a question for @Katetc, but why multiply by one instead of just using hdtime directly?

Author

@huebleruwm huebleruwm Oct 28, 2024

I have no idea why this is, but I was wondering the same thing. The comment implies it's only for SILHS though, which has a dubious future in cam, so I left it untouched.

Collaborator

I very vaguely remember some kind of PGI compiler issue with pointers and some PBuf variables like 10 years ago... and this was some kind of tacky patch. I'm sure you could remove this multiplier without an issue now.

Comment on lines +2863 to +2864
!$acc state1, state1%q, state1%u, state1%v, state1%t, state1%pmid, state1%s, state1%pint, &
!$acc state1%zm, state1%zi, state1%pdeldry, state1%pdel, state1%omega, state1%phis, &
Collaborator

@nusbaume nusbaume Oct 24, 2024

Just curious as to why you have to bring in state1 in addition to the relevant state1 variables (e.g. state1%q)?

Author

@huebleruwm huebleruwm Oct 28, 2024

Honestly, I don't know anymore either. I remember at one point having trouble with this and finding that you did have to allocate the derived type itself on the GPU (create/copyin), so I just kept doing it that way, but I discovered recently that it's completely unnecessary. However, I'm not sure whether not needing it is something NVIDIA has implemented, or part of the API standard, or whether the trouble I encountered long ago was just a bug. Common advice from the internet is to copy in the user-defined type itself, but that might be for types that contain scalars or fixed-size arrays, since those values are contained in the memory storing the type variable.

So long story short - just in case we need it.

… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
@huebleruwm
Author

> Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

Good finds, I took you up on every suggestion.

@huebleruwm
Author

huebleruwm commented Oct 28, 2024

I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I guessed that the biggest performance killer was a massive loop I made to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seems to have been the culprit: I split the loop up here around line 2991, and the result was slightly faster than the original code.

Here's the timing output comparison now. I ran these a couple times and got roughly the same results.

From the cam_dev head I started with (cam6_4_038)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 135.3464    124.0819    366     146.0524    42     
  clubb_tend_cam                               512    512    MULTIPLE 132.0139    121.0235    366     142.8016    166    
    clubb_tend_cam_i_loop                      512    512    MULTIPLE 111.1615    102.4363    366     123.0577    166    

From the head of this branch (75f51d1)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 132.7466    123.5249    366     144.4218    149    
  clubb_tend_cam                               512    512    MULTIPLE 129.4563    120.4303    366     140.8711    149    
    clubb_tend_cam:ACCR                        512    512    MULTIPLE 107.8117    100.6424    366     117.3765    149    
      clubb_tend_cam:advance_clubb_core_api    512    512    MULTIPLE 91.2902     84.0197     366     100.7022    149    
      clubb_tend_cam:flip-index                512    512    MULTIPLE 11.2684     9.9628      220     13.7559     455    
   clubb_tend_cam:NAR                          512    512    MULTIPLE 21.4333     19.0054     401     24.0401     76     
                      qneg3                    512    512    MULTIPLE 0.5783      0.5268      474     0.6311      66     
   clubb_tend_cam:acc_copyin                   512    512    MULTIPLE 0.0059      0.0045      358     0.0081      260    
   clubb_tend_cam:acc_copyout                  512    512    MULTIPLE 0.0038      0.0025      314     0.0061      149    

So it seems these changes made clubb_tend_cam faster by ~2%.

I ran all the same tests mentioned above to confirm these new changes.

Collaborator

@Katetc Katetc left a comment

This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!

My comments are mostly questions so I can understand what is going on as best as possible.

I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.

sclrm_forcing, & ! Passive scalar forcing [{units vary}/s]
sclrm, & ! Passive scalar mean (thermo. levels) [units vary]
sclrp2, & ! sclr'^2 (momentum levels) [{units vary}^2]
sclrp3, & ! sclr'^3 (thermo. levels) [{units vary}^3]
sclrprtp, & ! sclr'rt' (momentum levels) [{units vary} (kg/kg)]
sclrpthlp, & ! sclr'thlp' (momentum levels) [{units vary} (K)]
wpsclrp ! w'sclr' (momentum levels) [{units vary} m/s]
wpsclrp, & ! w'sclr' (momentum levels) [{units vary} m/s]
sclrpthvp_inout ! sclr'th_v' (momentum levels) [{units vary} (K)]
Collaborator

This variable had a straight "pverp" vertical dimension before, so if "top_lev" is not 1 then it will be dimensioned differently here... however, I think that's correct. Not sure it was meant to be at all levels no matter what.

Author

You're right that it's dimensioned differently, but that was intentional. I noticed that we never used the whole array, only ever up to pverp+1-top_lev, which is now nzm_clubb, so I set it to that instead.

if ( single_column .and. .not. scm_cambfb_mode ) then
call endrun(subr//': (single_column && !scm_cambfb_mode)=.true. not available when compiling with OpenACC')
end if
#endif
Collaborator

I'm still learning about GPUs, but I'm curious why we have to have an endrun here instead of just running non-GPU code on the CPUs?

Author

@huebleruwm huebleruwm Nov 11, 2024

That was my first instinct too. It would take some acc update host and acc update device directives to move the data we need back and forth inside the if-statements, and there would be no issue with doing so. The reason I didn't do that is that I was testing the GPU results against a CPU ensemble baseline that I had run with those options turned off. So to confirm those changes I would've had to run another baseline with those options turned on, which would've taken at least another day, and I was trying to get this in quickly. Also, I asked Vince about the options, and he indicated they weren't run often or were test options that didn't turn out well.

So I thought it was safest to leave the code in those sections untouched and just disable the options for GPU runs. I did do a little test to confirm the CPU code ran to completion with them enabled, though, except clubb_do_adv, which didn't run to begin with due to some error in cnst_add called from clubb_register_cam.


endif

call t_startf('clubb_tend_cam:flip-index')
Collaborator

So, I might be confused, but it looks like this little else section above isn't included in a timer. Is that what we want? It's probably not a big deal, but I think timers should cover everything.

Author

The whole if statement is inside a larger timing section labelled "ACCR" (for accelerated region). The if branch isn't GPUized, though, so it stops the "ACCR" timer and starts the "NAR" timer (non-accelerated region), then does the opposite at the end. This was needed because the if branch contains a call to calc_ustar, which I didn't want to GPUize (just so that all changes were contained in clubb_intr), but the else branch is GPUized, so its cost is included in the surrounding ACCR timer.

do i=1,ncol
rvm_in(i,k) = rtm_in(i,k) - rcm_inout(i,k)
end do
end do

call update_xp2_mc_api( gr, nlev+1, ncol, dtime, cloud_frac_inout, &
call t_startf('clubb_tend_cam:update_xp2_mc_api')
Collaborator

I understand why this timer is just around the call to update_xp2_mc_api, but I kind of think it should include the double loop above that is called right before every time. Just to include all related calculations. Or add a whole "rain_turb" timer around this one.

Author

I think the "rain_turb" timer makes the most sense. I don't know why I didn't do that to begin with; I've changed the timers in commit b1a243c.

if (macmic_it==1) vpwp_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) thlp2_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) wpthlp_clubb_gw_mc(:ncol,:) = 0._r8
call t_stopf('clubb_tend_cam:flip-index')

do t=1,nadv ! do needed number of "sub" timesteps for each CAM step
Collaborator

This loop: it's kind of hard to tell, but I think it's the larger loop for the cld_mac_mic run. Shouldn't this have an !$acc tag?

Author

This is the overall sub-stepping loop, so it has to run sequentially. It also contains the call to advance_clubb_core and some other smaller routines, so it wouldn't be possible to offload it to the GPU.

@peverwhee peverwhee changed the title clubb_intr GPUization cam6_4_048: clubb_intr GPUization Nov 11, 2024
@peverwhee peverwhee changed the title cam6_4_048: clubb_intr GPUization cam6_4_049: clubb_intr GPUization Nov 12, 2024
@peverwhee peverwhee changed the title cam6_4_049: clubb_intr GPUization cam6_4_050: clubb_intr GPUization Nov 13, 2024
Collaborator

@cacraigucar cacraigucar left a comment

Since I was curious, I took a cursory look at this PR. I did find one minor item which I believe would be good to address (since it took me a while to find the answer). This is not intended to be a full review.

Comment on lines +63 to +64
nzm_clubb, &
nzt_clubb
Collaborator

Could a comment be added describing what each of these new variables is? (I saw they were documented in the code, but it is common to document them where they are declared.)
