cam6_4_050: clubb_intr GPUization #1175
base: cam_development
Conversation
…e options from running when compiled with OpenACC
… by breaking BFBness. All these values are ~0 but not always exactly.
…t and BFB on CPUs.
Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!
```fortran
! to zero.
fcor(:) = 0._r8

! Set the ztodt timestep in pbuf for SILHS
ztodtptr(:) = 1.0_r8*hdtime
```
Maybe this is a question for @Katetc, but why multiply by one instead of just using `hdtime` directly?
I have no idea why this is, but I was wondering the same thing. The comment implies it's only for SILHS, though, which has a dubious future in CAM, so I left it untouched.
I very vaguely remember some kind of PGI compiler issue with pointers and some PBuf variables like 10 years ago... and this was some kind of tacky patch. I'm sure you could remove this multiplier without an issue now.
```fortran
!$acc state1, state1%q, state1%u, state1%v, state1%t, state1%pmid, state1%s, state1%pint, &
!$acc state1%zm, state1%zi, state1%pdeldry, state1%pdel, state1%omega, state1%phis, &
```
Just curious as to why you have to bring in `state1` in addition to the relevant `state1` variables (e.g. `state1%q`)?
Honestly, I don't know either anymore. I remember at one point having trouble with this and finding out that you did have to allocate the derived type itself on the GPU (create/copyin), so I just kept doing it that way, but I discovered recently that it's completely unnecessary. However, I'm not sure if the lack of that requirement is something NVIDIA has implemented, or if it's part of the API standard, or if the trouble I encountered forever ago was just a bug. Common advice from the internet is to copy in the user-defined type itself, but that might be for types that contain scalars or fixed-size arrays, since those values are stored in the memory holding the type variable.
So, long story short: just in case we need it.
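For readers unfamiliar with the two patterns being discussed, here is a minimal sketch (illustrative member names, not code from this PR) of copying the derived type itself versus only its members:

```fortran
! Illustrative only: state1 is a derived type with allocatable members.

! Pattern A: allocate the derived type itself on the GPU first, then
! copy its members so they attach to the device copy of state1:
!$acc enter data copyin(state1)
!$acc enter data copyin(state1%q, state1%u, state1%v)

! Pattern B (reportedly sufficient now): copy only the members; the
! runtime attaches them to an implicitly created device container:
!$acc enter data copyin(state1%q, state1%u, state1%v)

! Matching cleanup for either pattern:
!$acc exit data delete(state1%q, state1%u, state1%v)
```

For types containing only scalars or fixed-size arrays, Pattern A alone moves all the data, since those values live inside the memory of the type variable itself; allocatable members are what require the extra per-member clauses.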
… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
Good finds, I took you up on every suggestion.
I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I took a guess that the biggest performance killer was a massive loop I made to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seemed to be the culprit: I split the loop up here around line 2991, and the result was slightly faster than the original code. Here's the timing output comparison now; I ran these a couple of times and got roughly the same results. From the cam_dev head I started with (cam6_4_038):
From the head of this branch (75f51d1):
So it seems these changes made the code slightly faster overall. I ran all the same tests mentioned above to confirm these new changes.
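As a minimal sketch of the loop-splitting change described above (the array names are made up; the real code zeros many CLUBB arrays):

```fortran
! Fused form: one loop nest touches many arrays per iteration.
! Fine on the GPU, but it can defeat CPU vectorization and streaming.
do k = 1, nz
   do i = 1, ncol
      a(i,k) = 0._r8
      b(i,k) = 0._r8
      c(i,k) = 0._r8
   end do
end do

! Split form (as done around line 2991): each nest streams through a
! single array, which CPU compilers vectorize more readily.
do k = 1, nz
   do i = 1, ncol
      a(i,k) = 0._r8
   end do
end do
do k = 1, nz
   do i = 1, ncol
      b(i,k) = 0._r8
   end do
end do
```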
This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!
My comments are mostly questions so I can understand what is going on as best as possible.
I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.
```diff
 sclrm_forcing, & ! Passive scalar forcing [{units vary}/s]
 sclrm, & ! Passive scalar mean (thermo. levels) [units vary]
 sclrp2, & ! sclr'^2 (momentum levels) [{units vary}^2]
 sclrp3, & ! sclr'^3 (thermo. levels) [{units vary}^3]
 sclrprtp, & ! sclr'rt' (momentum levels) [{units vary} (kg/kg)]
 sclrpthlp, & ! sclr'thlp' (momentum levels) [{units vary} (K)]
-wpsclrp ! w'sclr' (momentum levels) [{units vary} m/s]
+wpsclrp, & ! w'sclr' (momentum levels) [{units vary} m/s]
+sclrpthvp_inout ! sclr'th_v' (momentum levels) [{units vary} (K)]
```
This variable had a straight "pverp" vertical dimension before, so if "top_lev" is not 1 it will be dimensioned differently here... however, I think that's correct. I'm not sure it was meant to be at all levels no matter what.
You're right that it's dimensioned differently, but that was intentional. I noticed that we never used the whole array, only ever up to `pverp+1-top_lev`, which is now `nzm_clubb`, so I set it to that instead.
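In other words (a sketch of the relationship, using the names from this thread):

```fortran
! nzm_clubb is the number of CLUBB momentum (interface) levels actually
! used, counted from top_lev down to the surface interface:
nzm_clubb = pverp + 1 - top_lev   ! equals pverp when top_lev == 1
```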
```fortran
if ( single_column .and. .not. scm_cambfb_mode ) then
   call endrun(subr//': (single_column && !scm_cambfb_mode)=.true. not available when compiling with OpenACC')
end if
#endif
```
I'm still learning about GPUs, but I'm curious why we have to have an endrun here instead of just running non-GPU code on the CPUs?
That was my first instinct too. It would take a few `acc update host` and `acc update device` directives to move the data we need back and forth inside the if-statements, and there would be no issue with doing so. The reason I didn't do that is that I was testing the GPU results against a CPU ensemble baseline that I had run with those options turned off. So to confirm those changes I would've had to run another baseline with those options turned on, which would've taken at least another day, and I was trying to get this in quickly. Also, I asked Vince about the options, and he indicated they weren't run often, or were more so test options that didn't turn out well.
So I thought it was safest to leave the code in those sections untouched and just disable the options for GPU runs. I did do a little test to confirm the CPU code ran to completion with them enabled, though, except `clubb_do_adv`, which didn't run to begin with due to some error in `cnst_add` called from `clubb_register_cam`.
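For context, the alternative being described would look roughly like this (hypothetical variable names; the routine name is illustrative, not from the PR):

```fortran
if (do_cldcool) then
   ! Bring the fields the CPU-only code path needs back to the host...
   !$acc update host(thlm, rtm)

   call cpu_only_cldcool(thlm, rtm)   ! untouched non-GPU code path

   ! ...then refresh the device copies before re-entering GPU code.
   !$acc update device(thlm, rtm)
end if
```

The cost of disabling the option instead is only an `endrun` check at startup, which is why it was the safer choice without a matching baseline.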
```fortran
endif

call t_startf('clubb_tend_cam:flip-index')
```
So, I might be confused, but it looks like this little else section above isn't included in a timer. Is that what we want? It's probably not a big deal, but I think timers should cover everything.
The whole if statement is inside a larger timing section labelled "ACCR" (for accelerated region). The `if` part isn't GPUized though, so it stops the "ACCR" timer and starts the "NAR" timer (non-accelerated region), then does the opposite at the end. This was needed because the `if` branch contains a call to `calc_ustar`, which I didn't want to GPUize (just so that all changes were contained in clubb_intr), but the `else` branch is GPUized, so its cost is included in the surrounding "ACCR" timer.
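A sketch of that timer interleaving (the labels follow the comment above; the exact timer strings in the code may differ):

```fortran
call t_startf('ACCR')          ! accelerated region begins
! ... GPUized code ...

if (cpu_branch) then
   call t_stopf('ACCR')        ! pause the accelerated-region timer
   call t_startf('NAR')        ! non-accelerated region
   ! ... CPU-only work, e.g. the calc_ustar call ...
   call t_stopf('NAR')
   call t_startf('ACCR')       ! resume the accelerated-region timer
else
   ! GPUized branch: its cost lands in the surrounding ACCR timer
end if

! ... more GPUized code ...
call t_stopf('ACCR')
```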
src/physics/cam/clubb_intr.F90
```diff
   do i=1,ncol
     rvm_in(i,k) = rtm_in(i,k) - rcm_inout(i,k)
   end do
 end do

-call update_xp2_mc_api( gr, nlev+1, ncol, dtime, cloud_frac_inout, &
+call t_startf('clubb_tend_cam:update_xp2_mc_api')
```
I understand why this timer is just around the call to update_xp2_mc_api, but I kind of think it should include the double loop above, which runs right before it every time, just to include all related calculations. Or add a whole "rain_turb" timer around this one.
I think the "rain_turb" timer makes the most sense. I don't know why I didn't do that to begin with; I've changed the timers in commit b1a243c.
```fortran
if (macmic_it==1) vpwp_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) thlp2_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) wpthlp_clubb_gw_mc(:ncol,:) = 0._r8
call t_stopf('clubb_tend_cam:flip-index')

do t=1,nadv ! do needed number of "sub" timesteps for each CAM step
```
This loop: it's kind of hard to tell, but I think it's a larger loop for the cld_mac_mic run. Shouldn't this have an `!$acc` tag?
This is the overall sub-stepping loop, so it has to be run sequentially. It also contains the call to `advance_clubb_core` and some other smaller routines, so it wouldn't be possible to offload it to the GPU.
…st the function call.
Since I was curious, I took a cursory look at this PR. I did find one minor item which I believe would be good to add (since it took me a while to find the answer). This is not intended to be a full review.
```fortran
nzm_clubb, &
nzt_clubb
```
Could a comment be added to describe what each of these new variables is? (I saw they were documented in the code, but it is somewhat common to document them where they are declared.)
This only modifies clubb_intr.F90 and doesn't require a new version of CLUBB. The purpose of this PR is the addition of `acc` directives, added in order to offload computations to GPUs. Besides the directives, this mainly consists of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (`nzm_clubb` and `nzt_clubb`) that improves readability and will make it easier to merge in with future versions of CLUBB. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, which this version of clubb_intr is also compatible with.

This is BFB on CPUs (tested with intel), runs with intel+debugging, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (`do_clubb_mf`, `do_rainturb`, `do_cldcool`, `clubb_do_icesuper`, `single_column`), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.

If there ends up being something wrong with these changes, then this version, which is an earlier commit that contains only a new OpenACC data statement and some timer additions, would be nice to get in at least.