Fusing kernels for calculating diagnostics to improve performance #1483
The library computes the derivatives that are required to compute the tendencies, but they are not stored, since that would not be very efficient. Getting some of these values out could be possible, but I don't know what that would look like. Also, if you are not computing this field at every time step, the cost of computing it separately might not be that high, but that of course depends on the particular problem you are dealing with.
For better or for worse, Oceananigans currently does not store intermediate terms in the computation of a PDE's right-hand side (with the notable exceptions of hydrostatic pressure and eddy diffusivities). In other words, a single, sometimes large kernel evaluates the right-hand side at each grid point.

It's important to note, when considering optimization strategies, that our computations are probably memory-limited rather than compute-limited. In other words, we think the process of transferring data from global memory to local thread memory is a bottleneck for our computations (we can really only know this through profiling a particular application, however, since all models are different...). Storing intermediate components of the tendency terms would probably create more memory accesses overall (since rather than immediately using intermediate results for subsequent calculations, we would have to send them to global memory, and then back, to complete the evaluation of a tendency), and thus could slow down tendency evaluations that are performed 1-3 times per time step. For example, our best idea for speeding up tendency evaluations is to better manage memory movement using GPU shared memory (unfortunately, we haven't had the time to explore such optimization strategies...).

I think there may be other ways to optimize diagnostics calculations, however.

Fusing `ComputedField` kernels

Currently, every `ComputedField` is computed with a kernel like
```julia
@kernel function _compute!(data, operand)
    i, j, k = @index(Global, NTuple)
    @inbounds data[i, j, k] = operand[i, j, k]
end
```
where `operand` is an `AbstractOperation`. But different `ComputedField`s may depend on the same underlying data in memory. Thus if the kernels for different `ComputedField`s are fused into one, we overlap memory accesses for different computations. Our computations are usually memory-limited, so it's possible this strategy could produce significant speed-ups. For example, for two `ComputedField`s we might have something like
```julia
function compute!(field1, field2)
    # launches _compute_two!(field1.data, field2.data, field1.operand, field2.operand)
end
```
and the kernel
```julia
@kernel function _compute_two!(data1, data2, operand1, operand2)
    i, j, k = @index(Global, NTuple)
    @inbounds data1[i, j, k] = operand1[i, j, k]
    @inbounds data2[i, j, k] = operand2[i, j, k]
end
```
There should also be a way to generalize to the nth case using some `ntuple` magic.
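As a plain-CPU sketch of that generalization (ordinary loops stand in for the `@kernel`, and `compute_many!` and the callable operands are invented for illustration, not Oceananigans API):

```julia
# CPU sketch of an N-ary fused compute. In Oceananigans this would be a
# KernelAbstractions @kernel; plain loops are used here to show the idea.
# The "operands" are just callables (i, j, k) -> value standing in for
# AbstractOperations.
function compute_many!(datas::NTuple{N, AbstractArray}, operands::NTuple{N, Any}) where N
    for k in axes(datas[1], 3), j in axes(datas[1], 2), i in axes(datas[1], 1)
        # ntuple unrolls the per-field assignments, so all N fields are
        # filled during a single sweep over the grid.
        ntuple(n -> (datas[n][i, j, k] = operands[n](i, j, k)), Val(N))
    end
    return datas
end

u = rand(4, 4, 4)
d1 = similar(u); d2 = similar(u)
op1 = (i, j, k) -> u[i, j, k]^2   # stand-ins for AbstractOperations
op2 = (i, j, k) -> 2 * u[i, j, k]
compute_many!((d1, d2), (op1, op2))
```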
(Note that we tried this with tracer kernels previously without obtaining any speed-up, but overlapping `ComputedField`s could be a more promising application of this technique.)
Using `mapreduce` for averaging `AbstractOperation`s
We also might be able to apply `mapreduce` directly to `AbstractOperation`s, rather than using our current strategy of storing the intermediate result of an operation in a `ComputedField` and then averaging the intermediate result. This technique is discussed in #1422.
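A minimal sketch of the idea, using a toy lazy type in place of a real `AbstractOperation` (all names here are invented for illustration):

```julia
# Toy "lazy operation": values are computed on indexing, never materialized.
# This mimics how an AbstractOperation computes values on the fly.
struct LazySquare{A}
    a::A
end
Base.getindex(o::LazySquare, I...) = o.a[I...]^2
Base.length(o::LazySquare) = length(o.a)

# Averaging via mapreduce over the indices: no intermediate array is
# allocated, unlike the store-then-average ComputedField strategy.
lazy_mean(o) = mapreduce(I -> o[I], +, CartesianIndices(o.a)) / length(o)

a = reshape(collect(1.0:8.0), 2, 2, 2)
m = lazy_mean(LazySquare(a))   # mean of a.^2 with no intermediate field
```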
Oh yeah, for sure! The idea is that they would be stored only if the user specified interest in them. That way, instead of calculating twice and storing once, the code would calculate and store once. The default behavior would still be calculating and not storing, though.
That's interesting. I wasn't aware there was only one kernel for the whole RHS. That's pretty smart.
That's a really cool idea. I can see how we could do that with
I think we understand how to write a
(Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl, lines 195 to 196 at 9b52f3f)
where
(Oceananigans.jl/src/OutputWriters/fetch_output.jl, lines 13 to 16 at 9b52f3f)
which in turn calls
But we want to trigger one call to
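One way to picture "one call per unique field" is to deduplicate the computed fields among the outputs before fetching. A toy sketch (these types and functions are invented for illustration; this is not the actual `OutputWriters` code):

```julia
# Toy stand-in for a ComputedField: tracks how many times it was computed.
mutable struct ToyComputedField
    data::Float64
    operand::Function
    ncomputes::Int
end
compute!(f::ToyComputedField) = (f.data = f.operand(); f.ncomputes += 1; f)

# Trigger compute! once per unique field, even if the same field appears
# under several output names, then fetch all the values.
function fetch_outputs!(outputs)
    for f in unique(values(outputs))
        compute!(f)
    end
    return Dict(name => f.data for (name, f) in outputs)
end

field = ToyComputedField(0.0, () -> 42.0, 0)
outs = fetch_outputs!(Dict("ke" => field, "ke_copy" => field))
```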
And whatever we come up with, it also has to work with
For example, the function called in the kernel for "u" (x-velocity) is
(Oceananigans.jl/src/Models/IncompressibleModels/velocity_and_tracer_tendencies.jl, lines 44 to 66 at 9b52f3f)
Maybe a cleaner solution than adding a property to
So, I've been running some simulations with a bunch of diagnostics, which make the simulation slower. I think some of the diagnostics that I'm calculating are already being calculated by Oceananigans (like derivatives). If I understand correctly, in each of those cases Oceananigans calculates that variable once to get the tendencies and then does the same calculation again for my diagnostics, which seems wasteful.
Is it possible (or desirable) to create a way for the code not to do the calculation twice if the user wants that diagnostic specifically? Maybe pass options when creating the model like:
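A hypothetical sketch of that kind of interface (the `stored_diagnostics` keyword is invented for illustration and is not an actual Oceananigans API):

```julia
# Hypothetical: `stored_diagnostics` is NOT a real Oceananigans keyword;
# it only illustrates the kind of option being proposed.
model = IncompressibleModel(grid = grid,
                            stored_diagnostics = (:∂z_u, :∂x_w))
```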
so that those fields get stored just like the velocity and the tracers do?
Thoughts?