Fusing kernels for calculating diagnostics to improve performance #1483

Closed
tomchor opened this issue Mar 16, 2021 · 8 comments
Labels
performance 🏍️ So we can get the wrong answer even faster

Comments

@tomchor
Collaborator

tomchor commented Mar 16, 2021

So, I've been running some simulations with a bunch of diagnostics, which make the simulation slower. I think some of the diagnostics that I'm calculating are already being calculated by Oceananigans (like derivatives). If I understand correctly, in each of those cases Oceananigans calculates that variable once to get the tendencies and then does the same calculation again for my diagnostics, which seems wasteful.

Is it possible (or desirable) to create a way for the code not to do the calculation twice if the user wants that diagnostic specifically? Maybe pass options when creating the model like:

model = IncompressibleModel(grid, computed_fields=(:dudz, :dvdz,))

so that those fields get stored just like the velocity and the tracers do?

Thoughts?

@francispoulin
Collaborator

The library computes the derivatives that are required to compute the tendencies, but they are not stored, since that would not be very efficient. It might be possible to get at some of these values, but I don't know what that would look like. Also, if you are not computing this field at every time step, the cost of computing it separately might not be that high, but that of course depends on the particular problem you are dealing with.

@glwagner
Member

For better or for worse, Oceananigans currently does not store intermediate terms in the computation of a PDE's right hand side (with the notable exceptions of hydrostatic pressure and eddy diffusivities). In other words, a single, sometimes large kernel that evaluates the right hand side at each grid point i, j, k is compiled for each PDE in a user's model, and there's no way to pull out intermediate steps in that calculation to use elsewhere.

It's important to note when considering optimization strategies that our computations are probably memory-limited, rather than compute-limited. In other words, we think the process of transferring data from global memory to local thread memory is a bottleneck for our computations (we can really only know this through profiling a particular application, however, since all models are different). Storing intermediate components of the tendency terms would probably create more memory accesses overall (since rather than immediately using intermediate results for subsequent calculations, we would have to send them to global memory, and then back, to complete the evaluation of a tendency) --- and thus could slow down tendency evaluations that are performed 1-3 times per time-step. For example, our best idea for speeding up tendency evaluations is to better manage memory movement using GPU shared memory (unfortunately, we haven't had the time to explore such optimization strategies).

I think there may be other ways to optimize diagnostics calculations, however.

Fusing ComputedField kernels

One possibility to speed up diagnostics is to "fuse" kernels for different ComputedField diagnostics. The kernel for a ComputedField is

@kernel function _compute!(data, operand)
i, j, k = @index(Global, NTuple)
@inbounds data[i, j, k] = operand[i, j, k]
end

where operand is an AbstractOperation. But different ComputedFields may somehow depend on the same underlying data in memory. Thus if the kernels for different ComputedFields are fused into one, we overlap memory accesses for different computations. Our computations are usually memory-limited... so it's possible this strategy could produce significant speed-ups. For example, for two ComputedFields we might have something like

function compute!(field1, field2)
    # calls _compute_two!(field1.data, field2.data, field1.operand, field2.operand)
end

and the kernel

@kernel function _compute_two!(data1, data2, operand1, operand2)
    i, j, k = @index(Global, NTuple)
    @inbounds data1[i, j, k] = operand1[i, j, k]
    @inbounds data2[i, j, k] = operand2[i, j, k]
end
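
Filling in that wrapper, a minimal sketch using KernelAbstractions directly might look like the following (assuming the event-based launch API; the device and workgroup size are placeholders, not what Oceananigans' launch utilities would actually choose):

using KernelAbstractions

function compute!(field1, field2; device=CPU(), workgroup=16)
    kernel! = _compute_two!(device, workgroup)
    event = kernel!(field1.data, field2.data, field1.operand, field2.operand,
                    ndrange=size(field1.data))
    wait(event)
    return nothing
end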

There should also be a way to generalize to the nth case using some ntuple magic.
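
For instance, a sketch of the N-field kernel that takes tuples of data and operands and relies on map over tuples to unroll at compile time (names are illustrative):

@kernel function _compute_many!(data_tuple, operand_tuple)
    i, j, k = @index(Global, NTuple)
    map(data_tuple, operand_tuple) do data, operand
        @inbounds data[i, j, k] = operand[i, j, k]
    end
end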

(Note that we tried this with tracer kernels previously without obtaining any speed up, but overlapping ComputedFields could be a more promising application of this technique.)

Using mapreduce for averaging AbstractOperations

We also might be able to apply mapreduce directly to AbstractOperations, rather than using our current strategy of storing the intermediate result of an operation in a ComputedField and then averaging the intermediate result. This technique is discussed in #1422.
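
As a rough illustration of the idea (a sketch only, not the implementation discussed in #1422; it assumes the operand supports pointwise indexing over the grid interior):

# Average an operand over Nx × Ny × Nz interior points without storing it first.
function mean_of_operand(operand, Nx, Ny, Nz)
    total = mapreduce(+, CartesianIndices((Nx, Ny, Nz))) do I
        i, j, k = Tuple(I)
        @inbounds operand[i, j, k]
    end
    return total / (Nx * Ny * Nz)
end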

@tomchor
Collaborator Author

tomchor commented Mar 16, 2021

The library computes the derivatives that are required to compute the tendencies, but they are not stored since that would not be very efficient

Oh yeah, for sure! The idea is that they would be stored only if the user specified interest in them. That way, instead of calculating twice and storing once, the code would calculate once and store once. The default behavior would still be to calculate without storing, though.

@tomchor
Collaborator Author

tomchor commented Mar 16, 2021

In other words, a single, sometimes large kernel that evaluates the right hand side at each grid point i, j, k is compiled for each PDE in a user's model, and there's no way to pull out intermediate steps in that calculation to use elsewhere.

That's interesting. I wasn't aware there was only one kernel for the whole RHS. That's pretty smart.

One possibility to speed up diagnostics is to "fuse" kernels for different ComputedField diagnostics.

That's a really cool idea. I can see how we could do that with KernelComputedFields, but I can't really see how to do it for ComputedFields. It would definitely be great if that could be easily automated.

@glwagner
Member

That's a really cool idea. I can see how we could do that with KernelComputedFields, but I can't really see how to do it for ComputedFields.

I think we understand how to write a compute! function that fuses the kernels for multiple ComputedFields or KernelComputedFields. But the harder part is designing an API that implements fusion for an output writer. We might need to add a property called something like fused_compute to both OutputWriters. For example, JLD2OutputWriter fetches output in a loop:

tc = Base.@elapsed data = Dict((name, fetch_and_convert_output(output, model, writer)) for (name, output)
                               in zip(keys(writer.outputs), values(writer.outputs)))

where fetch_and_convert_output calls fetch_output:

function fetch_output(field::AbstractField, model, field_slicer)
    compute_at!(field, time(model))
    return slice_parent(field_slicer, field)
end

which in turn calls compute_at!.

But we want to trigger one call to compute_at! for all the outputs at the same time ... I think ... so that requires some refactoring and API design.
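
One possible shape for that refactor, sketched below; fused_compute! is hypothetical, and the sketch assumes compute_at! can skip fields that are already up to date so the per-output fetch does not recompute them:

function fetch_all_outputs(writer, model)
    # Hypothetical: compute every fusable field with a single fused kernel launch.
    fused_compute!(Tuple(values(writer.outputs)), time(model))

    # Then slice and convert each output as before.
    return Dict((name, fetch_and_convert_output(output, model, writer)) for (name, output)
                in zip(keys(writer.outputs), values(writer.outputs)))
end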

@glwagner
Member

And whatever we come up with also has to work with WindowedTimeAverage, which calls fetch_output here:

integrand = fetch_output(wta.operand, model, wta.field_slicer)

@glwagner
Member

glwagner commented Mar 16, 2021

I wasn't aware there was only one kernel for the whole RHS.

For example, the function called in the kernel for "u" (x-velocity) is

@inline function u_velocity_tendency(i, j, k, grid,
                                      advection,
                                      coriolis,
                                      stokes_drift,
                                      closure,
                                      background_fields,
                                      velocities,
                                      tracers,
                                      diffusivities,
                                      forcings,
                                      hydrostatic_pressure,
                                      clock)

    return ( - div_Uu(i, j, k, grid, advection, velocities, velocities.u)
             - div_Uu(i, j, k, grid, advection, background_fields.velocities, velocities.u)
             - div_Uu(i, j, k, grid, advection, velocities, background_fields.velocities.u)
             - x_f_cross_U(i, j, k, grid, coriolis, velocities)
             - ∂xᶠᵃᵃ(i, j, k, grid, hydrostatic_pressure)
             + ∂ⱼ_2ν_Σ₁ⱼ(i, j, k, grid, clock, closure, velocities, diffusivities)
             + x_curl_Uˢ_cross_U(i, j, k, grid, stokes_drift, velocities, clock.time)
             + ∂t_uˢ(i, j, k, grid, stokes_drift, clock.time)
             + forcings.u(i, j, k, grid, clock, merge(velocities, tracers)))
end

@glwagner
Member

Maybe a cleaner solution than adding a property to OutputWriters is to create a new type for a collection of ComputedFields whose kernels are fused, something like FusedComputedFields. Then we need to generalize the output writers and time-averaging machinery to permit dispatch for correct behavior for this special case.
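
A hedged sketch of what that type might look like (the name FusedComputedFields comes from the suggestion above; compute_at! is shown for shape only, and launch_fused_kernel! is a placeholder for something like the _compute_many! kernel sketched earlier):

struct FusedComputedFields{F <: Tuple}
    fields :: F   # ComputedFields on the same grid, computed by one fused kernel
end

# Output writers and WindowedTimeAverage could dispatch on this type so that
# fetching output triggers a single fused kernel launch for all member fields.
function compute_at!(fused::FusedComputedFields, time)
    data     = map(f -> f.data, fused.fields)
    operands = map(f -> f.operand, fused.fields)
    launch_fused_kernel!(data, operands)   # placeholder for the actual launch
    return fused
end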

@glwagner glwagner changed the title Should we include some terms already calculated on Oceananigans as diagnostics? Fusing kernels for calculating diagnostics to improve performance Jan 20, 2022
@glwagner glwagner added the performance 🏍️ So we can get the wrong answer even faster label Jan 20, 2022
@tomchor tomchor closed this as completed Feb 1, 2024