Fusing kernels for calculating diagnostics to improve performance #1483

Closed
tomchor opened this issue Mar 16, 2021 · 8 comments
Labels
performance 🏍️ So we can get the wrong answer even faster

Comments

@tomchor
Collaborator

tomchor commented Mar 16, 2021

So, I've been running some simulations with a bunch of diagnostics, which make the simulation slower. I think some of the diagnostics that I'm calculating are already being calculated by Oceananigans (like derivatives). If I understand correctly, in each of those cases Oceananigans calculates that variable once to get the tendencies and then does the same calculation again for my diagnostics, which seems wasteful.

Is it possible (or desirable) to create a way for the code not to do the calculation twice if the user wants that diagnostic specifically? Maybe pass options when creating the model like:

model = IncompressibleModel(grid, computed_fields=(:dudz, :dvdz,))

so that those fields get stored just like the velocity and the tracers do?

Thoughts?

@francispoulin
Collaborator

The library computes the derivatives that are required to compute the tendencies, but they are not stored, since that would not be very efficient. It might be possible to get at some of these values, but I don't know what that would look like. Also, if you are not computing this field at every time step, the cost of computing it separately might not be that high, but that of course depends on the particular problem you are dealing with.

@glwagner
Member

For better or for worse, Oceananigans currently does not store intermediate terms in the computation of a PDE's right hand side (with the notable exceptions of hydrostatic pressure and eddy diffusivities). In other words, a single, sometimes large kernel that evaluates the right hand side at each grid point i, j, k is compiled for each PDE in a user's model, and there's no way to pull out intermediate steps in that calculation to use elsewhere.

It's important to note when considering optimization strategies that our computations are probably memory-limited, rather than compute-limited. In other words, we think the process of transferring data from global memory to local thread memory is a bottleneck for our computations (we can really only know this through profiling a particular application, however, since all models are different). Storing intermediate components of the tendency terms would probably create more memory accesses overall (since rather than immediately using intermediate results for subsequent calculations, we would have to send them to global memory, and then back, to complete the evaluation of a tendency) --- and thus could slow down tendency evaluations that are performed 1-3 times per time-step. For example, our best idea for speeding up tendency evaluations is to better manage memory movement using GPU shared memory (unfortunately, we haven't had the time to explore such optimization strategies).

I think there may be other ways to optimize diagnostics calculations, however.

Fusing ComputedField kernels

One possibility to speed up diagnostics is to "fuse" kernels for different ComputedField diagnostics. The kernel for a ComputedField is

@kernel function _compute!(data, operand)
i, j, k = @index(Global, NTuple)
@inbounds data[i, j, k] = operand[i, j, k]
end

where operand is an AbstractOperation. But different ComputedFields may somehow depend on the same underlying data in memory. Thus if the kernels for different ComputedFields are fused into one, we overlap memory accesses for different computations. Our computations are usually memory-limited... so it's possible this strategy could produce significant speed-ups. For example, for two ComputedFields we might have something like

function compute!(field1, field2)
    # calls _compute_two!(field1.data, field2.data, field1.operand, field2.operand)
end

and the kernel

@kernel function _compute_two!(data1, data2, operand1, operand2)
    i, j, k = @index(Global, NTuple)
    @inbounds data1[i, j, k] = operand1[i, j, k]
    @inbounds data2[i, j, k] = operand2[i, j, k]
end
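
Filling in that wrapper, a minimal sketch using KernelAbstractions directly might look like the following (assuming the event-based launch API; the device and workgroup size are placeholders, not what Oceananigans' launch utilities would actually choose):

using KernelAbstractions

function compute!(field1, field2; device=CPU(), workgroup=16)
    kernel! = _compute_two!(device, workgroup)
    event = kernel!(field1.data, field2.data, field1.operand, field2.operand,
                    ndrange=size(field1.data))
    wait(event)
    return nothing
end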

There should also be a way to generalize to the nth case using some ntuple magic.
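
For instance, a sketch of the N-field kernel that takes tuples of data and operands and relies on map over tuples to unroll at compile time (names are illustrative):

@kernel function _compute_many!(data_tuple, operand_tuple)
    i, j, k = @index(Global, NTuple)
    map(data_tuple, operand_tuple) do data, operand
        @inbounds data[i, j, k] = operand[i, j, k]
    end
end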

(Note that we tried this with tracer kernels previously without obtaining any speed up, but overlapping ComputedFields could be a more promising application of this technique.)

Using mapreduce for averaging AbstractOperations

We also might be able to apply mapreduce directly to AbstractOperations, rather than using our current strategy of storing the intermediate result of an operation in a ComputedField and then averaging the intermediate result. This technique is discussed in #1422.
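
As a rough illustration of the idea (a sketch only, not the implementation discussed in #1422; it assumes the operand supports pointwise indexing over the grid interior):

# Average an operand over Nx × Ny × Nz interior points without storing it first.
function mean_of_operand(operand, Nx, Ny, Nz)
    total = mapreduce(+, CartesianIndices((Nx, Ny, Nz))) do I
        i, j, k = Tuple(I)
        @inbounds operand[i, j, k]
    end
    return total / (Nx * Ny * Nz)
end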

@tomchor
Collaborator Author

tomchor commented Mar 16, 2021

The library computes the derivatives that are required to compute the tendencies, but they are not stored since that would not be very efficient

Oh yeah, for sure! The idea is that they would be stored only if the user specified interest in them. That way, instead of calculating twice and storing once, the code would calculate once and store once. The default behavior would still be to calculate without storing, though.

@tomchor
Collaborator Author

tomchor commented Mar 16, 2021

In other words, a single, sometimes large kernel that evaluates the right hand side at each grid point i, j, k is compiled for each PDE in a user's model, and there's no way to pull out intermediate steps in that calculation to use elsewhere.

That's interesting. I wasn't aware there was only one kernel for the whole RHS. That's pretty smart.

One possibility to speed up diagnostics is to "fuse" kernels for different ComputedField diagnostics.

That's a really cool idea. I can see how we could do that with KernelComputedFields, but I can't really see how to do it for ComputedFields. It would definitely be great if that could be easily automated.

@glwagner
Member

That's a really cool idea. I can see how we could do that with KernelComputedFields, but I can't really see how to do it for ComputedFields.

I think we understand how to write a compute! function that fuses the kernels for multiple ComputedFields or KernelComputedFields. But the harder part is designing an API that implements fusion for an output writer. We might need to add a property called something like fused_compute to both OutputWriters. For example, JLD2OutputWriter fetches output in a loop:

tc = Base.@elapsed data = Dict((name, fetch_and_convert_output(output, model, writer)) for (name, output)
                               in zip(keys(writer.outputs), values(writer.outputs)))

where fetch_and_convert_output calls fetch_output:

function fetch_output(field::AbstractField, model, field_slicer)
    compute_at!(field, time(model))
    return slice_parent(field_slicer, field)
end

which in turn calls compute_at!.

But we want to trigger one call to compute_at! for all the outputs at the same time ... I think ... so that requires some refactoring and API design.
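
One possible shape for that refactor, sketched below; fused_compute! is hypothetical, and the sketch assumes compute_at! can skip fields that are already up to date so the per-output fetch does not recompute them:

function fetch_all_outputs(writer, model)
    # Hypothetical: compute every fusable field with a single fused kernel launch.
    fused_compute!(Tuple(values(writer.outputs)), time(model))

    # Then slice and convert each output as before.
    return Dict((name, fetch_and_convert_output(output, model, writer)) for (name, output)
                in zip(keys(writer.outputs), values(writer.outputs)))
end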

@glwagner
Member

And whatever we come up with also has to work with WindowedTimeAverage, which calls fetch_output here:

integrand = fetch_output(wta.operand, model, wta.field_slicer)

@glwagner
Member

glwagner commented Mar 16, 2021

I wasn't aware there was only one kernel for the whole RHS.

For example, the function called in the kernel for "u" (x-velocity) is

@inline function u_velocity_tendency(i, j, k, grid,
                                      advection,
                                      coriolis,
                                      stokes_drift,
                                      closure,
                                      background_fields,
                                      velocities,
                                      tracers,
                                      diffusivities,
                                      forcings,
                                      hydrostatic_pressure,
                                      clock)

    return ( - div_Uu(i, j, k, grid, advection, velocities, velocities.u)
             - div_Uu(i, j, k, grid, advection, background_fields.velocities, velocities.u)
             - div_Uu(i, j, k, grid, advection, velocities, background_fields.velocities.u)
             - x_f_cross_U(i, j, k, grid, coriolis, velocities)
             - ∂xᶠᵃᵃ(i, j, k, grid, hydrostatic_pressure)
             + ∂ⱼ_2ν_Σ₁ⱼ(i, j, k, grid, clock, closure, velocities, diffusivities)
             + x_curl_Uˢ_cross_U(i, j, k, grid, stokes_drift, velocities, clock.time)
             + ∂t_uˢ(i, j, k, grid, stokes_drift, clock.time)
             + forcings.u(i, j, k, grid, clock, merge(velocities, tracers)))
end

@glwagner
Member

Maybe a cleaner solution than adding a property to OutputWriters is to create a new type for a collection of ComputedFields whose kernels are fused, something like FusedComputedFields. Then we need to generalize the output writers and time-averaging machinery to permit dispatch for correct behavior for this special case.
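
A hedged sketch of what that type might look like (the name FusedComputedFields comes from the suggestion above; compute_at! is shown for shape only, and launch_fused_kernel! is a placeholder for something like the _compute_many! kernel sketched earlier):

struct FusedComputedFields{F <: Tuple}
    fields :: F   # ComputedFields on the same grid, computed by one fused kernel
end

# Output writers and WindowedTimeAverage could dispatch on this type so that
# fetching output triggers a single fused kernel launch for all member fields.
function compute_at!(fused::FusedComputedFields, time)
    data     = map(f -> f.data, fused.fields)
    operands = map(f -> f.operand, fused.fields)
    launch_fused_kernel!(data, operands)   # placeholder for the actual launch
    return fused
end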

@glwagner glwagner changed the title Should we include some terms already calculated on Oceananigans as diagnostics? Fusing kernels for calculating diagnostics to improve performance Jan 20, 2022
@glwagner glwagner added the performance 🏍️ So we can get the wrong answer even faster label Jan 20, 2022
@tomchor tomchor closed this as completed Feb 1, 2024