Accelerating MultiLayerQG on GPUs #373
…lso explicitly import CPU, GPU from FourierFlows as there were conflicts with KernelAbstractions.
We should bump a patch release.
src/multilayerqg.jl
# if dev == GPU() && nlayers > 2
# @warn """MultiLayerQG module is not optimized on the GPU yet for configurations with
# 3 fluid layers or more!
Why comment these out? Delete?
src/multilayerqg.jl
S, nlayers = params.S, params.nlayers
kernel!(qh, ψh, S, nlayers)

# This will ensure that no other operations occur until the kernel has finished
- # This will ensure that no other operations occur until the kernel has finished
+ # Ensure that no other operations occur until the kernel has finished
src/multilayerqg.jl
"""
    @kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)

Kernel for GPU acceleration of streamfunction from PV calculation, i.e., invert the PV to obtain
the Fourier transform of the streamfunction `ψh` in each layer from `qh` using `ψh = params.S⁻¹ qh`.
"""
@kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nlayers

        @inbounds ψh[i, j, k] = 0

        @unroll for m = 1:nlayers
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end

    end
end
Are the two kernel functions identical except for the order of their arguments?
Looks like it, yes. It'd probably make sense to write one kernel in the more general form.
I think so! Would make the code more robust.
src/multilayerqg.jl
    @unroll for k = 1:nlayers

        @inbounds qh[i, j, k] = 0

        @unroll for m = 1:nlayers
The `@unroll` doesn't do anything unless `nlayers` is known at compile time (this requires using `Val`, but I don't know if it will speed anything up... it might).
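To make the `Val` point concrete, here is a minimal plain-Julia sketch (no GPU or KernelAbstractions involved). The function names `rowsum_runtime` and `rowsum_static` are made up for illustration. In the first version the loop bound is an ordinary `Int` whose value the compiler cannot see; in the second it is a type parameter, so Julia compiles a separate specialization per layer count and the bound is a compile-time constant. Whether that actually leads to unrolling or a speed-up is not something this sketch demonstrates.

```julia
# Loop bound passed as a runtime value: the compiler only knows `n` is an Int.
function rowsum_runtime(M, b, k, n::Int)
    acc = zero(eltype(b))
    for m in 1:n
        acc += M[k, m] * b[m]
    end
    return acc
end

# Loop bound passed through the type domain: `N` is a compile-time constant
# inside each specialization of this method.
function rowsum_static(M, b, k, ::Val{N}) where N
    acc = zero(eltype(b))
    for m in 1:N
        acc += M[k, m] * b[m]
    end
    return acc
end

M = [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 9.0]
b = [1.0, 1.0, 1.0]

r_runtime = rowsum_runtime(M, b, 2, 3)      # row 2 of M dotted with b
r_static  = rowsum_static(M, b, 2, Val(3))  # same result, bound known statically
```

Both calls compute the same number; the only difference is where the loop bound lives.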
src/multilayerqg.jl
Kernel for GPU acceleration of PV from streamfunction calculation, i.e., obtaining the Fourier
transform of the PV from the streamfunction `ψh` in each layer using `qh = params.S * ψh`.
"""
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
- @kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
+ @kernel function pvfromstreamfunction_kernel!(qh, ψh, S, ::Val{nlayers}) where nlayers
For `@unroll` you have to do this, and also pass `Val(nlayers)` rather than `nlayers` into the kernel when launching it. I don't know if it will speed things up though. It might.
@kernel function PVinversion_kernel!(a, b, M, ::Val{nlayers}) where nlayers
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nlayers

        @inbounds a[i, j, k] = 0

        @unroll for m = 1:nlayers
            @inbounds a[i, j, k] += M[i, j][k, m] * b[i, j, m]
        end

    end
end
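For anyone wanting to try a kernel of this shape outside FourierFlows, here is a sketch of launching it on the CPU backend and checking it against a plain matrix-vector product. It assumes KernelAbstractions 0.9 or later (where `get_backend` and `KernelAbstractions.synchronize(backend)` are available). The kernel name `apply_layer_matrix!`, the array sizes, the `(16, 16)` workgroup, and the use of ordinary `Matrix` entries for `M` are all illustrative choices, not what the PR itself does.

```julia
using KernelAbstractions
using KernelAbstractions.Extras: @unroll

# Same structure as the generalized kernel: a[i, j, :] = M[i, j] * b[i, j, :].
@kernel function apply_layer_matrix!(a, b, M, ::Val{nlayers}) where nlayers
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nlayers
        @inbounds a[i, j, k] = 0
        @unroll for m = 1:nlayers
            @inbounds a[i, j, k] += M[i, j][k, m] * b[i, j, m]
        end
    end
end

nx, ny, nlayers = 32, 32, 3
b = rand(nx, ny, nlayers)
a = zeros(nx, ny, nlayers)
M = [rand(nlayers, nlayers) for i in 1:nx, j in 1:ny]  # one matrix per (i, j)

backend = get_backend(a)                          # CPU backend for a plain Array
kernel! = apply_layer_matrix!(backend, (16, 16))  # workgroup size is an assumption
kernel!(a, b, M, Val(nlayers); ndrange = (nx, ny))
KernelAbstractions.synchronize(backend)           # wait until the kernel has finished

# Reference result with ordinary linear algebra.
a_ref = similar(a)
for j in 1:ny, i in 1:nx
    a_ref[i, j, :] = M[i, j] * b[i, j, :]
end

matches = a ≈ a_ref
```

Note that `Val(nlayers)` is passed at launch, matching the `::Val{nlayers}` argument in the kernel signature.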
I rewrote the kernel in a more general form and added `Val`. The code has sped up slightly, but the 16-thread CPU still outperforms the GPU. Compare these benchmarks to what I showed here
- GPU
  nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
    668.165 ms (2533 allocations: 191.19 KiB)
- CPU with 16 threads
  nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, CPU(); nx); @btime stepforward!(prob)
    444.419 ms (113 allocations: 5.61 KiB)
Are you sure you are timing the GPU properly?
julia> nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @benchmark CUDA.@sync CUDA.@time stepforward!(prob)
0.681338 seconds (57.95 k CPU allocations: 2.419 MiB) (18 GPU allocations: 345.250 MiB, 0.04% memmgmt time)
0.678481 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676472 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.694825 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678072 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677693 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678237 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.677198 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676980 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676189 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678326 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677010 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677142 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.676321 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.678115 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677461 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678255 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677168 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677529 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 676.718 ms … 678.645 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 677.681 ms ┊ GC (median): 0.00%
Time (mean ± σ): 677.765 ms ± 618.941 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ██ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
677 ms Histogram: frequency by time 679 ms <
Memory estimate: 199.41 KiB, allocs estimate: 2660.
Seems to be roughly the same as above? Unless I'm misunderstanding what this benchmark is doing...
That's disappointing...
The first thing I would try to figure out is whether this function is indeed the bottleneck. It might be better, in fact, to simply benchmark this function in isolation.
I don't know if it matters but I saw the workgroup is 8, 8. Usually we use 16, 16.
I would also check 3 layers first perhaps. The double inner loop gets slower with more layers, and perhaps the computational costs scale differently on CPU vs GPU. That might give a clue.
The loop is over `k, m`, which are the slowest (last) indices in `a` and `b`. That could be an issue. If you can benchmark this operation in isolation, then you can experiment with new arrays where `k, m` are the fastest / first indices in `a` and `b`. This experiment would tell you what kind of slowdown that's incurring.
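A sketch of how that layout experiment could be set up in isolation, in plain Julia on the CPU. The function names and sizes are hypothetical. It only shows how to build the two layouts, confirm they give the same numbers, and time them; it says nothing about which layout a GPU prefers, since there the access pattern across neighbouring threads also matters.

```julia
# Layer index last, as in the kernel: a[i, j, k], b[i, j, m].
function apply_layers_last!(a, b, M)
    nx, ny, nlayers = size(a)
    for j in 1:ny, i in 1:nx
        Mij = M[i, j]
        for k in 1:nlayers
            acc = zero(eltype(a))
            for m in 1:nlayers
                acc += Mij[k, m] * b[i, j, m]
            end
            a[i, j, k] = acc
        end
    end
    return a
end

# Layer index first: a[k, i, j], b[m, i, j].
function apply_layers_first!(a, b, M)
    nlayers, nx, ny = size(a)
    for j in 1:ny, i in 1:nx
        Mij = M[i, j]
        for k in 1:nlayers
            acc = zero(eltype(a))
            for m in 1:nlayers
                acc += Mij[k, m] * b[m, i, j]
            end
            a[k, i, j] = acc
        end
    end
    return a
end

nx, ny, nlayers = 64, 64, 12
M = [rand(nlayers, nlayers) for i in 1:nx, j in 1:ny]
b_last  = rand(nx, ny, nlayers)
b_first = permutedims(b_last, (3, 1, 2))   # same data, layer index moved to the front
a_last  = zeros(nx, ny, nlayers)
a_first = zeros(nlayers, nx, ny)

# First calls compile the functions and let us check the two layouts agree.
apply_layers_last!(a_last, b_last, M)
apply_layers_first!(a_first, b_first, M)
same_result = permutedims(a_first, (2, 3, 1)) ≈ a_last

# Second calls are timed; use BenchmarkTools for anything more careful.
t_last  = @elapsed apply_layers_last!(a_last, b_last, M)
t_first = @elapsed apply_layers_first!(a_first, b_first, M)
println("layers last: ", t_last, " s, layers first: ", t_first, " s")
```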
Maybe the `@unroll` is not working for some reason. When I've seen stuff like this before, people have used matrix / linear algebra via `StaticArrays`, rather than explicit loops as used here. If you are just testing the kernel in isolation, transforming `a`, `b` to arrays of `StaticVector` could also be something to experiment with.
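A rough idea of what the `StaticArrays` variant could look like in an isolated CPU test. This assumes the StaticArrays package is installed; the sizes and values are made up so the result can be checked by hand, and nothing here is taken from the PR. Each grid point holds an `SVector` of layer values and an `SMatrix` coupling the layers, so the double loop becomes one small matrix-vector product per point.

```julia
using StaticArrays

nx, ny, nlayers = 4, 4, 2

# Start from the (nx, ny, nlayers) layout used in the kernel.
b3 = ones(nx, ny, nlayers)

# Repack as an (nx, ny) array whose elements are length-`nlayers` static vectors.
b = [SVector(ntuple(k -> b3[i, j, k], Val(nlayers))) for i in 1:nx, j in 1:ny]

# One small static matrix per grid point (the same one everywhere, for simplicity).
# SMatrix is filled column-major, so this is [1 3; 2 4].
Mij = SMatrix{2, 2}(1.0, 2.0, 3.0, 4.0)
M = fill(Mij, nx, ny)

# Broadcasting `*` over the outer arrays does a static mat-vec at each (i, j).
a = M .* b

first_point = a[1, 1]   # [1 3; 2 4] * [1, 1]
```

Whether this beats the explicit loops on a GPU would need to be measured.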
In general with performance engineering, one has to really be persistent and creative and test test test. To make this easier you want to extract out the function you're trying to optimize and work with a very idealized test case that also allows you to change the data structure (rather than working with a FourierFlows script). Think of this as research. If you find that you need to rearrange memory differently, then we can come back to GeophysicalFlows and see whether that is feasible or not.
@mpudig any idea why CI breaks?
This pull request addresses accelerating the PV-streamfunction inversion in `MultiLayerQG` for an arbitrary number of layers on a GPU, as discussed here. I've used `KernelAbstractions` to optimize what used to be a loop over all (x, y). There is still a loop over the number of layers squared. These changes greatly accelerate the code for certain set-ups with more than two layers based on some simple tests (seen here).