Add n-dimensional repeat #400
Conversation
Nice, thanks for the PR! Could you do a performance comparison to a broadcast-based implementation, as suggested in #357 (comment)?
Bump?
Apologies, it's been a busy semester. I'll take another run at it over the next few weeks.
For benchmarking, I'm using the following benchmark suite on an NVIDIA A100. I haven't tested it on other GPUs yet, but I'd expect similar results.
```julia
using BenchmarkTools
using CUDA
using GPUArrays

const suite = BenchmarkGroup()

# Build a benchmark that times `f` on a freshly allocated CUDA array of the given
# element type and size, synchronizing the GPU and freeing the memory afterwards.
macro benchmark_repeat(f, T, dims)
    quote
        @benchmarkable CUDA.@sync($f) setup=(x = CUDA.rand($T, $(dims)...)) teardown=(CUDA.unsafe_free!(x); CUDA.reclaim())
    end
end
# Control the size of the CUDA Array to be benchmarked
n = 8
# Benchmark `repeat(x, inner=(n, 1, 1))`
s = suite["repeat-inner-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)
s = suite["repeat-inner-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)
# Benchmark `repeat(x, outer=(n, 1, 1))`
s = suite["repeat-outer-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)
s = suite["repeat-outer-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)
```
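The suite can then be tuned and run with the usual BenchmarkTools calls, something like:

```julia
using Statistics  # for summarising trials with `median`

tune!(suite)                          # pick evaluation/sample counts per benchmark
results = run(suite, verbose = true)  # run every group in the suite

median(results)                       # median time/allocations for each benchmark
```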
Results

Based on these numbers it looks like broadcasting is significantly slower than either of the existing implementations (89fc61f or 92141a7). And while parallelizing over the source (which minimizes reads) is faster in most cases, it is slower in others, so I've added a heuristic that parallelizes over the source unless dispatching over the destination is expected to be faster. As using the heuristic gave the best performance, I've updated this PR to use it. It's still a bit slower than the other cases (I'm guessing the division isn't helping), but hopefully it's "good enough".
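For intuition, here's a rough CPU-side sketch of the two strategies and the kind of heuristic I mean; the function names and the exact condition are illustrative only, not the kernels added in this PR:

```julia
# Illustrative CPU analogues of the two GPU strategies (not this PR's kernels).

# One "thread" per source element: 1 read per input, `prod(inner)` writes per input.
function repeat_inner_over_src!(out, x, inner)
    for I in CartesianIndices(x)
        v = x[I]
        base = CartesianIndex((Tuple(I) .- 1) .* inner)
        for J in CartesianIndices(inner)
            out[base + J] = v
        end
    end
    return out
end

# One "thread" per destination element: 1 read + 1 write per output element, at the
# cost of an integer division per element to locate the source index.
function repeat_inner_over_dst!(out, x, inner)
    for I in CartesianIndices(out)
        src = CartesianIndex(cld.(Tuple(I), inner))
        out[I] = x[src]
    end
    return out
end

# Hypothetical heuristic: an inner repeat along the first (contiguous) dimension
# hurts the source-parallel scheme, so fall back to destination-parallel dispatch.
function repeat_inner_sketch(x, inner)
    out = similar(x, size(x) .* inner)
    return inner[1] > 1 ? repeat_inner_over_dst!(out, x, inner) :
                          repeat_inner_over_src!(out, x, inner)
end
```

For example, `repeat_inner_sketch(reshape(collect(1:4), 2, 2), (2, 1))` matches `repeat(reshape(collect(1:4), 2, 2), inner=(2, 1))`.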
`gpu_call` signature changed by JuliaGPU#367 to rename `total_threads` to `elements`.
Dispatching over the output means 1 read + 1 write per output element, whereas dispatching over the input means 1 read per input element + 1 write per output element.
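As a rough illustration with the benchmark sizes above (back-of-the-envelope numbers, not measurements):

```julia
# Global-memory traffic for a 2^8 x 2^8 x 2^8 Float32 source repeated by a factor of 64.
nsrc  = 2^8 * 2^8 * 2^8        # source elements (2^24)
ndst  = 64 * nsrc              # destination elements (2^30)
bytes = sizeof(Float32)

dst_dispatch = (ndst + ndst) * bytes   # 1 read + 1 write per output          ≈ 8.0 GiB
src_dispatch = (nsrc + ndst) * bytes   # 1 read per input + 1 write per output ≈ 4.1 GiB
```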
As `cartesianidx` returns early if the index is out of bounds.
In this case, dispatching threads over the destination ends up faster.
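To make that concrete, here is a hedged sketch of a destination-parallel kernel and its launch, assuming GPUArrays' `gpu_call`/`@cartesianidx` kernel API; the kernel body and argument list are illustrative, not the exact code in this PR:

```julia
using GPUArrays

# Illustrative destination-parallel kernel (not this PR's kernel): one thread per
# destination element. `@cartesianidx` returns early once the thread index runs
# past `length(dst)`, so launching a few extra threads is harmless.
function tile_kernel!(ctx, dst, src)
    I = @cartesianidx dst
    # map the destination index back onto the source tile (an `outer`-style repeat)
    @inbounds dst[I] = src[CartesianIndex(mod1.(Tuple(I), size(src)))]
    return
end

# Post-JuliaGPU#367 launch: the kernel is sized with `elements` (was `total_threads`).
# gpu_call(tile_kernel!, dst, src; elements = length(dst))
```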
Thanks, this looks really good! I'd add a comment to explain the heuristic, but otherwise this is good to go.
Added comments detailing the heuristic and why it works. Expanded the test suite to ensure both `repeat_inner_src_kernel!` and `repeat_inner_dst_kernel!` are called.
I've added comments about the heuristic and a few more tests to ensure both kernels get called. Should be good to merge! Thanks for reviewing and for your help along the way!
Thank you for the PR!
This PR builds on the work of @torfjelde in #357 and the associated comments.
Notable changes:

- n-dimensional `repeat` (inner and outer) for Julia 1.6 and higher (see the usage sketch below)
- Updated the #357 code to use `elements` instead of `total_threads`, following the `gpu_call` signature change in JuliaGPU#367
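For reference, a quick usage sketch (array sizes are arbitrary; CUDA is used only because the benchmarks above use it):

```julia
using CUDA

x = CUDA.rand(Float32, 4, 4, 4)

# inner: duplicate each element along the chosen dimensions
y = repeat(x, inner=(2, 1, 1))                    # size (8, 4, 4)

# outer: tile the whole array
z = repeat(x, outer=(1, 1, 3))                    # size (4, 4, 12)

# inner and outer combine, matching Base.repeat semantics
w = repeat(x, inner=(2, 1, 1), outer=(1, 1, 3))   # size (8, 4, 12)
```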
Best,
Alex