
Add n-dimensional repeat #400

Merged · 11 commits · Jul 6, 2022

Conversation

awadell1 (Contributor)

This PR builds on the work of @torfjelde in #357 and the associated comments.

Notable changes:

  • Rebased on top of the v8.3.1 commit
  • Launch threads over the input array instead of the output array (32a6c4d); see the sketch below
    • Eliminates the duplicate reads per input element
    • Locally all tests pass
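
For context, here is a minimal CPU-side sketch of the two strategies for `repeat(x; outer=...)`. The function names are hypothetical and this is not the PR's kernel code, but it shows why parallelizing over the destination re-reads each source element `prod(outer)` times, while parallelizing over the source reads it exactly once:

```julia
# Hypothetical CPU sketch of the two parallelization strategies for `outer` repeats.
# On the GPU, each iteration of the outer `for` loop corresponds to one thread.

# One "thread" per destination element: each source element is read prod(outer) times.
function repeat_outer_over_dst!(dst, src, outer)
    for odx in CartesianIndices(dst)
        idx = CartesianIndex(ntuple(i -> (odx[i] - 1) % size(src, i) + 1, ndims(src)))
        dst[odx] = src[idx]
    end
    return dst
end

# One "thread" per source element: each source element is read once and written to
# all prod(outer) of its destination slots.
function repeat_outer_over_src!(dst, src, outer)
    for idx in CartesianIndices(src)
        val = src[idx]
        for rdx in CartesianIndices(outer)
            odx = CartesianIndex(ntuple(i -> (rdx[i] - 1) * size(src, i) + idx[i], ndims(src)))
            dst[odx] = val
        end
    end
    return dst
end

# Both agree with Base.repeat:
x = rand(Float32, 4, 4)
@assert repeat_outer_over_src!(similar(x, 8, 4), x, (2, 1)) == repeat(x, outer=(2, 1))
@assert repeat_outer_over_dst!(similar(x, 8, 4), x, (2, 1)) == repeat(x, outer=(2, 1))
```

On a GPU the trade-off is the same, except that each thread's memory-access pattern also determines whether reads and writes coalesce, which is what the benchmarks later in this thread probe.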

Best,

Alex

@maleadt (Member) commented Mar 22, 2022

Nice, thanks for the PR! Could you do a performance comparison to a broadcast-based implementation, as suggested in #357 (comment)?

@maleadt (Member) commented May 18, 2022

Bump?

@awadell1 (Contributor, Author)

Apologies, it's been a busy semester. I'll take another run at it over the next few weeks.

@awadell1 (Contributor, Author) commented Jul 1, 2022

For benchmarking, I'm using the following benchmark suite on an NVIDIA A100. I haven't tested other GPUs yet, but I'd expect similar results.

If it's of interest, I can include these benchmarks in the current PR (see awadell1@427a9bd).

```julia
using BenchmarkTools
using CUDA
using GPUArrays

const suite = BenchmarkGroup()

macro benchmark_repeat(f, T, dims)
    quote
        @benchmarkable CUDA.@sync($f) setup=(x = CUDA.rand($T, $(dims)...)) teardown=(CUDA.unsafe_free!(x); CUDA.reclaim())
    end
end

# Control the size of the CUDA Array to be benchmarked
n = 8

# Benchmark `repeat(x, inner=(n, 1, 1))`
s = suite["repeat-inner-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)

s = suite["repeat-inner-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)

# Benchmark `repeat(x, outer=(n, 1, 1))`
s = suite["repeat-outer-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)

s = suite["repeat-outer-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)
```

Results

tl;dr: Broadcasting is much slower. Parallelizing over the source array is faster than parallelizing over the destination, except for repeat(x; inner=(n, 1...)), where avoiding strided memory access makes the destination-parallel kernel worth it.

| Benchmark         | n   | Over Source (89fc61f) | Over Destination (92141a7) | Broadcast (7b82b02) | Heuristic (d4c9994) |
|-------------------|-----|-----------------------|----------------------------|---------------------|---------------------|
| repeat-inner-row  | 64  | 39.295 ms             | 23.838 ms                  | 42.778 ms           | 24.681 ms           |
| repeat-inner-row  | 128 | 85.378 ms             | 32.399 ms                  | 69.956 ms           | 32.246 ms           |
| repeat-inner-row  | 256 | 234.905 ms            | 64.339 ms                  | 141.257 ms          | 64.035 ms           |
| repeat-outer-row  | 64  | 4.278 ms              | 16.766 ms                  | 35.163 ms           | 4.253 ms            |
| repeat-outer-row  | 128 | 8.119 ms              | 33.070 ms                  | 70.229 ms           | 8.046 ms            |
| repeat-outer-row  | 256 | 15.766 ms             | 73.503 ms                  | 140.882 ms          | 15.681 ms           |
| repeat-inner-col  | 64  | 4.730 ms              | 16.362 ms                  | 35.214 ms           | 4.747 ms            |
| repeat-inner-col  | 128 | 9.087 ms              | 32.420 ms                  | 70.037 ms           | 9.062 ms            |
| repeat-inner-col  | 256 | 17.711 ms             | 64.519 ms                  | 140.536 ms          | 17.712 ms           |
| repeat-outer-col  | 64  | 5.926 ms              | 17.314 ms                  | 37.021 ms           | 5.826 ms            |
| repeat-outer-col  | 128 | 11.365 ms             | 34.292 ms                  | 73.612 ms           | 11.274 ms           |
| repeat-outer-col  | 256 | 22.312 ms             | 68.370 ms                  | 146.976 ms          | 22.201 ms           |

Here n is the number of repeats performed, and Benchmark names one of the BenchmarkGroups in the suite above; e.g. suite[["repeat-inner-row", 64]] is the benchmark for the first row shown here. As my system is not perfectly isolated, I'm using the minimum benchmark time as my estimator of choice (more info on estimators).
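
For reference, a minimal sketch of how a suite like this can be run and reduced to minimum times, using standard BenchmarkTools calls (`tune!`, `run`, `leaves`, `minimum`):

```julia
tune!(suite)                        # pick evaluation/sample counts per benchmark
results = run(suite; verbose=true)

# Report the minimum time of each trial, converted from ns to ms
for (key, trial) in leaves(results)
    println(join(key, "/"), ": ", minimum(trial).time / 1e6, " ms")
end
```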

Based on these numbers, broadcasting is significantly slower than either existing implementation (89fc61f or 92141a7). And while parallelizing over the source (which minimizes reads) is faster in most cases, it is slower for repeat(x; inner=(n, 1...)), i.e. the repeat-inner-row benchmarks.

Given that, I've added a heuristic: parallelize over the source unless argmax(inner) == 1. This seems to give the best of both worlds. I'm not sure it's the best possible heuristic, but it works well enough.
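
As a rough illustration (CPU-side, with a hypothetical function name rather than the PR's actual kernels), the heuristic amounts to something like:

```julia
# Sketch of the heuristic for `inner` repeats; each loop iteration stands in for a GPU thread.
function sketch_repeat_inner(x::AbstractArray, inner::Dims)
    dst = similar(x, size(x) .* inner)
    if argmax(inner) == 1
        # Largest repeat is along dim 1: parallelizing over the destination keeps writes
        # contiguous, which outweighs re-reading each source element inner[1] times.
        for odx in CartesianIndices(dst)
            idx = CartesianIndex(ntuple(i -> (odx[i] - 1) ÷ inner[i] + 1, ndims(x)))
            dst[odx] = x[idx]
        end
    else
        # Otherwise parallelize over the source: each element is read once and written
        # to its block of inner destination slots.
        for idx in CartesianIndices(x)
            val = x[idx]
            for rdx in CartesianIndices(inner)
                odx = CartesianIndex(ntuple(i -> (idx[i] - 1) * inner[i] + rdx[i], ndims(x)))
                dst[odx] = val
            end
        end
    end
    return dst
end

x = rand(4, 4)
@assert sketch_repeat_inner(x, (3, 1)) == repeat(x, inner=(3, 1))  # destination-parallel branch
@assert sketch_repeat_inner(x, (1, 3)) == repeat(x, inner=(1, 3))  # source-parallel branch
```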

Update

Since the heuristic gave the best performance, I've updated this PR to use it. It's still a bit slower than the dedicated kernels in some cases (I'm guessing the division isn't helping), but hopefully it's "good enough".

awadell1 requested a review from maleadt on July 1, 2022
torfjelde and others added 10 commits (July 4, 2022):

  • gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"
  • Dispatching over the output size results in 1 read + 1 write per output element, vs. dispatching over the input, which gives 1 read per input element + 1 write per output element
  • As cartesianidx will return early if it's out of bounds
  • In this case dispatching threads over the destination ends up faster

@maleadt (Member) left a comment

Thanks, this looks really good! I'd add a comment to explain the heuristic, but otherwise this is good to go.

Added comments detailing heuristic and why it works. Expanded testsuite
to ensure both `repeat_inner_src_kernel!` and `repeat_inner_dst_kernel`
are called.
@awadell1 (Contributor, Author) commented Jul 5, 2022

I've added comments about the heuristic and a few more tests to ensure both kernels get called. Should be good to merge!

Thanks for reviewing and for your help along the way!

@maleadt (Member) commented Jul 6, 2022

Thank you for the PR!
