
Add n-dimensional repeat #400

Merged · 11 commits · Jul 6, 2022

Conversation

awadell1 (Contributor)

This PR builds on the work of @torfjelde in #357 and the associated comments.

Notable changes:

  • Rebased on top of the v8.3.1 commit
  • Launch threads over the input array instead of the output array (32a6c4d); see the sketch below
    • Eliminates the duplicate reads per input element
    • Locally all tests pass
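
For context, here is a minimal CPU-side sketch of the two strategies for `repeat(x; outer=...)`. The function names are hypothetical and this is not the PR's kernel code, but it shows why parallelizing over the destination re-reads each source element `prod(outer)` times, while parallelizing over the source reads it exactly once:

```julia
# Hypothetical CPU sketch of the two parallelization strategies for `outer` repeats.
# On the GPU, each iteration of the outer `for` loop corresponds to one thread.

# One "thread" per destination element: each source element is read prod(outer) times.
function repeat_outer_over_dst!(dst, src, outer)
    for odx in CartesianIndices(dst)
        idx = CartesianIndex(ntuple(i -> (odx[i] - 1) % size(src, i) + 1, ndims(src)))
        dst[odx] = src[idx]
    end
    return dst
end

# One "thread" per source element: each source element is read once and written to
# all prod(outer) of its destination slots.
function repeat_outer_over_src!(dst, src, outer)
    for idx in CartesianIndices(src)
        val = src[idx]
        for rdx in CartesianIndices(outer)
            odx = CartesianIndex(ntuple(i -> (rdx[i] - 1) * size(src, i) + idx[i], ndims(src)))
            dst[odx] = val
        end
    end
    return dst
end

# Both agree with Base.repeat:
x = rand(Float32, 4, 4)
@assert repeat_outer_over_src!(similar(x, 8, 4), x, (2, 1)) == repeat(x, outer=(2, 1))
@assert repeat_outer_over_dst!(similar(x, 8, 4), x, (2, 1)) == repeat(x, outer=(2, 1))
```

On a GPU the trade-off is the same, except that each thread's memory-access pattern also determines whether reads and writes coalesce, which is what the benchmarks later in this thread probe.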

Best,

Alex

@maleadt (Member) commented Mar 22, 2022

Nice, thanks for the PR! Could you do a performance comparison to a broadcast-based implementation, as suggested in #357 (comment)?

@maleadt (Member) commented May 18, 2022

Bump?

@awadell1 (Contributor, Author)

Apologies, it's been a busy semester. I'll take another run at it over the next few weeks.

@awadell1 (Contributor, Author) commented Jul 1, 2022

For benchmarking, I'm using the following benchmark suite on an NVIDIA A100. I haven't tested other GPUs yet, but I'd expect similar results.

If it's of interest, I can include these benchmarks in the current PR (see awadell1@427a9bd).

```julia
using BenchmarkTools
using CUDA
using GPUArrays

const suite = BenchmarkGroup()

macro benchmark_repeat(f, T, dims)
    quote
        @benchmarkable CUDA.@sync($f) setup=(x = CUDA.rand($T, $(dims)...)) teardown=(CUDA.unsafe_free!(x); CUDA.reclaim())
    end
end

# Control the size of the CUDA Array to be benchmarked
n = 8

# Benchmark `repeat(x, inner=(n, 1, 1))`
s = suite["repeat-inner-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)

s = suite["repeat-inner-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, inner=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, inner=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, inner=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)

# Benchmark `repeat(x, outer=(n, 1, 1))`
s = suite["repeat-outer-row"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(64 , 1, 1)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(128, 1, 1)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(256, 1, 1)) Float32 (2^n, 2^n, 2^n)

s = suite["repeat-outer-col"] = BenchmarkGroup()
s[64] = @benchmark_repeat repeat(x, outer=(1, 1, 64)) Float32 (2^n, 2^n, 2^n)
s[128] = @benchmark_repeat repeat(x, outer=(1, 1, 128)) Float32 (2^n, 2^n, 2^n)
s[256] = @benchmark_repeat repeat(x, outer=(1, 1, 256)) Float32 (2^n, 2^n, 2^n)
```

Results

tl;dr: Broadcasting is much slower. Parallelizing over the source array is faster than parallelizing over the destination, except for repeat(x; inner=(n, 1...)), where avoiding strided memory access makes the destination-parallel kernel worth it.

| Benchmark         | n   | Over Source (89fc61f) | Over Destination (92141a7) | Broadcast (7b82b02) | Heuristic (d4c9994) |
|-------------------|-----|-----------------------|----------------------------|---------------------|---------------------|
| repeat-inner-row  | 64  | 39.295 ms             | 23.838 ms                  | 42.778 ms           | 24.681 ms           |
| repeat-inner-row  | 128 | 85.378 ms             | 32.399 ms                  | 69.956 ms           | 32.246 ms           |
| repeat-inner-row  | 256 | 234.905 ms            | 64.339 ms                  | 141.257 ms          | 64.035 ms           |
| repeat-outer-row  | 64  | 4.278 ms              | 16.766 ms                  | 35.163 ms           | 4.253 ms            |
| repeat-outer-row  | 128 | 8.119 ms              | 33.070 ms                  | 70.229 ms           | 8.046 ms            |
| repeat-outer-row  | 256 | 15.766 ms             | 73.503 ms                  | 140.882 ms          | 15.681 ms           |
| repeat-inner-col  | 64  | 4.730 ms              | 16.362 ms                  | 35.214 ms           | 4.747 ms            |
| repeat-inner-col  | 128 | 9.087 ms              | 32.420 ms                  | 70.037 ms           | 9.062 ms            |
| repeat-inner-col  | 256 | 17.711 ms             | 64.519 ms                  | 140.536 ms          | 17.712 ms           |
| repeat-outer-col  | 64  | 5.926 ms              | 17.314 ms                  | 37.021 ms           | 5.826 ms            |
| repeat-outer-col  | 128 | 11.365 ms             | 34.292 ms                  | 73.612 ms           | 11.274 ms           |
| repeat-outer-col  | 256 | 22.312 ms             | 68.370 ms                  | 146.976 ms          | 22.201 ms           |

Here n is the number of repeats performed, and Benchmark names one of the BenchmarkGroups in the suite above; e.g. suite[["repeat-inner-row", 64]] is the benchmark for the first row shown here. As my system is not perfectly isolated, I'm using the minimum benchmark time as my estimator of choice (more info on estimators).
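
For reference, a minimal sketch of how a suite like this can be run and reduced to minimum times, using standard BenchmarkTools calls (`tune!`, `run`, `leaves`, `minimum`):

```julia
tune!(suite)                        # pick evaluation/sample counts per benchmark
results = run(suite; verbose=true)

# Report the minimum time of each trial, converted from ns to ms
for (key, trial) in leaves(results)
    println(join(key, "/"), ": ", minimum(trial).time / 1e6, " ms")
end
```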

Based on these numbers, broadcasting is significantly slower than either existing implementation (89fc61f or 92141a7). And while parallelizing over the source (which minimizes reads) is faster in most cases, it is slower for repeat(x; inner=(n, 1...)), i.e. the repeat-inner-row benchmarks.

Given that, I've added a heuristic: parallelize over the source unless argmax(inner) == 1. This seems to give the best of both worlds. I'm not sure it's the best possible heuristic, but it works well enough.
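
As a rough illustration (CPU-side, with a hypothetical function name rather than the PR's actual kernels), the heuristic amounts to something like:

```julia
# Sketch of the heuristic for `inner` repeats; each loop iteration stands in for a GPU thread.
function sketch_repeat_inner(x::AbstractArray, inner::Dims)
    dst = similar(x, size(x) .* inner)
    if argmax(inner) == 1
        # Largest repeat is along dim 1: parallelizing over the destination keeps writes
        # contiguous, which outweighs re-reading each source element inner[1] times.
        for odx in CartesianIndices(dst)
            idx = CartesianIndex(ntuple(i -> (odx[i] - 1) ÷ inner[i] + 1, ndims(x)))
            dst[odx] = x[idx]
        end
    else
        # Otherwise parallelize over the source: each element is read once and written
        # to its block of inner destination slots.
        for idx in CartesianIndices(x)
            val = x[idx]
            for rdx in CartesianIndices(inner)
                odx = CartesianIndex(ntuple(i -> (idx[i] - 1) * inner[i] + rdx[i], ndims(x)))
                dst[odx] = val
            end
        end
    end
    return dst
end

x = rand(4, 4)
@assert sketch_repeat_inner(x, (3, 1)) == repeat(x, inner=(3, 1))  # destination-parallel branch
@assert sketch_repeat_inner(x, (1, 3)) == repeat(x, inner=(1, 3))  # source-parallel branch
```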

Update

Since the heuristic gave the best performance, I've updated this PR to use it. It's still a bit slower than the dedicated kernels in some cases (I'm guessing the division isn't helping), but hopefully it's "good enough".

awadell1 requested a review from maleadt on July 1, 2022
torfjelde and others added 10 commits (July 4, 2022):

  • gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"
  • Dispatching over the output size results in 1 read + 1 write per output element, vs. dispatching over the input, which gives 1 read per input element + 1 write per output element
  • As cartesianidx will return early if it's out of bounds
  • In this case dispatching threads over the destination ends up faster

@maleadt (Member) left a comment

Thanks, this looks really good! I'd add a comment to explain the heuristic, but otherwise this is good to go.

Added comments detailing heuristic and why it works. Expanded testsuite
to ensure both `repeat_inner_src_kernel!` and `repeat_inner_dst_kernel`
are called.
@awadell1 (Contributor, Author) commented Jul 5, 2022

I've added comments about the heuristic and a few more tests to ensure both kernels get called. Should be good to merge!

Thanks for reviewing and for your help along the way!

@maleadt (Member) commented Jul 6, 2022

Thank you for the PR!
