Fix `for i in iter` (#65)
One cautionary note: in a test script here I get this:

```julia
julia> @btime old2d!($out, $A, $kern);
  67.972 μs (0 allocations: 0 bytes)

julia> @btime avx2d!($out, $A, $kern);
  787.508 μs (0 allocations: 0 bytes)
```

It seems possible in principle that this is a consequence of this PR. As you'll see in the gist (for at least some of these), other challenges include:
Codecov Report

```diff
@@           Coverage Diff           @@
##           master     #65    +/-   ##
==========================================
- Coverage   92.95%   92.93%   -0.03%
==========================================
  Files          25       25
  Lines        2810     2815      +5
==========================================
+ Hits         2612     2616      +4
- Misses        198      199      +1
```

Continue to review the full report at Codecov.
Update: yep, that's a lot of it, but not all of it. I added an `avx2d_unpack!` variant:

```julia
julia> @btime avx2d_unpack!($out, $A, $kern);
  101.279 μs (0 allocations: 0 bytes)
```

which is much better, but still worse than the original `old2d!`.
```julia
# include your gist here
using VectorizationBase, BenchmarkTools
using LoopVectorization, OffsetArrays    # for @avx and OffsetArray
using ImageFiltering: centered           # assuming `centered` comes from ImageFiltering
# old2d! and avx2d_unpack! are defined elsewhere in the gist

# gesp is "get element strided pointer": shift the parent's pointer by the
# negated offsets so the OffsetArray can be indexed with its shifted indices
VectorizationBase.stridedpointer(A::OffsetArray) =
    VectorizationBase.gesp(stridedpointer(parent(A)), -1 .* A.offsets)

function avx2d_unpackall!(out::AbstractMatrix, A::AbstractMatrix, kern::OffsetArray,
                          R=CartesianIndices(out), z=zero(eltype(out)))
    rng1k, rng2k = axes(kern)
    rng1kf, rng1kl = first(rng1k), last(rng1k)
    rng2kf, rng2kl = first(rng2k), last(rng2k)
    rng1, rng2 = R.indices
    rng1f, rng1l = first(rng1), last(rng1)
    rng2f, rng2l = first(rng2), last(rng2)
    # Manually unpack the OffsetArray
    kernA = parent(kern)
    o1, o2 = kern.offsets
    @avx for j in rng2f:rng2l, i in rng1f:rng1l
        tmp = z
        for jk in rng2kf:rng2kl, ik in rng1kf:rng1kl
            tmp += A[i+ik, j+jk] * kernA[ik-o1, jk-o2]
        end
        out[i, j] = tmp
    end
    out
end

A = rand(Float32, 100, 100);
kern = centered(rand(Float32, 3, 3));
out1 = OffsetArray(similar(A, size(A) .- 2), 1, 1); # stay away from the edges of A
out2 = similar(out1); out3 = similar(out1);
@benchmark old2d!($out1, $A, $kern)
@benchmark avx2d_unpack!($out2, $A, $kern)
@benchmark avx2d_unpackall!($out3, $A, $kern)
out1 ≈ out2
out1 ≈ out3
```

On my computer (with AVX512), this yields

```julia
julia> out1 = OffsetArray(similar(A, size(A) .- 2), 1, 1); # stay away from the edges of A

julia> out2 = similar(out1); out3 = similar(out1);

julia> @benchmark old2d!($out1, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.879 μs (0.00% GC)
  median time:      66.952 μs (0.00% GC)
  mean time:        67.110 μs (0.00% GC)
  maximum time:     107.237 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d_unpack!($out2, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     109.412 μs (0.00% GC)
  median time:      109.645 μs (0.00% GC)
  mean time:        109.798 μs (0.00% GC)
  maximum time:     156.803 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d_unpackall!($out3, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.055 μs (0.00% GC)
  median time:      3.074 μs (0.00% GC)
  mean time:        3.078 μs (0.00% GC)
  maximum time:     5.444 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> out1 ≈ out2
true

julia> out1 ≈ out3
true
```

Applying `@avx` to all of the loops, as `avx2d_unpackall!` does above, is what produces the big speedup; normally, applying it to the inner loops alone does not pay off here.

It may be worth considering passing size information when some loops are much longer than others (e.g., the outer loops vs the inner loops) to reduce the risk of it deciding to do something unproductive. For example, it could decide to unroll 4x and vectorize (an 8x or 16x factor for single precision, depending on AVX/AVX512), yielding a total of 32x or 64x unrolling. If one loop is only 7 iterations, it had best not do that. I added "size hint" infrastructure that there currently isn't any way to take advantage of. These hints are used for optimization decisions, but not for lowering; currently, the hint is always set to 1024. Perhaps we can use this in some way to make sure it optimizes as though these inner loops are short, while still keeping the number of iterations dynamic.
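To make the unrolling arithmetic concrete, here is a quick back-of-the-envelope check (assuming Float32 with AVX512's 16 lanes; the figures come from the paragraph above):

```julia
lanes  = 16                # Float32 lanes per AVX512 register (8 with AVX)
unroll = 4
step   = lanes * unroll    # 64 elements consumed per lowered iteration
trip   = 7                 # a short inner loop, as in the example above
divrem(trip, step)         # (0, 7): every iteration falls into remainder handling
```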
I should add the OffsetArray specialization for `stridedpointer`.
It would be fantastic to add
Two problems here. The first is that support is currently lousy for mixing types within the loops. I originally envisioned everything being

The second problem, supporting generic number types, is much harder. The problem then is that given type

Here is an issue specific to

The manual approach of adding support per element type would work, but I am sure there is a means more aligned with, and showing off, Julia's generic power. I'll take a look at what's going wrong with the iterables performance-wise soon.
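As a minimal sketch of that manual approach (illustrative only, not this package's API): promote the operands to one concrete type before the loop, so the loop body only ever sees homogeneous arithmetic.

```julia
x32, y64 = rand(Float32, 8), rand(Float64, 8)
T = promote_type(eltype(x32), eltype(y64))  # Float64
x, y = T.(x32), T.(y64)                     # both are Vector{Float64} now
x .* y                                      # single-type arithmetic in the loop body
```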
Love it!
The ultimate solution seems to be moving this into the compiler, since there you have everything (you know all the types). But in the meantime, would it help to use https://github.com/JuliaArrays/TiledIteration.jl and convert chunks of the array to a common numeric type?
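Something along these lines, sketched under the assumption that a fast type-homogeneous kernel exists (`sum` stands in for it here; `tilesize` and `T` are illustrative choices):

```julia
using TiledIteration

function tiled_sum(A; tilesize=(64, 64), T=Float32)
    s = zero(T)
    for tileaxs in TileIterator(axes(A), tilesize)
        tile = T.(view(A, tileaxs...))  # convert this chunk to the common type
        s += sum(tile)                  # stand-in for the fast homogeneous kernel
    end
    s
end
```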
As an example of your unrolling, it would be nice to handle colors:

```julia
using ImageCore
c = RGB(0.12, 0.45, 0.38)
```

Typically the filter is real-valued, but pixels are often color.
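For instance (a hedged illustration; ColorVectorSpace, re-exported by ImageCore, supplies the arithmetic), a real coefficient acts on each channel independently, so a real-valued filter over RGB pixels is just channel-wise multiply-accumulate:

```julia
using ImageCore
c = RGB(0.12, 0.45, 0.38)
0.3 * c + 0.7 * RGB(0.2, 0.1, 0.0)   # acts channel-wise, like a 2-tap filter
red(c), green(c), blue(c)            # the three underlying Float64 channels
```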
I'm going to push to this PR soon. I'm changing how loop bounds are handled.
…rables, and handle different starting offsets of OffsetArrays based on linear vs cartesian indexing.
Ultimately, it would be great if MLIR (or another project) rendered the performance side of LoopVectorization obsolete. In the meantime, punting the loop to a generated function does give us type information useful for some things, though not nearly as much as is available to the compiler.
What sort of type conversions can it do? In my recent commits, I now have
No rush, I'm playing with an attempt to add support for external passes to Julia's compiler. The idea is to allow you to set up a callback function that receives the type-inferred CodeInfo with something like

```julia
macro avx(ex)
    esc(Expr(:block,
             Expr(:meta, Expr(:external_pass, :start, avx_pass)),
             ex,
             Expr(:meta, Expr(:external_pass, :stop))))
end
```

where you've defined the function `avx_pass`. If this works you'll really hate me, because it would basically encourage you to rewrite this package from scratch 😦. OTOH, in the long run it should make a lot of things a lot easier.
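For concreteness, a purely hypothetical sketch of what such a callback could look like; the external-pass hook does not exist in Julia's compiler today, so both the hook and this signature are assumptions:

```julia
function avx_pass(ci::Core.CodeInfo)
    # ci.code holds the type-inferred statements; a real pass would
    # recognize the loop nest here and rewrite it to vectorized IR.
    for stmt in ci.code
        # inspect / transform statements
    end
    return ci
end
```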
I don't seem to know how to push to PRs; I pushed here by mistake:

```julia
julia> @benchmark old2d!($out2, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     67.172 μs (0.00% GC)
  median time:      67.442 μs (0.00% GC)
  mean time:        67.769 μs (0.00% GC)
  maximum time:     111.490 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d!($out2, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     130.930 μs (0.00% GC)
  median time:      132.447 μs (0.00% GC)
  mean time:        132.576 μs (0.00% GC)
  maximum time:     188.808 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2douter!($out3, $A, $kern)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.662 μs (0.00% GC)
  median time:      2.684 μs (0.00% GC)
  mean time:        2.686 μs (0.00% GC)
  maximum time:     4.959 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9
```

They're defined here (and tests pass), using ranges for iteration.
If it's a
I don't pretend to really understand this package, but FWIW it looks good to me! If you're happy, merge whenever. Thanks for the amazing support! I'll keep you posted about the compiler work.
I can merge this and that commit separately, or I can wait until you merge that commit into this PR.

On indexing

Thanks for the PR and interest. I'll definitely be interested in your compiler work, especially if, like
Somehow it's not updating that PR. Feel free to just open a PR at that
This supports syntax like `for i in iter`. Because `iter` could be a `CartesianIndex`, I presume we have to treat it the same way `eachindex(A)` gets treated.

Some of this is just whitespace changes (courtesy of Atom), but you can select to ignore those when reviewing. Also note that I added a couple of `using` statements at the top of the test files to facilitate running them independently; if you don't like that, I can remove that part.
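As a hedged illustration of the supported syntax (the function and names below are mine, not from the PR): the iterator is passed in as a value, so the loop bounds are only known dynamically.

```julia
using LoopVectorization

function mydot(a, b, iter = eachindex(a, b))
    s = zero(promote_type(eltype(a), eltype(b)))
    @avx for i in iter      # `iter` is an iterator variable, not a literal range
        s += a[i] * b[i]
    end
    s
end

a = rand(Float32, 128); b = rand(Float32, 128);
mydot(a, b) ≈ a' * b        # sanity check
```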