
Summing along a dimension of a PermutedDimsArray could be faster #38774

Closed
pdeffebach opened this issue Dec 8, 2020 · 3 comments · Fixed by #39467
Labels: arrays, performance

Comments

@pdeffebach (Contributor) commented Dec 8, 2020

I've been wanting to store all my matrices with particular axes, but some matrices need fast iteration over columns and others need fast iteration over rows. The solution to this is to use PermutedDimsArray.

However, it looks like you don't get the full performance benefit of this strategy when using views. Below is an MWE:

julia> function summatrows(x)
       N = size(x, 1)
       z = Vector{Float64}(undef, N)
       @inbounds for i in 1:N
           z[i] = sum(@view x[i, :])
       end
       z
       end;
julia> function summatcols(x)
       N = size(x, 2)
       z = Vector{Float64}(undef, N)
       @inbounds for i in 1:N
           z[i] = sum(@view x[:, i])
       end
       z
       end;
julia> x = rand(1000, 1000);
julia> y = transpose(permutedims(x));
julia> using BenchmarkTools;
julia> @btime summatcols($x);
  354.489 μs (1 allocation: 7.94 KiB)
julia> @btime summatrows($x);
  2.249 ms (1 allocation: 7.94 KiB)
julia> @btime summatcols($y);
  2.345 ms (1 allocation: 7.94 KiB)
julia> @btime summatrows($y);
  1.269 ms (1 allocation: 7.94 KiB)

The last timing should be around 350 μs, but it is instead more than 3 times that.

Note that I think the @view may be the problem. Consider a scenario that only depends on the order of the loops. There, PermutedDimsArray works as expected.

julia> function sumcolumnsfast(x::AbstractMatrix)
           s = 0.0
           for i in 1:size(x, 2)
               for j in 1:size(x, 1)
                   s += x[j, i]
               end
           end
           return s
       end
sumcolumnsfast (generic function with 1 method)
julia> function sumrowsfast(x::AbstractMatrix)
           s = 0.0
           for i in 1:size(x, 1)
               for j in 1:size(x, 2)
                   s += x[i, j]
               end
           end
           return s
       end
sumrowsfast (generic function with 1 method)
julia> x = rand(1000, 1000);
julia> y = PermutedDimsArray(permutedims(x), (2, 1));
julia> y == x
true
julia> sumcolumnsfast(x) ≈ sumrowsfast(x)
true
julia> @btime sumcolumnsfast($x)
  1.270 ms (0 allocations: 0 bytes)
499894.0109807858
julia> @btime sumrowsfast($x)
  1.856 ms (0 allocations: 0 bytes)
499894.0109807898
julia> @btime sumcolumnsfast($y)
  1.836 ms (0 allocations: 0 bytes)
499894.0109807858
julia> @btime sumrowsfast($y)
  1.278 ms (0 allocations: 0 bytes)
499894.0109807898

xref #34847. I commented on that issue, but it might not be the same problem, so I'm filing a new issue here.

Thank you!

@timholy (Member) commented Dec 8, 2020

Most likely it's a cache-order effect. See https://julialang.org/blog/2013/09/fast-numeric/#write_cache-friendly_codes
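Concretely, the cache effect is visible in the strides of the two views: for a column-major Matrix, a column view is contiguous, while a row view jumps by the column length between consecutive elements, touching a new cache line on nearly every access. A small check (using Base's strides on strided views):

```julia
x = rand(1000, 1000)

# Column-major layout: a column view is contiguous in memory, while a row
# view steps by size(x, 1) = 1000 elements between neighbours, so summing
# it is cache-unfriendly.
@assert strides(view(x, :, 1)) == (1,)
@assert strides(view(x, 1, :)) == (1000,)
```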

@mcabbott (Contributor) commented Dec 8, 2020

I agree that summatrows(x) is being asked to do something cache-unfriendly, and is expected to be slow. The view / Transpose wrapper puzzle is why summatrows(y) isn't faster, since that ought to be cache-friendly again.

Without figuring out high-tech things like #34847, is it possible that some views of Transposes should be made to un-wrap and view the original object? This already happens for views of views:

julia> view(x, 1:10, 1) |> typeof
SubArray{Float64, 1, Matrix{Float64}, Tuple{UnitRange{Int64}, Int64}, true}

julia> view(view(x, :, 1), 1:10) |> typeof
SubArray{Float64, 1, Matrix{Float64}, Tuple{UnitRange{Int64}, Int64}, true}
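A minimal sketch of what such an un-wrapping rule could look like for one case (a hypothetical helper, not how Base currently handles views of Transposes): since t[i, k] == parent(t)[k, i], a view of one row of a Transpose can become a contiguous view of one column of the parent.

```julia
using LinearAlgebra

# Hypothetical un-wrapping rule (illustration only, not Base's behavior):
# a row view of a Transpose becomes a contiguous column view of the parent.
row_view(t::Transpose{<:Any,<:AbstractMatrix}, i::Integer) =
    view(parent(t), :, i)

x = rand(4, 4)
t = transpose(x)
@assert row_view(t, 2) == view(t, 2, :)                          # same elements
@assert row_view(t, 2) isa SubArray{Float64,1,Matrix{Float64}}   # wrapper gone
```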

Also, I think summatcols is just sum(x, dims=1), for which #33029 discusses something similar. The lower two times here (reducing y) could almost just dispatch to the upper two, but instead seem to do something more naive:

julia> summatcols(x) == vec(sum(x, dims=1))
true

julia> @btime sum($x, dims=1);
  293.548 μs (1 allocation: 7.94 KiB)

julia> @btime sum($x, dims=2);
  320.756 μs (5 allocations: 8.02 KiB)

julia> @btime sum($y, dims=1);
  1.430 ms (1 allocation: 7.94 KiB)

julia> @btime sum($y, dims=2);
  1.558 ms (5 allocations: 8.02 KiB)
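A hedged sketch of that dispatch idea (perm_sum is a hypothetical helper, not a Base function): reduce over the parent along the permuted dimension, so the summation runs in the parent's contiguous memory order, then lazily permute the result back.

```julia
# Hypothetical dispatch sketch: sum(A::PermutedDimsArray; dims=d) could
# forward to the parent, since A's dimension d corresponds to the parent's
# dimension perm[d], and the reduced result only needs re-permuting.
perm_sum(A::PermutedDimsArray{T,N,perm}; dims::Integer) where {T,N,perm} =
    PermutedDimsArray(sum(parent(A); dims = perm[dims]), perm)

x = rand(100, 100)
y = PermutedDimsArray(permutedims(x), (2, 1))
@assert perm_sum(y; dims = 1) ≈ sum(x, dims = 1)   # ≈: summation order differs
@assert perm_sum(y; dims = 2) ≈ sum(x, dims = 2)
```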

@jishnub (Contributor) commented Jan 27, 2021

Looking at a section of the mapreduce operation:

julia/base/reducedim.jl

Lines 279 to 284 in 527d6b6

@inbounds for IA in CartesianIndices(indsAt)
    IR = Broadcast.newindex(IA, keep, Idefault)
    @simd for i in axes(A, 1)
        R[i,IR] = op(R[i,IR], f(A[i,IA]))
    end
end
it does look like a cache-unfriendly indexing of A if it is a transpose, and flipping the order of the loops improves performance.
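A sketch of what the flipped loop order looks like for a dims=1 sum (sumdims1_flipped is an illustrative toy, not the Base kernel): running the reduced dimension as the outer loop makes the inner loop walk A's second dimension, which is contiguous when A wraps a transpose.

```julia
# Toy dims=1 reduction with the loop order flipped relative to the Base
# kernel quoted above: the inner loop varies the second index, which is
# the fast (contiguous) direction for transposed/permuted wrappers.
function sumdims1_flipped(A::AbstractMatrix{T}) where {T}
    R = zeros(T, 1, size(A, 2))
    @inbounds for i in axes(A, 1)       # outer loop over the reduced dim
        @simd for j in axes(A, 2)       # inner loop is memory-contiguous for y
            R[1, j] += A[i, j]
        end
    end
    return R
end

x = rand(100, 100)
y = PermutedDimsArray(permutedims(x), (2, 1))
@assert sumdims1_flipped(y) ≈ sum(x, dims = 1)   # ≈: summation order differs
```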
