
reduce with init x2 to x100 times slower #49763

Closed
dpinol opened this issue May 11, 2023 · 6 comments
Labels: arrays, performance (Must go faster)

Comments

@dpinol

dpinol commented May 11, 2023

At least with sum & max, specifying the init argument makes the code twice as slow for vectors.
The same happens when using maximum(array) or sum(array).

julia> const v1=rand(1_000);

julia> @btime reduce(max, $v1)
  930.846 ns (0 allocations: 0 bytes)

julia> @btime reduce(max, $v1; init=0.0)
  1.799 μs (0 allocations: 0 bytes)

I guess the implementations diverge at reducedim.jl. If init is specified, the reduction is computed with a fold instead of a reduce:

mapreduce(f, op, A::AbstractArrayOrBroadcasted; dims=:, init=_InitialValue()) =
    _mapreduce_dim(f, op, init, A, dims)

_mapreduce_dim(f, op, nt, A::AbstractArrayOrBroadcasted, ::Colon) =
    mapfoldl_impl(f, op, nt, A)

_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)

With SparseVectors it's even worse (x100 slower)

julia> const vk=sprand(Int, 1000, 1.0);

julia> @btime reduce(max, $vk)
  278.241 ns (0 allocations: 0 bytes)

julia> @btime reduce(max, $vk; init=0)
  26.913 μs (0 allocations: 0 bytes)

Would it make sense to change it to something like this to obtain the same performance?

_mapreduce_dim(f, op, nt, A::AbstractArrayOrBroadcasted, ::Colon) =
    op(nt, _mapreduce(f, op, IndexStyle(A), A))

julia> @btime maximum($v1; init=0)
  907.923 ns (0 allocations: 0 bytes)

julia> @btime maximum($vk; init=0)
  271.606 ns (0 allocations: 0 bytes)

I could reproduce it on Julia 1.8.5 and master (1.10).

@stevengj stevengj added the performance (Must go faster) and arrays labels May 12, 2023
@stevengj
Member

Would it make sense to change it to something like this to obtain the same performance?

No, op(nt, _mapreduce(f, op, IndexStyle(A), A)) defeats the purpose of supplying an init argument. For example, if you pass init=big"0.0" then you want the whole reduction (e.g. a sum) to be done in BigFloat precision. Also, an init argument can be essential if you want to handle empty reductions on collections where zero(eltype(collection)) is not available.
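To illustrate both points (a minimal sketch; the values are only illustrative), a BigFloat init promotes every step of the accumulation, and init is what makes a reduction over an empty collection well-defined at all:

```julia
# With init = big"0.0", the fold accumulates in BigFloat,
# so each partial sum is carried at BigFloat precision.
s = reduce(+, [0.1, 0.2, 0.3]; init = big"0.0")
println(typeof(s))  # BigFloat

# Without init, reducing an empty collection throws:
err = try
    reduce(max, Float64[])
catch e
    e
end
println(err isa ArgumentError)  # true

# With init, the empty case is well-defined:
println(reduce(max, Float64[]; init = -Inf))  # -Inf
```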

That being said, I agree that it would be nice to re-arrange the methods here so that we still call a higher-performance implementation even when given init.

@matthias314
Contributor

matthias314 commented May 18, 2023

It depends on the element type. For Int64, the version with init is 3x faster on my machine:

julia> v = rand(Int, 1000);
julia> @btime reduce(max, $v)
  305.313 ns (0 allocations: 0 bytes)
julia> @btime reduce(max, $v; init = 0)
  104.967 ns (0 allocations: 0 bytes)

For Float64 one gets nearly the same timings if one uses @fastmath:

julia> v = rand(Float64, 1000);
julia> @btime reduce(max, $v)
  1.026 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v; init = 0.0)
  1.048 μs (0 allocations: 0 bytes)

Would it make sense to always use @fastmath for sum, prod, maximum, minimum and extrema?
EDIT: This doesn't work with NaN.
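A sketch of that NaN caveat (the exact behavior of fast-math comparisons is implementation-dependent and may vary by version and platform): regular max propagates NaN, whereas the fast version compares with > (for which NaN is always false) and can silently drop it:

```julia
v = [1.0, NaN, 2.0]

println(reduce(max, v))            # NaN: Base.max propagates NaN
println(@fastmath reduce(max, v))  # may return 2.0, silently dropping the NaN
```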

@dpinol
Author

dpinol commented May 19, 2023

@matthias314 thanks for your tests.
Strangely, @fastmath makes the code slightly slower without init 🤷

julia> @btime @fastmath reduce(max, $v1)
  994.316 ns (0 allocations: 0 bytes)
0.9995914743168539

julia> @btime reduce(max, $v1)
  842.732 ns (0 allocations: 0 bytes)
0.9995914743168539

@matthias314
Contributor

Yes, I've also noticed this. I think it would be good to understand how the code paths with and without init differ.

On my machine at least, the fast versions of + and * handle NaN correctly (unlike max and min). So there seems to be an easy optimization on the table for sum and prod:

julia> v = rand(1000);
julia> @btime sum($v);
  73.424 ns (0 allocations: 0 bytes)
julia> @btime sum($v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(+, $v);
  73.722 ns (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(+, $v; init = 0.0);
  71.631 ns (0 allocations: 0 bytes)

julia> @btime prod($v)
  73.999 ns (0 allocations: 0 bytes)
julia> @btime prod($v; init = 1.0)
  2.533 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(*, $v)
  74.219 ns (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(*, $v; init = 1.0)
  71.762 ns (0 allocations: 0 bytes)

For instance, one could change Base.add_sum and Base.mul_prod to use the fast versions.
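A quick sanity check for that idea (a sketch; Base.FastMath.add_fast and mul_fast are internals and, strictly speaking, fast-math flags make NaN handling implementation-dependent, but in practice NaN propagates through IEEE addition and multiplication regardless of reassociation):

```julia
using Base.FastMath: add_fast, mul_fast

println(isnan(add_fast(NaN, 1.0)))  # true: NaN still propagates through fast +
println(isnan(mul_fast(NaN, 2.0)))  # true: same for fast *

# So a fastmath-based sum still reports a NaN in the data:
println(@fastmath reduce(+, [1.0, NaN, 2.0]))  # NaN
```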

@giordano
Contributor

Duplicate of #47216?

@giordano
Contributor

Yes, I'm pretty sure this ticket is a duplicate of the other one. Closing this to continue the discussion in a single place instead of spreading it over multiple pages.
