
reduce with init x2 to x100 times slower #49763

Closed
dpinol opened this issue May 11, 2023 · 6 comments
Labels: arrays, performance (Must go faster)

Comments

@dpinol

dpinol commented May 11, 2023

At least with sum & max, specifying the init argument makes the code twice as slow for vectors.
The same happens when using maximum(array) or sum(array).

julia> const v1=rand(1_000);

julia> @btime reduce(max, $v1)
  930.846 ns (0 allocations: 0 bytes)

julia> @btime reduce(max, $v1; init=0.0)
  1.799 μs (0 allocations: 0 bytes)

I guess the implementations diverge at reducedim.jl. If init is specified, the reduction is computed with a fold instead of a reduce:

mapreduce(f, op, A::AbstractArrayOrBroadcasted; dims=:, init=_InitialValue()) =
    _mapreduce_dim(f, op, init, A, dims)

_mapreduce_dim(f, op, nt, A::AbstractArrayOrBroadcasted, ::Colon) =
    mapfoldl_impl(f, op, nt, A)

_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)

With SparseVectors it's even worse (x100 slower)

julia> const vk=sprand(Int, 1000, 1.0);

julia> @btime reduce(max, $vk)
  278.241 ns (0 allocations: 0 bytes)

julia> @btime reduce(max, $vk; init=0)
  26.913 μs (0 allocations: 0 bytes)

Would it make sense to change it to something like this to obtain the same performance?

_mapreduce_dim(f, op, nt, A::AbstractArrayOrBroadcasted, ::Colon) =
    op(nt, _mapreduce(f, op, IndexStyle(A), A))

julia> @btime maximum($v1; init=0)
  907.923 ns (0 allocations: 0 bytes)

julia> @btime maximum($vk; init=0)
  271.606 ns (0 allocations: 0 bytes)

I could reproduce it on Julia 1.8.5 and master (1.10).

@stevengj stevengj added the performance (Must go faster) and arrays labels May 12, 2023
@stevengj
Member

Would it make sense to change it to something like this to obtain the same performance?

No, op(nt, _mapreduce(f, op, IndexStyle(A), A)) defeats the purpose of supplying an init argument. For example, if you pass init=big"0.0" then you want the whole reduction (e.g. a sum) to be done in BigFloat precision. Also, an init argument can be essential if you want to handle empty reductions on collections where zero(eltype(collection)) is not available.
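To illustrate both points (a minimal sketch; the values are only illustrative), a BigFloat init promotes every step of the accumulation, and init is what makes a reduction over an empty collection well-defined at all:

```julia
# With init = big"0.0", the fold accumulates in BigFloat,
# so each partial sum is carried at BigFloat precision.
s = reduce(+, [0.1, 0.2, 0.3]; init = big"0.0")
println(typeof(s))  # BigFloat

# Without init, reducing an empty collection throws:
err = try
    reduce(max, Float64[])
catch e
    e
end
println(err isa ArgumentError)  # true

# With init, the empty case is well-defined:
println(reduce(max, Float64[]; init = -Inf))  # -Inf
```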

That being said, I agree that it would be nice to re-arrange the methods here so that we still call a higher-performance implementation even when given init.

@matthias314
Contributor

matthias314 commented May 18, 2023

It depends on the element type. For Int64, the version with init is 3x faster on my machine:

julia> v = rand(Int, 1000);
julia> @btime reduce(max, $v)
  305.313 ns (0 allocations: 0 bytes)
julia> @btime reduce(max, $v; init = 0)
  104.967 ns (0 allocations: 0 bytes)

For Float64 one gets nearly the same timings if one uses @fastmath:

julia> v = rand(Float64, 1000);
julia> @btime reduce(max, $v)
  1.026 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v; init = 0.0)
  1.048 μs (0 allocations: 0 bytes)

Would it make sense to always use @fastmath for sum, prod, maximum, minimum and extrema?
EDIT: This doesn't work with NaN.
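A sketch of that NaN caveat (the exact behavior of fast-math comparisons is implementation-dependent and may vary by version and platform): regular max propagates NaN, whereas the fast version compares with > (for which NaN is always false) and can silently drop it:

```julia
v = [1.0, NaN, 2.0]

println(reduce(max, v))            # NaN: Base.max propagates NaN
println(@fastmath reduce(max, v))  # may return 2.0, silently dropping the NaN
```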

@dpinol
Author

dpinol commented May 19, 2023

@matthias314 thanks for your tests.
Strangely, @fastmath makes the code slightly slower without init 🤷

julia> @btime @fastmath reduce(max, $v1)
  994.316 ns (0 allocations: 0 bytes)
0.9995914743168539

julia> @btime reduce(max, $v1)
  842.732 ns (0 allocations: 0 bytes)
0.9995914743168539

@matthias314
Contributor

Yes, I've also noticed this. I think it would be good to understand how the code paths with and without init differ.

On my machine at least, the fast versions of + and * handle NaN correctly (unlike max and min). So there seems to be an easy optimization on the table for sum and prod:

julia> v = rand(1000);
julia> @btime sum($v);
  73.424 ns (0 allocations: 0 bytes)
julia> @btime sum($v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(+, $v);
  73.722 ns (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(+, $v; init = 0.0);
  71.631 ns (0 allocations: 0 bytes)

julia> @btime prod($v)
  73.999 ns (0 allocations: 0 bytes)
julia> @btime prod($v; init = 1.0)
  2.533 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(*, $v)
  74.219 ns (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(*, $v; init = 1.0)
  71.762 ns (0 allocations: 0 bytes)

For instance, one could change Base.add_sum and Base.mul_prod to use the fast versions.
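A quick sanity check for that idea (a sketch; Base.FastMath.add_fast and mul_fast are internals and, strictly speaking, fast-math flags make NaN handling implementation-dependent, but in practice NaN propagates through IEEE addition and multiplication regardless of reassociation):

```julia
using Base.FastMath: add_fast, mul_fast

println(isnan(add_fast(NaN, 1.0)))  # true: NaN still propagates through fast +
println(isnan(mul_fast(NaN, 2.0)))  # true: same for fast *

# So a fastmath-based sum still reports a NaN in the data:
println(@fastmath reduce(+, [1.0, NaN, 2.0]))  # NaN
```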

@giordano
Contributor

Duplicate of #47216?

@giordano
Contributor

Yes, I'm pretty sure this ticket is a duplicate of the other one. Closing this to continue the discussion in a single place instead of spreading it over multiple pages.
