
sum(f, itr; init) slower than sum(f, itr) #47216

Open
giordano opened this issue Oct 18, 2022 · 3 comments
Labels
fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster)

Comments

@giordano (Contributor) commented Oct 18, 2022

In https://github.com/niklas-heer/speed-comparison we noticed that sum(f, itr; init) is slower than sum(f, itr):

julia> f(n) = sum(i -> 4/(4i-2n-3), 1:n)
f (generic function with 1 method)

julia> g(n) = sum(i -> 4/(4i-2n-3), 1:n; init=0.0)
g (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(100_000_000)
  217.560 ms (0 allocations: 0 bytes)
3.1415926435897923

julia> @btime g(100_000_000)
  398.015 ms (0 allocations: 0 bytes)
3.1415926435899633

julia> versioninfo()
Julia Version 1.9.0-DEV.1600
Commit 392bc97a3a (2022-10-16 17:39 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, haswell)
  Threads: 1 on 8 virtual cores

This is a pity, because sum(f, itr; init) can be statically compiled, while sum(f, itr) cannot (its code path contains an internal function that can throw).
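
A hedged way to see that the two calls take different internal routes (the exact dispatch chain is traced in the next comment); h1 and h2 are illustrative names, and + stands in for the Base.add_sum operator that sum actually uses:

using BenchmarkTools

h1(n) = mapreduce(i -> 4/(4i - 2n - 3), +, 1:n)             # no init: pairwise mapreduce path
h2(n) = mapfoldl(i -> 4/(4i - 2n - 3), +, 1:n; init = 0.0)  # with init: sequential foldl path

@btime h1(100_000_000)  # expected to track f's timing above
@btime h2(100_000_000)  # expected to track g's timing above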

giordano added the performance (Must go faster) and fold (sum, maximum, reduce, foldl, etc.) labels on Oct 18, 2022
@Moelf (Contributor) commented Oct 19, 2022

The fast path:

f(n) = Base._mapreduce_dim(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n, :)

## so, a totally different code path
_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)

For arrays longer than 16 elements, this eventually calls:

mapreduce_impl(f, op, A, first(inds), last(inds))

mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize=1024)

The slow path:

g(n) = Base.mapfoldl_impl(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n)
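
For context, a hedged paraphrase (not verbatim Base source; cf. mapreduce_impl in base/reduce.jl) of the shape of the fast path's kernel: short blocks are accumulated with a plain @simd loop, and longer ranges are split in half and combined, which is the pairwise reduction mentioned in the next comment.

function mapreduce_impl_sketch(f, op, A, ifirst, ilast, blksize = 1024)
    if ifirst == ilast
        return f(A[ifirst])
    elseif ilast - ifirst < blksize
        # sequential block: @simd lets LLVM reassociate and vectorize the accumulation
        @inbounds v = op(f(A[ifirst]), f(A[ifirst + 1]))
        @simd for i in (ifirst + 2):ilast
            @inbounds v = op(v, f(A[i]))
        end
        return v
    else
        # pairwise split: reduce each half, then combine the two partial results
        imid = ifirst + ((ilast - ifirst) >> 1)
        return op(mapreduce_impl_sketch(f, op, A, ifirst, imid, blksize),
                  mapreduce_impl_sketch(f, op, A, imid + 1, ilast, blksize))
    end
end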

@N5N3 (Member) commented Oct 19, 2022

Our fold routine has no @simd, so we need @fastmath here to make LLVM vectorize the sum reduction.

g_fast(n) = @fastmath mapfoldl(i -> 4/(4i-2n-3), +, 1:n)

In fact, g_fast should be faster than f, as it doesn't use pairwise reduction.
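
For the record, a hedged reading of what @fastmath buys here: the macro rewrites the arithmetic symbols in the expression to their Base.FastMath counterparts, so the reducing + becomes add_fast, whose fast flags are what let LLVM reassociate and vectorize the foldl loop. g_fast2 is an illustrative name for the roughly equivalent explicit form:

g_fast2(n) = mapfoldl(i -> @fastmath(4 / (4i - 2n - 3)), Base.FastMath.add_fast, 1:n)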

@matthias314 (Contributor)

Let me repeat my suggestion from #49763: sum and prod use add_sum and mul_prod. What about making these functions use the fast versions of + and *? As far as I can tell, sum and prod don't require any fixed order of the terms, and the fast versions seem to respect NaN (at least on my machine). Then we would get:

julia> Base.add_sum(x::Real, y::Real) = @fastmath x+y
julia> v = rand(1000);
julia> @btime sum($v);
  72.881 ns (0 allocations: 0 bytes)   # before: 73.487 ns
julia> @btime sum($v; init = 0.0);
  71.168 ns (0 allocations: 0 bytes)   # before: 1.020 μs
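
As an aside (not proposed in the thread), a hedged sketch of a similar effect from user code, without redefining Base.add_sum: pass the fast operator to reduce explicitly. fastsum is an illustrative name; the fast flags on add_fast should let LLVM reassociate and vectorize the foldl loop for the same reason as g_fast above.

using Base.FastMath: add_fast

fastsum(v; init = zero(eltype(v))) = reduce(add_fast, v; init = init)  # hypothetical helper, not a Base API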

In order to extend this to maximum and minimum, there are two issues:

  • The current versions of max_fast and min_fast do not vectorize (a hedged way to check this is sketched right after this list).
  • It seems tricky to write versions that both vectorize and treat NaN the way the current implementation does.
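
One possible check for the first point (a sketch under the assumption of a Float64 vector; maxloop is an illustrative name, not an API): lower a plain max_fast reduction loop and look for vector instructions in the output.

using InteractiveUtils  # provides @code_llvm (already loaded in the REPL)

function maxloop(v)
    m = v[1]
    @inbounds for i in 2:length(v)
        m = Base.FastMath.max_fast(m, v[i])
    end
    return m
end

@code_llvm debuginfo=:none maxloop(rand(1000))  # per the point above, no <N x double> vector ops are expected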

For example, the function

@generated function fastmax(x::Float64, y::Float64)
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall("""
            %f = fcmp fast ugt double %0, %1
            %r = select i1 %f, double %0, double %1
            ret double %r
            """, Float64, Tuple{Float64,Float64}, x, y)
    end
end

vectorizes, but it doesn't always return NaN if one of the arguments is NaN. Maybe one could add a flag to maximum and minimum to indicate that such a function can be used. It would be much faster than the current implementation:

julia> v = rand(1000);
# currently:
julia> @btime maximum($v);
  1.019 μs (0 allocations: 0 bytes)
julia> @btime maximum($v; init = 0.0);
  1.988 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v);
  1.024 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)
# new
julia> @btime reduce(fastmax, $v);
  74.428 ns (0 allocations: 0 bytes)
julia> @btime reduce(fastmax, $v; init = 0.0);
  71.370 ns (0 allocations: 0 bytes)
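
To make the NaN caveat above concrete, a hedged illustration (the behaviour follows from the ugt predicate, which is true for unordered operands, so the select falls back to the first argument; the fast flags additionally let the optimizer assume NaN never appears):

fastmax(NaN, 1.0)  # a NaN in the first slot is kept
fastmax(1.0, NaN)  # a NaN in the second slot can be dropped, giving 1.0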
