`@fastmath` support for `sum`, `prod`, `extrema` and `extrema!` #49910

Open · wants to merge 2 commits into master
Conversation

@matthias314 (Contributor) commented May 21, 2023

This PR extends #48153 by adding `@fastmath` support for `sum`, `prod`, `extrema` and `extrema!`. This would provide a solution to #47216.

Currently, `@fastmath` has no effect for the functions mentioned above. With this PR, the speedup factors are:

| Function | Float16 | Float32 | Float64 |
| --- | --- | --- | --- |
| `sum(v)` | 1.0 | 1.0 | 1.0 |
| `sum(v; init=T(0))` | 28.9 | 23.8 | 14.2 |
| `prod(v)` | 1.0 | 1.0 | 1.0 |
| `prod(v; init=T(1))` | 29.1 | 22.5 | 36.1 |
| `extrema(v)` | N/A | 38.2 | 20.4 |
| `extrema(v; init=(T(1),T(0)))` | N/A | 40.9 | 21.1 |
| `extrema!(rc, a)` | N/A | 1.2 | 1.1 |
| `extrema!(rr, a)` | N/A | 21.0 | 7.3 |

Here `T` is the given float type, `n = 1000`, `v = rand(T, n)`, `a = rand(T, n, n)`, `rc = fill((T(0),T(0)), n)` and `rr = fill((T(0),T(0)), 1, n)`. At present, the `extrema` functions crash for `Float16` because of #49907; for the same reason, the corresponding tests are currently skipped in the test file. I have also tried fast versions of `sum!` and `prod!`, but I couldn't get any improvement there.

@Seelengrab (Contributor) commented:

A `@fastmath` version is IMO not a solution to the non-`@fastmath` version being slower than necessary - the assumptions violated by `@fastmath` need to hold for the regular versions after all.

Since this is quite hardware dependent, can you share your benchmarking code so others can corroborate the speedups?

@matthias314 (Contributor, author) commented:

Could one always use `@fastmath` for `sum` and `prod`? When I tried it out, the fast versions dealt with NaN correctly. Rearranging terms should be allowed for `sum` and `prod` anyway.

Here is the benchmark code:

```julia
using BenchmarkTools, Printf

TT = (Float16, Float32, Float64)

n = 1000

println("| T | ", join(TT, " | "), " |")

@printf "| sum(v) |"
for T in TT
    v = rand(T, n)
    a = minimum(@benchmark sum($v))
    b = minimum(@benchmark @fastmath sum($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| sum(v; init=T(0)) |"
for T in TT
    v = rand(T, n)
    i = zero(T)
    a = minimum(@benchmark sum($v; init=$i))
    b = minimum(@benchmark @fastmath sum($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v) |"
for T in TT
    v = rand(T, n)
    a = minimum(@benchmark prod($v))
    b = minimum(@benchmark @fastmath prod($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v; init=T(1)) |"
for T in TT
    v = rand(T, n)
    i = one(T)
    a = minimum(@benchmark prod($v; init=$i))
    b = minimum(@benchmark @fastmath prod($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    v = rand(T, n)
    a = minimum(@benchmark extrema($v))
    b = minimum(@benchmark @fastmath extrema($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v; init=(T(1),T(0))) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    v = rand(T, n)
    i = (one(T), zero(T))
    a = minimum(@benchmark extrema($v; init=$i))
    b = minimum(@benchmark @fastmath extrema($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rc, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    A = rand(T, n, n)
    ra = fill((T(0),T(0)), n)
    rb = copy(ra)
    a = minimum(@benchmark extrema!($ra, $A))
    b = minimum(@benchmark @fastmath extrema!($rb, $A))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rr, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    A = rand(T, n, n)
    ra = fill((T(0),T(0)), 1, n)
    rb = copy(ra)
    a = minimum(@benchmark extrema!($ra, $A))
    b = minimum(@benchmark @fastmath extrema!($rb, $A))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"
```

@giordano (Contributor) commented:

> A `@fastmath` version is IMO not a solution to the non-`@fastmath` version being slower than necessary

☝️

@Seelengrab (Contributor) commented:

> Could one always use `@fastmath` for `sum` and `prod`? When I tried it out, the fast versions dealt with NaN correctly.

If that were possible, we'd already be doing it. It's not just NaN, though - naive use of `@fastmath` (especially on things like reductions) can lead to horrible numerical accuracy, because the reassociations may not be numerically sound. I'd wager that using `@fastmath` by default breaks this, for example, but there are bound to be others.

> Rearranging terms should be allowed for `sum` and `prod` anyway.

Yeah, but since floating-point addition is not associative, the reassociations may lead to catastrophic cancellation and subsequent loss of accuracy. It "works" for `sum` and `prod` because those are (on x86, anyway) already doing the correct thing and using SIMD, due to Intel actually implementing addition correctly (and this is also why you don't see a speedup for these two). For other things like `min`, Intel got lazy and thought they could save some hardware resources by only propagating the right operand if either one is a NaN (which is incorrect - NaN should be propagated either way).
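For illustration, a minimal `Float64` example of how reassociating the same three terms changes the result:

```julia
julia> (1.0 + 1e16) - 1e16   # the 1.0 is absorbed: 1.0 + 1e16 rounds to 1e16
0.0

julia> 1.0 + (1e16 - 1e16)   # mathematically identical, different association
1.0
```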

That's why `@fastmath` is opt-in, why the global flag became a no-op, and why there's discussion about further exposing the LLVM flags it uses under the hood.

The issue with #47216 is more about figuring out why exactly our implementation with a default value doesn't end up emitting vector instructions, rather than slapping `@fastmath` on it and calling it a day. It should have enough information to do that, after all.

@Seelengrab (Contributor) commented May 22, 2023

I've modified your benchmarking script a bit to ensure each evaluation starts on fresh, newly generated data, so that there's no chance of caching effects arbitrarily inflating some numbers. It's quite unlikely that repeatedly summing/multiplying the same array over and over again is a regular occurrence.

I've also noticed that `n = 1000` is quite small - that's just 2/4/8 KB of data, which fits comfortably in my L1 cache (384 KiB) many times over.

```julia
using BenchmarkTools, Printf

TT = (Float16, Float32, Float64)

n = 1000

println("| T | ", join(TT, " | "), " |")

@printf "| sum(v) |"
for T in TT
    a = minimum(@benchmark sum(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(sum(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| sum(v; init=T(0)) |"
for T in TT
    a = minimum(@benchmark sum(v; init=i) setup=(v=rand($T,n); i=rand($T)) evals=1)
    b = minimum(@benchmark @fastmath(sum(v; init=i)) setup=(v=rand($T,n); i=rand($T)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v) |"
for T in TT
    a = minimum(@benchmark prod(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(prod(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v; init=T(1)) |"
for T in TT
    a = minimum(@benchmark prod(v; init=i) setup=(v=rand($T,n); i=rand($T)) evals=1)
    b = minimum(@benchmark @fastmath(prod(v; init=i)) setup=(v=rand($T,n); i=rand($T)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v; init=(T(1),T(0))) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema(v; init=i) setup=(v=rand($T,n); i=(rand($T),rand($T))) evals=1)
    b = minimum(@benchmark @fastmath(extrema(v; init=i)) setup=(v=rand($T,n); i=(rand($T),rand($T))) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rc, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema!(ra, A) setup=(A=rand($T,n,n); ra=fill(($T(0),$T(0)), n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema!(rb, A)) setup=(A=rand($T,n,n); rb=fill(($T(0),$T(0)), n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rr, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema!(ra, A) setup=(A=rand($T,n,n); ra=fill(($T(0),$T(0)), 1, n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema!(rb, A)) setup=(A=rand($T,n,n); rb=fill(($T(0),$T(0)), 1, n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"
```

This is the result:

| Function | Float16 | Float32 | Float64 |
| --- | --- | --- | --- |
| `sum(v)` | 1.0 | 1.0 | 1.0 |
| `sum(v; init=T(0))` | 25.3 | 11.0 | 9.3 |
| `prod(v)` | 1.0 | 1.0 | 1.0 |
| `prod(v; init=T(1))` | 26.9 | 9.8 | 8.3 |
| `extrema(v)` | N/A | 60.6 | 32.9 |
| `extrema(v; init=(T(1),T(0)))` | N/A | 42.8 | 33.1 |
| `extrema!(rc, a)` | N/A | 1.1 | 1.1 |
| `extrema!(rr, a)` | N/A | 82.7 | 31.0 |

My machine is:

```
Julia Version 1.10.0-DEV.1348
Commit f8f236b4bc (2023-05-21 21:16 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 34 on 24 virtual cores
Environment:
  JULIA_PKG_USE_CLI_GIT = true
```

I especially want to note that I get a smaller speedup than you on `prod(v; init=T(1))` (and indeed, smaller than I'd expect of a proper SIMD implementation). Another thing to keep in mind is that `rand(Float64)` doesn't produce values over the entire range of floating-point values, but only in [0, 1), so there's no chance at all of floating-point precision being lost (which definitely must be taken into consideration when talking about general `sum`/`prod` and needs to be tested).
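One way to test that point would be to draw inputs from the full representable range by reinterpreting random bit patterns. A minimal sketch; `rand_fullrange` is a hypothetical helper, not an existing API:

```julia
# Hypothetical helper: finite Float64 values spread over the whole
# representable range, unlike rand(), which only covers [0, 1).
function rand_fullrange(n::Integer)
    v = Vector{Float64}(undef, n)
    i = 1
    while i <= n
        x = reinterpret(Float64, rand(UInt64))  # random bit pattern
        if isfinite(x)                          # skip NaNs and Infs
            v[i] = x
            i += 1
        end
    end
    return v
end
```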

@matthias314 (Contributor, author) commented:

Thanks for the detailed explanations! I didn't worry about floating-point precision because my understanding was (and is) that the current implementations of `sum` and `prod` also add up elements in a way that depends only on the length of the input, not on the actual values. Whatever order is chosen, there will always be some input on which it performs poorly.

> I especially want to note that I get a smaller speedup than you on `prod(v; init=T(1))` (and indeed, smaller than I'd expect of a proper SIMD implementation).

The speedup for `prod(v; init=T(1))` makes it as fast as `prod(v)` on my machine. Do you think one can get faster than that?

Anyway, how shall we move forward? If you don't want `@fastmath` for `sum` and `prod`, then we can either forget about the PR or restrict it to `extrema` and `extrema!`.

@giordano (Contributor) commented:

> the current implementations of `sum` and `prod` add up elements in a way that only depends on the length of the input, not on the actual values.

What does this mean? That `sum([1.0, 2.0]) == sum([3.0, 4.0])`?

@matthias314 (Contributor, author) commented:

I mean that `sum([x1,x2,x3])` boils down to a fixed evaluation order like `(x1+x2)+x3`. If I know the evaluation order, I can cook up a bad input; changing the evaluation to `x1+(x2+x3)` simply permutes what the bad inputs are. Given that `sum(v; kw...)` essentially calls `reduce(+, v; kw...)`, I don't expect there to be any clever evaluation in our code that minimizes the total number of bad inputs. Please correct me if I'm wrong.
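For what it's worth, the effect of the evaluation order is easy to see by comparing a strict left-to-right fold against `sum`; a small illustration (exact outputs depend on the machine and Julia version):

```julia
v = fill(0.1f0, 10_000_000)

foldl(+, v)              # strict left-to-right: large accumulated rounding error
sum(v)                   # pairwise reduction: much closer to the true ~1.0f6
Float32(sum(Float64, v)) # higher-precision reference value
```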

@Seelengrab (Contributor) commented:

> The speedup for `prod(v; init=T(1))` makes it as fast as `prod(v)` on my machine. Do you think one can get faster than that?

I'm not at the computer I measured the previous results on right now, so I'll have to defer until then. I don't know whether `prod(v)` already hits peak FLOPS, but all things being equal, I would have expected to see at least a bigger speedup for `Float32` compared to `Float16` and `Float64`, since the AVX-512 instructions cleanly double in capacity if you halve the bit width of the floating-point type.

> Given that `sum(v; kw...)` essentially calls `reduce(+, v; kw...)`, I don't expect there to be any clever evaluation in our code that minimizes the total number of bad inputs. Please correct me if I'm wrong.

I don't think we currently do anything like Kahan summation by default, mostly for performance reasons, but at the same time we also don't guarantee that we don't do it. Additionally, we don't guarantee not to thread/parallelize such a summation by default, which would change the associations.
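For reference, Kahan (compensated) summation keeps a running error term so that the low-order bits lost in each addition are fed back in. A minimal sketch, not Base's implementation:

```julia
# Kahan summation: compensate each addition for its rounding error.
function kahan_sum(v)
    s = zero(eltype(v))   # running sum
    c = zero(eltype(v))   # running compensation for lost low-order bits
    for x in v
        y = x - c
        t = s + y         # low-order bits of y are lost here...
        c = (t - s) - y   # ...and recovered into the compensation term
        s = t
    end
    return s
end
```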

Moreover, by using different blocking strategies you can VASTLY improve the performance of even slightly more complicated functions than just `+` (or even that one), due to caching effects, better register/value reuse, fewer cache evictions, etc. As such, while I'm not in principle opposed to `@fastmath sum`, I don't think it's necessarily a good/easy win to strive for - if we want a vectorized regular `sum`, `@fastmath` won't help with that, and recommending `@fastmath` instead of fixing the issue with `init` seems backwards, seeing as we otherwise discourage its use unless it's known to be appropriate (which most naive use really isn't). If you truly want to squeeze out performance, there's just no way around doing something smarter than a NaN/Inf-ignoring, reassociating `reduce(+, ...)` (especially on multidimensional data).

@PallHaraldsson (Contributor) commented May 23, 2023

> A `@fastmath` version is IMO not a solution to the non-`@fastmath` version being slower than necessary

I don't see it, at least in this version, but shouldn't all calculations be done in `Float64`?

I mean: should the lower-precision types like `Float32` be converted to `Float64`, summed, and then converted back to the original type?

I believe `Float64` is mandated by IEEE, but would all CPU hardware have as many such float units? With SIMD you work on registers of a certain size and can sum as many elements as fit, but those registers hold fewer `Float64` values, so maybe this would actually be slower even on CPUs.

> Since this is quite hardware dependent

Yes, at least on GPUs, where I'm more worried: you don't have as many `Float64` units (or none at all - is that still a problem? they weren't always present historically). Julia doesn't strictly support GPUs, so this is possibly not a problem, though CUDA 10.1 is listed as Tier 1; I think that means the package will not be broken. Maybe such a change would break it, or make it slower (not on CUDA 10.1?), but it could at least be updated to work.

I was looking at the discourse post:

> Julia currently uses a pairwise summation algorithm, which is much better than naive left-to-right reduction while having very nearly the same speed. There is also a `sum_kbn` function

I.e. you would get more accuracy out of the "pairwise summation algorithm" for `sum`. Is it always allowed to do better? I mean, faster is OK, but for `sum`, is more accurate, non-bit-identical output OK?
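For context, a minimal sketch of the pairwise summation the quote refers to; the base-case threshold here is arbitrary, and to my knowledge Base uses a blocked variant with a much larger base case for speed:

```julia
# Pairwise summation: split-and-recurse keeps the rounding-error growth
# at O(log n) instead of the O(n) of a naive left-to-right loop.
# Assumes a non-empty array.
function pairwise_sum(v::AbstractVector, lo::Int=firstindex(v), hi::Int=lastindex(v))
    if hi - lo < 8                 # small range: plain sequential loop
        s = v[lo]
        for i in lo+1:hi
            s += v[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1      # split in half and recurse
        return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid+1, hi)
    end
end
```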

@mikmoore (Contributor) commented May 23, 2023

I'd rather not incentivize people to use more `@fastmath`.

`sum` is not documented to use any particular association order, and in the huge majority of applications the user likely doesn't care. Further, `?reduce` (whose family I consider `sum` to be a part of) explicitly documents that "The associativity of the reduction is implementation dependent." If someone insists on a particular evaluation order, they should use something explicit like `foldl(+, x)` instead. So I would argue that we currently have some flexibility here.

I think that very few users would care about `@fastmath sum` if we could ensure that `sum` (and `mapreduce`, more broadly) used the `reassoc` fastmath flag on the reduction. Most of the speedup (and almost all of the mostly-safe speedup) comes from SIMD enabled by `reassoc`. The remaining flags are prone to causing trouble (e.g., #49387). See also #49890.

I'm not opposed to `@fastmath` being able to penetrate `mapreduce` and friends, but we should be sure that reassociation is permitted even without it (#48129). That way people won't be so tempted to apply `@fastmath`. Hopefully, that means people will do it less unless they really think it's safe and important due to other operations within the reduction.
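As a rough sketch of that reassociation-only middle ground: Julia's `@simd` annotation already grants exactly this license for a handwritten reduction loop, without the other `@fastmath` relaxations (`simd_sum` is a hypothetical name for illustration):

```julia
# Sketch: @simd permits the compiler to reassociate the reduction over s,
# enabling vectorization, while leaving NaN/Inf semantics intact.
function simd_sum(v::AbstractVector{T}) where {T<:AbstractFloat}
    s = zero(T)
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end
```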

@brenhinkeller added the performance (Must go faster) and maths (Mathematical functions) labels on Aug 5, 2023
@PallHaraldsson (Contributor) commented:
Bump: this was approved, so should it be merged? I didn't read the possible objections carefully. Was this forgotten?

@fingolfin (Member) commented:

This was bumped last November; bumping again.

I also see no clear objection, though the last comment by @mikmoore seems to be critical, but I am not sure whether it is meant as an actual objection to this PR. Perhaps @mikmoore would like to clarify?

@mikmoore (Contributor) commented:

On review, I would say that I am against `@fastmath` for `sum`/`prod` but accepting of it for `extrema`. Discussion follows:

There isn't a performance reason for `@fastmath` to be applicable to `sum`/`prod` (with `identity` as the predicate -- it could accelerate some actual function, but that's a different question). Any speedup there is purely due to reassociation (or UB, if the compiler knows some inputs are invalid and decides to eliminate whole operations). Reassociation is already semantically permitted (and used when `init` isn't specified, which is why that case gets no speedup); the issue is just that the implementation with `init` fails to use it. Fixing `init` is the solution, not `@fastmath`.

`extrema` does have a reasonable performance argument for `@fastmath` (permitting it to mis-handle NaN and the difference between -0.0 and +0.0 can make it much faster on most architectures, due to the available instructions). Much of the speedup suggested here would be recovered by #45581, but that PR is not quite finished and I haven't touched it in over a year now, so it's hard to say this should wait indefinitely on that. `@fastmath` would still be faster than that PR could ever achieve (thanks to the aforementioned mis-handling). Also, (it appears?) we already have `@fastmath minimum`/`maximum`, so this would be consistent with those.
