`@fastmath` support for `sum`, `prod`, `extrema` and `extrema!` #49910
base: master
Conversation
Since this is quite hardware-dependent, can you share your benchmarking code so others can corroborate the speedups?
Could one always use …

Here is the benchmark code …
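(The script itself is not reproduced above; a minimal sketch of such a comparison, assuming BenchmarkTools.jl and using illustrative wrapper names, could look like the following.)

```julia
using BenchmarkTools

# Illustrative wrappers: with this PR the @fastmath form dispatches to the fast
# implementations; on current Julia it falls back to the regular methods.
plain_sum(v) = sum(v)
fast_sum(v)  = @fastmath sum(v)

for T in (Float16, Float32, Float64)
    v = rand(T, 1000)
    t_plain = @belapsed plain_sum($v)
    t_fast  = @belapsed fast_sum($v)
    println(T, "  sum speedup: ", round(t_plain / t_fast; digits = 1))
end
```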
☝️
If that were possible, we'd already do that. It's not just …
Yeah, but since floating-point addition is not associative, the reassociations may lead to catastrophic cancellation and subsequent loss of accuracy. It "works" for … That's why … The issue with #47216 is more about figuring out why exactly our implementation with a default value doesn't end up emitting vector instructions, rather than slapping …
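A small worked example of that non-associativity (not from the thread, just for illustration):

```julia
# With a wide dynamic range, different groupings of the same three terms give
# different answers: the 1.0 is absorbed by 1e16 and then cancelled away.
(1.0 + 1e16) + -1e16   # 0.0  (left-to-right)
1.0 + (1e16 + -1e16)   # 1.0  (the large terms cancel first)

# Values from rand(Float64, n) all lie in [0, 1), so this kind of catastrophic
# cancellation cannot occur for them.
```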
I've modified your benchmarking script a bit to ensure each evaluation starts on fresh, newly generated data, so that there's no chance of caching effects arbitrarily inflating some numbers. It's quite unlikely that repeatedly summing/multiplying the same array over and over again is a regular occurrence.
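A hedged sketch of how such fresh-data measurements can be set up with BenchmarkTools.jl (illustrative only, not necessarily the script used here):

```julia
using BenchmarkTools

# Regenerate the input for every sample: setup runs outside the timed region,
# and evals=1 means each timing sees a freshly generated array.
for T in (Float16, Float32, Float64)
    t_plain = @belapsed sum(v)            setup=(v = rand($T, 1000)) evals=1
    t_fast  = @belapsed @fastmath(sum(v)) setup=(v = rand($T, 1000)) evals=1
    println(T, ": sum speedup ≈ ", round(t_plain / t_fast; digits = 1))
end
```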
`n = 1000` is quite small: that's just 2/4/8 KB of data, which fits in my L1 cache (384 KiB) many times over.
| Function | Float16 | Float32 | Float64 |
|---|---|---|---|
| `sum(v)` | 1.0 | 1.0 | 1.0 |
| `sum(v; init=T(0))` | 25.3 | 11.0 | 9.3 |
| `prod(v)` | 1.0 | 1.0 | 1.0 |
| `prod(v; init=T(1))` | 26.9 | 9.8 | 8.3 |
| `extrema(v)` | N/A | 60.6 | 32.9 |
| `extrema(v; init=(T(1),T(0)))` | N/A | 42.8 | 33.1 |
| `extrema!(rc, a)` | N/A | 1.1 | 1.1 |
| `extrema!(rr, a)` | N/A | 82.7 | 31.0 |
My machine is:

```
Julia Version 1.10.0-DEV.1348
Commit f8f236b4bc (2023-05-21 21:16 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 34 on 24 virtual cores
Environment:
  JULIA_PKG_USE_CLI_GIT = true
```
I especially want to note that I get worse speedup than you on `prod(v; init=T(1))` (and indeed, worse than I'd expect of a proper SIMD implementation). Another thing to keep in mind is that `rand(Float64)` doesn't produce values on the entire range of floating point values, but only in `[0, 1)`, so there's no chance at all of floating point precision being lost (which definitely must be taken into consideration when talking about general `sum`/`prod` and needs to be tested).
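One way to exercise that (a sketch with made-up helper names, not taken from this PR) is to compare against a `BigFloat` reference on both uniform and mixed-magnitude, signed inputs:

```julia
# Relative error of a reduction f on x, measured against a BigFloat reference.
function relative_error(f, x)
    ref = f(big.(x))
    Float64(abs((big(f(x)) - ref) / ref))
end

x_uniform = rand(Float64, 10_000)                                   # [0, 1) only
x_mixed   = randn(Float64, 10_000) .* exp2.(rand(-40:40, 10_000))   # signed, wide range

relative_error(sum, x_uniform), relative_error(sum, x_mixed)
```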
Thanks for the detailed explanations! I didn't worry about floating-point precision because my understanding was (and is) that also the current implementations of …

The speedup for … Anyway, how shall we move forward? If you don't want …
What does this mean? That …
I mean that …
I'm not at the computer I measured the previous results on right now, so I'll have to defer until then. I don't know whether …
I don't think we currently do anything like Kahan summation by default, mostly for performance reasons, but at the same time we also don't guarantee that we don't do that. Additionally, we don't guarantee not to thread/parallelize such a summation by default, which would change the associations. Moreover, by using different blocking strategies, you can vastly improve the performance of even slightly more complicated functions than just …
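For reference, a minimal sketch of Kahan (compensated) summation, the kind of algorithm alluded to above (this is not what Base does, just an illustration):

```julia
# Classic Kahan summation: carry a running compensation term for the low-order
# bits lost when adding each element to the accumulator.
function kahan_sum(x)
    s = zero(eltype(x))
    c = zero(eltype(x))
    for xi in x
        y = xi - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    return s
end

# The plain left-to-right loop drifts as rounding errors accumulate, while the
# compensated loop stays within a few ulps of the exact result.
v = fill(0.1, 10^7)
foldl(+, v), kahan_sum(v)
```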
I don't see it, at least in this version, but shouldn't all calculations be done in Float64? I mean, for the lower-precision types (Float32 etc.), convert to Float64, sum, then convert back to the original type? I believe Float64 is mandated by IEEE, but would all CPU hardware have as many such float units? With SIMD you work on registers of a certain size and can sum as many values as fit in them, but they hold fewer Float64, so maybe this would actually be slower even on CPUs.

Yes, at least on GPUs, where I'm more worried: you do not have as many Float64 units (or none at all; is that still a problem? Historically you didn't always have them). Julia doesn't strictly support GPUs, so this is possibly not a problem, though CUDA 10.1 is listed as Tier 1. I think that means the package will not be broken. Maybe such a change would break it, or make it slower (not on CUDA 10.1?), but it could at least be updated to work. I was looking at the Discourse post: …

I.e., you would get more accuracy out of the "pairwise summation algorithm" for `sum`, never rounded. Is it always allowed to do better? I mean, faster is OK, but for `sum`, is more accurate but non-bit-identical OK?
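A short sketch of the pairwise (divide-and-conquer) summation idea being referred to; Base's `sum` already uses a blocked variant of this, which is why its rounding error grows far more slowly than that of a naive loop:

```julia
# Recursive pairwise summation: split the range in half, sum each half, then
# combine; the rounding error grows roughly like log(n) instead of n.
function pairwise_sum(x, lo = firstindex(x), hi = lastindex(x))
    n = hi - lo + 1
    n <= 8 && return foldl(+, view(x, lo:hi))   # small base case: plain loop
    mid = lo + n ÷ 2 - 1
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
end

pairwise_sum(rand(10_000))
```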
I'd rather not incentivize people to use more …
I think that very few users would care about … I'm not opposed to …
Bump: this was approved, so should it be merged? I didn't read the possible objections carefully. Was this forgotten?
On review, I would say that I am against … There isn't a performance reason for …
This PR extends #48153 by adding `@fastmath` support for `sum`, `prod`, `extrema` and `extrema!`. This would provide a solution to #47216.

Currently, `@fastmath` has no effect for the functions mentioned above. With this PR the speedup factor is …

Here `T` is the given float type, `n = 1000`, `v = rand(T, n)`, `a = rand(T, n, n)`, `rc = fill((T(0),T(0)), n)` and `rr = fill((T(0),T(0)), 1, n)`. At present, the extrema functions crash for `Float16` because of #49907. For the same reason, the corresponding tests are currently skipped in the test file. I have also tried fast versions of `sum!` and `prod!`, but I couldn't get any improvement there.
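For illustration, here is the kind of usage this PR targets, re-creating the variables from the description so the snippet is self-contained (with the PR applied, the `@fastmath` forms hit the new fast methods; without it, they behave like the plain calls):

```julia
T  = Float32
n  = 1000
v  = rand(T, n)
a  = rand(T, n, n)
rc = fill((T(0), T(0)), n)      # preallocated (min, max) results, vector-shaped
rr = fill((T(0), T(0)), 1, n)   # preallocated (min, max) results, row-shaped

@fastmath sum(v)
@fastmath sum(v; init = T(0))
@fastmath prod(v; init = T(1))
@fastmath extrema(v)
@fastmath extrema!(rc, a)
@fastmath extrema!(rr, a)
```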