Use native fmin/fmax in aarch64 #47814
Conversation
base/math.jl
Outdated
```julia
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    @static if Sys.ARCH === :aarch64
```
We might want to add a `has_correct_min(T)` to correspond with `has_fma(T)`.
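A sketch of what such a query could look like (the name `has_correct_min` and the architecture check are illustrative, not existing Base API):

```julia
# Hypothetical capability query mirroring Base.Math.has_fma(T):
# true when the target has IEEE-correct native min/max instructions
# (e.g. aarch64's fminnm/fmaxnm). Illustrative only, not Base API.
has_correct_min(::Type{T}) where {T<:Union{Float32,Float64}} = Sys.ARCH === :aarch64
has_correct_min(::Type) = false
```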
I think this will also need to be marked `:consistent`.
base/math.jl
Outdated
```julia
jlop, llvmop = op
for type in _minmaxtypes
    fntyp, jltyp, llvmtyp = type
    @eval @inline function $(Symbol("llvm_", jlop))(x::$jltyp, y::$jltyp)
```
does this need effects?
I haven't checked, but it's very likely.
Is this `:total`? Or `:foldable` + `:nothrow`?
It should be `:total` (although I think those are the same).
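As a rough illustration of what the annotation buys (a sketch with a simplified, NaN-ignoring kernel, not the PR's code): a `:total` function with compile-time-known arguments can be constant-folded away entirely.

```julia
# Illustrative only: marking a pure float kernel :total allows the
# compiler to constant-fold calls whose arguments are known constants.
Base.@assume_effects :total @inline fold_min(x::Float64, y::Float64) =
    ifelse(x < y, x, y)

fold_min(1.0, 2.0)  # → 1.0
```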
base/math.jl
Outdated
```julia
($"""declare $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp, $llvmtyp)
define $llvmtyp @entry($llvmtyp, $llvmtyp) #0 {
2:
    %3 = call $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp %0, $llvmtyp %1)
```
We normally export these through intrinsics. Compare `Core.Intrinsics.sqrt_llvm`.
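For reference, that existing intrinsic is directly callable, which shows the export path being suggested here:

```julia
# Core.Intrinsics.sqrt_llvm lowers to the llvm.sqrt intrinsic,
# the same mechanism being proposed for min/max.
Core.Intrinsics.sqrt_llvm(4.0)  # → 2.0
```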
The issue is that not all backends support it. Do we then call min from libm if someone hits that? Or do we just let LLVM throw its lowering error, since they are doing something wrong anyway?
Lowering to libm is fine by me.
I tried to do more complicated lowering at some point, and in theory we could lower back to Julia, but that is overcomplicating things.
Actually, lowering to libm is wrong: libm's min differs from ours because it doesn't respect the NaN handling we do.
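The difference is easy to demonstrate (a quick check, assuming libm's `fmin` is reachable via `ccall`): Julia's `min` propagates NaN, while C99 `fmin` returns the non-NaN operand.

```julia
# Julia semantics: NaN is propagated.
min(NaN, 1.0)                                        # → NaN
# libm/C99 semantics: the NaN operand is treated as missing data.
ccall(:fmin, Float64, (Float64, Float64), NaN, 1.0)  # → 1.0
```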
Then see `injectCRT` and `julia__gnu_h2f_ieee` for how we handle the compiler intrinsics. LLVM inserts those late, during the backend.
I would just implement min in C, like we do with fma.
base/math.jl
Outdated
```julia
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    if has_native_fminmax
        return llvm_min(x, y)
    else
```
Indentation below is off now. Maybe just `end` here?
base/math.jl
Outdated
```julia
Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_min(x::Float32, y::Float32) = ccall("llvm.minimum.f32", llvmcall, Float32, (Float32, Float32), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) = ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float32, y::Float32) = ccall("llvm.maximum.f32", llvmcall, Float32, (Float32, Float32), x, y)
```
Can we define these functions only if `has_native_fminmax`?
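One illustrative way to gate the definitions (the `Sys.ARCH` check here stands in for however the PR actually computes `has_native_fminmax`):

```julia
# Hypothetical top-level guard: only emit the llvmcall wrapper on
# targets where llvm.minimum lowers to a native instruction.
const has_native_fminmax = Sys.ARCH === :aarch64

if has_native_fminmax
    @inline llvm_min(x::Float64, y::Float64) =
        ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
end
```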
How did you measure that? I didn't see any difference in some quick benchmarks I did yesterday on M1 (although using hardware instructions directly is good anyway).
I was comparing to 1.8; I guess somewhere in 1.9 it got a bit faster, so now it's only about 10% faster :p

```julia
a = rand(100000)
b = rand(100000)
c = similar(a)
@btime $c = min.($a, $b)
```
After this, we should also revisit
I have a PR for improving
Co-authored-by: Valentin Churavy <[email protected]>
This might not change that much in your case; it's just an optimization that LLVM isn't seeing for aarch64, and maybe other targets too, though I can't test.
Merge?
For future reference, on M1 I get:
On A64FX:
There is an improvement, but way less than the initially promised 2.5x. What's weird is that on A64FX the reduction gets noticeably slower:
Instead, on M1 I see a small improvement also for reductions.
I wonder if there is a vectorization difference on the A64FX; maybe SVE doesn't support it? Can you make a for-loop version so that we can analyze a bit further?
I used:

```julia
function my_reduce(f, v, init)
    out = init
    @simd for x in v
        out = f(out, x)
    end
    return out
end
```

On A64FX:
For comparison, on M1:
That's doubly weird; it's a shame the mapreduce machinery is probably hiding something here. Does profiling show a difference somewhere?
Ok, it turns out that
After redefinition of
At this point, I'd say this PR is a net (marginal) improvement on both CPUs.
Thanks for looking into it!
Failures on Windows are unrelated (#48101). I'm going to merge this, thanks Gabriel!
AArch64 has native fmin/fmax instructions that follow IEEE semantics. Since we use our own min function, we don't lower to them. It's about 2.5x faster in my experiments.
I'm not sure this is the cleanest way of doing this.
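One quick way to check the lowering on a given machine (output is target-dependent; `fminnm` should appear only on aarch64):

```julia
using InteractiveUtils  # for code_native

# Print the machine code generated for min(::Float64, ::Float64);
# on aarch64 this should contain an fminnm instruction after this change.
code_native(stdout, min, (Float64, Float64); debuginfo=:none)
```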