Use native fmin/fmax in aarch64 #47814

Merged (7 commits, Jan 3, 2023)

Conversation

@gbaraldi (Member) commented Dec 6, 2022

AArch64 has native fmin/fmax instructions that follow IEEE semantics. Since we use our own min function, we don't lower to them. It's about 2.5x faster in my experiments.

I'm not sure this is the cleanest way of doing this.
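
For reference, Julia's min propagates NaN and treats -0.0 as smaller than 0.0, which matches the documented semantics of LLVM's llvm.minimum intrinsic. A minimal sketch of both (the llvmcall form mirrors what this PR uses):

julia> min(NaN, 1.0)   # NaN propagates
NaN

julia> min(-0.0, 0.0)  # signed zeros are ordered
-0.0

julia> llvm_min(x::Float64, y::Float64) =
           ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y);

julia> llvm_min(NaN, 1.0)
NaN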

@gbaraldi gbaraldi requested review from N5N3 and oscardssmith and removed request for N5N3 December 6, 2022 16:18
base/math.jl Outdated
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    @static if Sys.ARCH === :aarch64

Member:
we might want to add a has_correct_min(T) to correspond with has_fma(T)
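
A hypothetical sketch of what such a query could look like, mirroring the shape of the has_fma(T) check mentioned above (has_correct_min does not exist; the name and the architecture test are illustrative only):

# Hypothetical: true when llvm.minimum/llvm.maximum lower to a single
# native instruction for T on the current target.
has_correct_min(::Type{T}) where {T<:Union{Float32,Float64}} =
    Sys.ARCH === :aarch64
has_correct_min(::Type) = false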

Member:
I think this will also need to be marked :consistent

base/math.jl Outdated
jlop, llvmop = op
for type in _minmaxtypes
    fntyp, jltyp, llvmtyp = type
    @eval @inline function $(Symbol("llvm_", jlop))(x::$jltyp, y::$jltyp)

Member:
does this need effects?

Member Author:
I haven't checked, but it's very likely.

Member Author:
Is this total? Or foldable + nothrow?

Member:
Should be :total (although I think those are effectively the same).
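
For context, :total is the strongest effects setting and implies :foldable and :nothrow (plus a few more, like :notaskstate). A sketch of the two spellings discussed, with illustrative function names:

# What the PR ends up using:
Base.@assume_effects :total @inline llvm_min_total(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)

# Weaker, but still enough to allow constant folding:
Base.@assume_effects :foldable :nothrow @inline llvm_min_weaker(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)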

base/math.jl Outdated
($"""declare $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp,$llvmtyp)
define $llvmtyp @entry($llvmtyp,$llvmtyp) #0 {
2:
%3 = call $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp %0, $llvmtyp %1)

Member:
We normally export these through intrinsics; compare Core.Intrinsics.sqrt_llvm.
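
For reference, the existing intrinsics are directly callable, e.g. (a quick sketch):

julia> Core.Intrinsics.sqrt_llvm(2.0)
1.4142135623730951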

Member Author:
The issue is that not all backends support it. Do we then call min from libm if someone hits that? Or just let LLVM throw its lowering error, since they are doing something wrong anyway?

Member:
Lowering to libm is fine by me.

Member:
I tried to do more complicated lowering at some point, and in theory we could lower back to Julia, but that is overcomplicating things.

Member Author:
Actually, lowering to libm is wrong: libm's min differs from ours in that it doesn't follow the NaN handling we do.
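
Concretely, libm's fmin returns the numeric operand when exactly one argument is NaN, while Julia's min (and llvm.minimum) propagate the NaN. A quick check:

julia> ccall(:fmin, Float64, (Float64, Float64), NaN, 1.0)  # libm: drops the NaN
1.0

julia> min(NaN, 1.0)  # Julia / llvm.minimum: propagates the NaN
NaN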

Member:
Then see injectCRT and julia__gnu_h2f_ieee for how we handle the compiler intrinsics; LLVM inserts those late in the backend.

Member:
I would just implement min in C like we do with fma
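
If a pure-software fallback were needed, it could also stay on the Julia side. A hypothetical sketch matching Julia's min semantics (the name and the formulation are illustrative, not the PR's code):

# Hypothetical fallback with Julia's semantics: NaN propagates,
# and -0.0 is considered smaller than 0.0.
_min_fallback(x::T, y::T) where {T<:Union{Float32,Float64}} =
    ifelse(isnan(x) | (signbit(x) & !signbit(y)) | (x < y), x, y)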

base/math.jl Outdated
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    if has_native_fminmax
        return llvm_min(x, y)
    else

@giordano (Contributor), Dec 6, 2022:
Indentation below is off now. Maybe just end here?

base/math.jl Outdated
Comment on lines 854 to 857
Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_min(x::Float32, y::Float32) = ccall("llvm.minimum.f32", llvmcall, Float32, (Float32, Float32), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) = ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float32, y::Float32) = ccall("llvm.maximum.f32", llvmcall, Float32, (Float32, Float32), x, y)

Contributor:
Can we define these functions only if has_native_fminmax?
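
A sketch of that suggestion, assuming has_native_fminmax is a compile-time Bool constant:

@static if has_native_fminmax
    Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) =
        ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
    Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) =
        ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
    # ... and the Float32 variants likewise
end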

@brenhinkeller added the performance (Must go faster) label Dec 7, 2022
@giordano (Contributor) commented Dec 7, 2022

> It's about 2.5x faster in my experiments.

How did you measure that? I didn't see any difference in some quick benchmarks I did yesterday on M1 (although using hardware instructions directly is good anyway).

@gbaraldi (Member Author) commented Dec 7, 2022

I was comparing against 1.8; I guess somewhere in 1.9 it got a bit faster, so now it's only about 10% faster :p

using BenchmarkTools

a = rand(100000)
b = rand(100000)
c = similar(a)
@btime $c = min.($a,$b)
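
Note that $c = min.($a, $b) re-allocates the output array on every evaluation; the in-place broadcast used in the later benchmarks avoids that:

@btime $c .= min.($a, $b)  # fused in-place broadcast, no allocation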

@mikmoore (Contributor) commented Dec 7, 2022

After this, we should also revisit minimum/maximum. The current implementations are a mess that introduces all sorts of issues and brittleness (including accidentally making them almost 3x slower for integers... whoops). These issues would be trivially fixed with native min/max.

I have a PR (#45581) for improving minimum/maximum using other means, but I haven't made much progress lately (last time I worked on it I was stuck trying to fix minimum!/maximum! performance). This PR would make superseding that one very easy on participating architectures (just remove the current specializations), although other architectures would still benefit from #45581. Hopefully I can bring myself to spend some more time on it over the holidays...

@gbaraldi (Member Author) commented Dec 7, 2022

This might not change much in your case; it's just an optimization that LLVM isn't applying for aarch64, and maybe other architectures too, though I can't test those.

@ViralBShah (Member) commented:
Merge?

@giordano (Contributor) commented Jan 2, 2023

For future reference, on M1 I get (min/max are the current implementations on master, llvm_min/llvm_max are the functions introduced in this PR):

julia> @benchmark c .= min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9526 samples with 1 evaluation.
 Range (min … max):  27.708 μs … 239.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.188 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.250 μs ±  12.000 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▅▁█▂▁
  ▇▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▁▁▁▂▂▂▂▃▄▄▆██████▇▅▅▅▄▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  27.7 μs         Histogram: frequency by time         98.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9554 samples with 1 evaluation.
 Range (min … max):  24.709 μs … 265.084 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.459 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.242 μs ±  11.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▁▅▄██▃▃
  █▂▂▂▂▂▂▂▁▁▂▁▁▁▂▁▁▁▁▁▂▂▂▂▂▃▄▅▇████████▇▆▅▅▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  24.7 μs         Histogram: frequency by time         94.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9525 samples with 1 evaluation.
 Range (min … max):  27.708 μs … 137.833 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.671 μs ±  10.734 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▄▃█▁▁
  ▆▂▂▂▂▂▂▂▁▂▂▁▁▁▂▁▂▁▂▁▁▂▂▂▂▄▄▅▇██████▇▅▅▄▄▄▃▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  27.7 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9657 samples with 1 evaluation.
 Range (min … max):  24.791 μs … 228.416 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.896 μs ±  11.526 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                ▂▄▄█▃▂
  ▇▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▁▁▁▂▁▂▂▂▂▂▄▄▆▇███████▅▅▄▄▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  24.8 μs         Histogram: frequency by time         93.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On A64FX:

julia> @benchmark c .= min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2920 samples with 1 evaluation.
 Range (min … max):  47.011 μs … 133.952 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.740 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.823 μs ±  12.122 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃
  ██▁▁█▄▃▁▃▁▁▁▃▁▅▁▃▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▃▁▄▅▄▃▁▅▃▁▁▁▁▁▃▇ █
  47 μs         Histogram: log(frequency) by time       130 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2928 samples with 1 evaluation.
 Range (min … max):  42.890 μs … 127.772 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.820 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.818 μs ±  11.797 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄
  ██▆▅▆█▃▁▃▁▁▁▃▃▃▃▄▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▃▆▄▃▁▃▃▄▁▁▁▅▇ █
  42.9 μs       Histogram: log(frequency) by time       124 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2917 samples with 1 evaluation.
 Range (min … max):  46.791 μs … 153.081 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.730 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.069 μs ±  12.434 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄ ▁ ▁
  ██▄█▆█▄▄▁▁▁▁▄▁▄▄▁▆▅▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▄▃▁▃▅▅▃▁▄▃▁▁▃▁▁▅▇ █
  46.8 μs       Histogram: log(frequency) by time       130 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2926 samples with 1 evaluation.
 Range (min … max):  42.820 μs … 150.722 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.760 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.785 μs ±  11.994 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅
  ██▆▄█▇▃▁▁▁▁▄▅▃▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▁▄▄▆▁▃▃▃▁▁▁▁▅▇ █
  42.8 μs       Histogram: log(frequency) by time       124 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

There is an improvement, but way less than the initially promised 2.5x.

What's weird is that on A64FX the reduction gets noticeably slower:

julia> a=randn(100_000);

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.562 μs …  1.506 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.882 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.190 μs ± 13.210 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄█
  ▅██▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  192 μs          Histogram: frequency by time          201 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.642 μs … 279.223 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.412 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.638 μs ±   1.409 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▇▆▆▃▃
  ▃▆████████▄▃▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  233 μs           Histogram: frequency by time          241 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(min, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.751 μs … 232.872 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.981 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.143 μs ±   1.226 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▃                                                          ▁
  ████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▇▇▇▇ █
  192 μs        Histogram: log(frequency) by time        200 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_min, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.982 μs … 258.642 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     234.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   234.576 μs ±   1.303 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          █▂
  ▂▂▂▂▃▃▄████▅▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  233 μs           Histogram: frequency by time          243 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On M1, by contrast, I see a small improvement for reductions too.

@gbaraldi (Member Author) commented Jan 2, 2023

I wonder if there is a vectorization difference on the A64FX; maybe SVE doesn't support it? Can you make a for-loop version so that we can analyze a bit further?

@giordano (Contributor) commented Jan 2, 2023

I used

function my_reduce(f, v, init)
    out = init
    @simd for x in v  # @simd allows reassociation so the loop can vectorize
        out = f(out, x)
    end
    return out
end

On A64FX:

julia> a=randn(100_000);

julia> @benchmark my_reduce(max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 2215 samples with 1 evaluation.
 Range (min … max):  2.245 ms … 2.282 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.247 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.249 ms ± 4.297 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃██▇▁
  ▂▃▃▆█████▆▅▄▃▂▁▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂ ▃
  2.25 ms        Histogram: frequency by time       2.26 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark my_reduce(llvm_max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  223.873 μs … 250.233 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     224.373 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.566 μs ±   1.284 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▄                                                         ▁
  ▆▄▅██▇▅▄▁▃▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▄▅▇▆▆▇▆▇ █
  224 μs        Histogram: log(frequency) by time        232 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.781 μs … 232.362 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     192.021 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.191 μs ±   1.343 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▁
  ▄██▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  192 μs           Histogram: frequency by time          200 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.372 μs … 254.252 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.202 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.465 μs ±   1.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆▅▃█▁
  ▂▇███████▆▅▅▇▅▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  232 μs           Histogram: frequency by time          241 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

my_reduce(llvm_max, ...) is much faster than my_reduce(max, ...), comparable with reduce(llvm_max, ...), but slower than reduce(max, ...).

For comparison, on M1:

julia> @benchmark my_reduce(max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 9845 samples with 1 evaluation.
 Range (min … max):  499.542 μs … 639.291 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     499.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   504.502 μs ±   8.448 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂▁▄▅  ▁▁                      ▁▃▃▁▁▁ ▁▁       ▁▁           ▁
  ████████████▇▆▇▆█▆▇▇▆▇▆▇▇▇▇▇▇█████████████▇▇▇▇▇▇▇██▇▇▅▅▃▅▄▄▄▅ █
  500 μs        Histogram: log(frequency) by time        528 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark my_reduce(llvm_max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  62.583 μs … 166.792 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     62.709 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.508 μs ±   2.732 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃         ▄           ▁                                     ▁
  ██▅▇▇▇█▇▇▇▇█▇▇██▅▆▆▆▆▇▆█▆▅▅▄▅▅▅▅▄▄▅▄▅▅▄▄▅▄▄▃▅▅▅▅▅▆▆▆▅▅▅▄▄▅▄▆ █
  62.6 μs       Histogram: log(frequency) by time      74.6 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  63.750 μs … 95.125 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.875 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.527 μs ±  2.195 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄           ▁▁                                             ▁
  ██▆▇▇▇██▇▇▇▇▇██▆▆▅▄▆▆▆▆▅▅▃▄▄▅▆▆▆▆▄▅▆▄▅▅▄▅▄▄▆▄▅▃▅▆▆▆▅▅▅▅▅▅▄▆ █
  63.8 μs      Histogram: log(frequency) by time        76 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  57.541 μs … 169.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.708 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.827 μs ±   3.136 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄        ▅▁  ▁      ▃                                       ▁
  ██▆██████▇██████▇▇▇▇██▇█▆▆▅▆▇▆▆▆▅▅▅▅▅▅▆▅▄▅▅▅▅▄▅▆▆▆▆▆▆▅▆▆▅▄▅▄ █
  57.5 μs       Histogram: log(frequency) by time      69.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@gbaraldi (Member Author) commented Jan 2, 2023

That's doubly weird. It's a shame; the mapreduce machinery is probably hiding something here. Does profiling show a difference somewhere?

@giordano (Contributor) commented Jan 2, 2023

OK, it turns out that reduce(max, a) is faster than reduce(llvm_max, a) because reduce(::typeof(max), a) is somehow internally optimised, even though it calls the same reduction operator as reduce(::typeof(some_other_function), a):

julia> @benchmark reduce(max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4939 samples with 1 evaluation.
 Range (min … max):  189.862 μs …  1.354 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     190.961 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   191.515 μs ± 16.729 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▆█▇  ▄▁▂▁▁                                                 ▁
  █████▇▆█████▆▁▃▁▁▁▁▁▁▃▃▁▁▁▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▅▃▅▅▄▅▆ █
  190 μs        Histogram: log(frequency) by time       208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4638 samples with 1 evaluation.
 Range (min … max):  232.332 μs … 285.223 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.619 μs ±   2.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▇▇▆▃                                                       ▂
  ████████▃▃▁▁▁▃▆▇▇▇▃█▇▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▅▅▄▁▆▄▄▆▇▆▆▇ █
  232 μs        Histogram: log(frequency) by time        250 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Base.max(x::T, y::T) where {T<:Union{Float32,Float64}} = llvm_max(x, y)

julia> @benchmark reduce(max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4848 samples with 1 evaluation.
 Range (min … max):  186.571 μs … 240.413 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     187.592 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   187.891 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇▂▅█
  ▇████▄▂▃▃▂▂▂▂▂▁▁▁▂▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▁▂▂▂▂▂▂▂▂ ▃
  187 μs           Histogram: frequency by time          205 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4637 samples with 1 evaluation.
 Range (min … max):  232.372 μs …  1.501 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.192 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   234.058 μs ± 18.959 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▇▇▅▁                                                      ▂
  ███████▆▁▁▁▁▅███▆▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▅▆▅▄▅▆▄▆▅▆▅▆▇▅ █
  232 μs        Histogram: log(frequency) by time       251 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

After redefining Base.max (so basically what this PR does), reduce(max, a) should in principle be the same as reduce(llvm_max, a), but performance differs substantially, which confused me.
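
A plausible explanation (my reading, not verified in this thread) is that Base ships a dedicated mapreduce_impl method for op::Union{typeof(max),typeof(min)} in base/reduce.jl, so reduce(max, a) takes that specialized path regardless of how max itself is defined, while reduce(llvm_max, a) goes through the generic machinery. One way to check where dispatch lands:

julia> using InteractiveUtils

julia> a = randn(100_000);

julia> @which reduce(max, a)   # then follow the chain into mapreduce_impl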

At this point I'd say this PR is a net (marginal) improvement on both CPUs.

@gbaraldi (Member Author) commented Jan 2, 2023

Thanks for looking into it!

@giordano (Contributor) commented Jan 3, 2023

Failures on Windows are unrelated (#48101); I'm going to merge this. Thanks Gabriel!
