Use native fmin/fmax in aarch64 #47814

Merged (7 commits, Jan 3, 2023)

Conversation

@gbaraldi (Member) commented Dec 6, 2022

AArch64 has native fmin/fmax instructions that follow IEEE semantics. Since we use our own min function, we don't lower to them. It's about 2.5x faster in my experiments.

I'm not sure this is the cleanest way of doing this.
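
For reference, Julia's min propagates NaN and treats -0.0 as smaller than 0.0, which matches the documented semantics of LLVM's llvm.minimum intrinsic. A minimal sketch of both (the llvmcall form mirrors what this PR uses):

julia> min(NaN, 1.0)   # NaN propagates
NaN

julia> min(-0.0, 0.0)  # signed zeros are ordered
-0.0

julia> llvm_min(x::Float64, y::Float64) =
           ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y);

julia> llvm_min(NaN, 1.0)
NaN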

@gbaraldi gbaraldi requested review from N5N3 and oscardssmith and removed request for N5N3 December 6, 2022 16:18
base/math.jl Outdated
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    @static if Sys.ARCH === :aarch64

Member:
we might want to add a has_correct_min(T) to correspond with has_fma(T)
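
A hypothetical sketch of what such a query could look like, mirroring the shape of the has_fma(T) check mentioned above (has_correct_min does not exist; the name and the architecture test are illustrative only):

# Hypothetical: true when llvm.minimum/llvm.maximum lower to a single
# native instruction for T on the current target.
has_correct_min(::Type{T}) where {T<:Union{Float32,Float64}} =
    Sys.ARCH === :aarch64
has_correct_min(::Type) = false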

Member:
I think this will also need to be marked :consistent

base/math.jl Outdated
jlop, llvmop = op
for type in _minmaxtypes
    fntyp, jltyp, llvmtyp = type
    @eval @inline function $(Symbol("llvm_", jlop))(x::$jltyp, y::$jltyp)

Member:
does this need effects?

Member Author:
I haven't checked, but it's very likely.

Member Author:
Is this total? Or foldable + nothrow?

Member:
Should be :total (although I think those are effectively the same).
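
For context, :total is the strongest effects setting and implies :foldable and :nothrow (plus a few more, like :notaskstate). A sketch of the two spellings discussed, with illustrative function names:

# What the PR ends up using:
Base.@assume_effects :total @inline llvm_min_total(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)

# Weaker, but still enough to allow constant folding:
Base.@assume_effects :foldable :nothrow @inline llvm_min_weaker(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)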

base/math.jl Outdated
($"""declare $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp,$llvmtyp)
define $llvmtyp @entry($llvmtyp,$llvmtyp) #0 {
2:
%3 = call $llvmtyp @llvm.$llvmop.$fntyp($llvmtyp %0, $llvmtyp %1)

Member:
We normally export these through intrinsics; compare Core.Intrinsics.sqrt_llvm.
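
For reference, the existing intrinsics are directly callable, e.g. (a quick sketch):

julia> Core.Intrinsics.sqrt_llvm(2.0)
1.4142135623730951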

Member Author:
The issue is that not all backends support it. Do we then call min from libm if someone hits that? Or just let LLVM throw its lowering error, since they are doing something wrong anyway?

Member:
Lowering to libm is fine by me.

Member:
I tried to do more complicated lowering at some point, and in theory we could lower back to Julia, but that is overcomplicating things.

Member Author:
Actually, lowering to libm is wrong: libm's min differs from ours in that it doesn't follow the NaN handling we do.
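
Concretely, libm's fmin returns the numeric operand when exactly one argument is NaN, while Julia's min (and llvm.minimum) propagate the NaN. A quick check:

julia> ccall(:fmin, Float64, (Float64, Float64), NaN, 1.0)  # libm: drops the NaN
1.0

julia> min(NaN, 1.0)  # Julia / llvm.minimum: propagates the NaN
NaN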

Member:
Then see injectCRT and julia__gnu_h2f_ieee for how we handle the compiler intrinsics; LLVM inserts those late in the backend.

Member:
I would just implement min in C like we do with fma
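
If a pure-software fallback were needed, it could also stay on the Julia side. A hypothetical sketch matching Julia's min semantics (the name and the formulation are illustrative, not the PR's code):

# Hypothetical fallback with Julia's semantics: NaN propagates,
# and -0.0 is considered smaller than 0.0.
_min_fallback(x::T, y::T) where {T<:Union{Float32,Float64}} =
    ifelse(isnan(x) | (signbit(x) & !signbit(y)) | (x < y), x, y)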

base/math.jl Outdated
function min(x::T, y::T) where {T<:Union{Float32,Float64}}
    if has_native_fminmax
        return llvm_min(x, y)
    else

@giordano (Contributor), Dec 6, 2022:
Indentation below is off now. Maybe just end here?

base/math.jl Outdated
Comment on lines 854 to 857
Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_min(x::Float32, y::Float32) = ccall("llvm.minimum.f32", llvmcall, Float32, (Float32, Float32), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) = ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
Base.@assume_effects :total @inline llvm_max(x::Float32, y::Float32) = ccall("llvm.maximum.f32", llvmcall, Float32, (Float32, Float32), x, y)

Contributor:
Can we define these functions only if has_native_fminmax?
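
A sketch of that suggestion, assuming has_native_fminmax is a compile-time Bool constant:

@static if has_native_fminmax
    Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) =
        ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
    Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) =
        ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
    # ... and the Float32 variants likewise
end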

@brenhinkeller added the performance (Must go faster) label Dec 7, 2022
@giordano (Contributor) commented Dec 7, 2022

> It's about 2.5x faster in my experiments.

How did you measure that? I didn't see any difference in some quick benchmarks I did yesterday on M1 (although using hardware instructions directly is good anyway).

@gbaraldi (Member Author) commented Dec 7, 2022

I was comparing against 1.8; I guess somewhere in 1.9 it got a bit faster, so now it's only about 10% faster :p

using BenchmarkTools

a = rand(100000)
b = rand(100000)
c = similar(a)
@btime $c = min.($a,$b)
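
Note that $c = min.($a, $b) re-allocates the output array on every evaluation; the in-place broadcast used in the later benchmarks avoids that:

@btime $c .= min.($a, $b)  # fused in-place broadcast, no allocation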

@mikmoore (Contributor) commented Dec 7, 2022

After this, we should also revisit minimum/maximum. The current implementations are a mess that introduces all sorts of issues and brittleness (including accidentally making them almost 3x slower for integers... whoops). These issues would be trivially fixed with native min/max.

I have a PR (#45581) for improving minimum/maximum using other means, but I haven't made much progress lately (last time I worked on it I was stuck trying to fix minimum!/maximum! performance). This PR would make superseding that one very easy on participating architectures (just remove the current specializations), although other architectures would still benefit from #45581. Hopefully I can bring myself to spend some more time on it over the holidays...

@gbaraldi (Member Author) commented Dec 7, 2022

This might not change much in your case; it's just an optimization that LLVM isn't applying for aarch64, and maybe other architectures too, though I can't test those.

@ViralBShah (Member) commented:
Merge?

@giordano (Contributor) commented Jan 2, 2023

For future reference, on M1 I get (min/max are the current implementations on master, llvm_min/llvm_max are the functions introduced in this PR):

julia> @benchmark c .= min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9526 samples with 1 evaluation.
 Range (min … max):  27.708 μs … 239.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.188 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.250 μs ±  12.000 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▅▁█▂▁
  ▇▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▁▁▁▂▂▂▂▃▄▄▆██████▇▅▅▅▄▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  27.7 μs         Histogram: frequency by time         98.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9554 samples with 1 evaluation.
 Range (min … max):  24.709 μs … 265.084 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.459 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.242 μs ±  11.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▁▅▄██▃▃
  █▂▂▂▂▂▂▂▁▁▂▁▁▁▂▁▁▁▁▁▂▂▂▂▂▃▄▅▇████████▇▆▅▅▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  24.7 μs         Histogram: frequency by time         94.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9525 samples with 1 evaluation.
 Range (min … max):  27.708 μs … 137.833 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     65.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.671 μs ±  10.734 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▄▃█▁▁
  ▆▂▂▂▂▂▂▂▁▂▂▁▁▁▂▁▂▁▂▁▁▂▂▂▂▄▄▅▇██████▇▅▅▄▄▄▃▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  27.7 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 9657 samples with 1 evaluation.
 Range (min … max):  24.791 μs … 228.416 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.896 μs ±  11.526 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                ▂▄▄█▃▂
  ▇▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▁▁▁▂▁▂▂▂▂▂▄▄▆▇███████▅▅▄▄▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  24.8 μs         Histogram: frequency by time         93.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On A64FX:

julia> @benchmark c .= min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2920 samples with 1 evaluation.
 Range (min … max):  47.011 μs … 133.952 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.740 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.823 μs ±  12.122 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃
  ██▁▁█▄▃▁▃▁▁▁▃▁▅▁▃▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▃▁▄▅▄▃▁▅▃▁▁▁▁▁▃▇ █
  47 μs         Histogram: log(frequency) by time       130 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_min.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2928 samples with 1 evaluation.
 Range (min … max):  42.890 μs … 127.772 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.820 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.818 μs ±  11.797 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄
  ██▆▅▆█▃▁▃▁▁▁▃▃▃▃▄▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▃▆▄▃▁▃▃▄▁▁▁▅▇ █
  42.9 μs       Histogram: log(frequency) by time       124 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2917 samples with 1 evaluation.
 Range (min … max):  46.791 μs … 153.081 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.730 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.069 μs ±  12.434 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄ ▁ ▁
  ██▄█▆█▄▄▁▁▁▁▄▁▄▄▁▆▅▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▄▃▁▃▅▅▃▁▄▃▁▁▃▁▁▅▇ █
  46.8 μs       Histogram: log(frequency) by time       130 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark c .= llvm_max.(a, b) setup=(a=randn(100_000); b=randn(100_000); c=similar(a)) evals=1
BenchmarkTools.Trial: 2926 samples with 1 evaluation.
 Range (min … max):  42.820 μs … 150.722 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.760 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.785 μs ±  11.994 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅
  ██▆▄█▇▃▁▁▁▁▄▅▃▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▁▄▄▆▁▃▃▃▁▁▁▁▅▇ █
  42.8 μs       Histogram: log(frequency) by time       124 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

There is an improvement, but way less than the initially promised 2.5x.

What's weird is that on A64FX the reduction gets noticeably slower:

julia> a=randn(100_000);

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.562 μs …  1.506 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.882 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.190 μs ± 13.210 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄█
  ▅██▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  192 μs          Histogram: frequency by time          201 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.642 μs … 279.223 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.412 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.638 μs ±   1.409 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▇▆▆▃▃
  ▃▆████████▄▃▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  233 μs           Histogram: frequency by time          241 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(min, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.751 μs … 232.872 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.981 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.143 μs ±   1.226 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▃                                                          ▁
  ████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▇▇▇▇ █
  192 μs        Histogram: log(frequency) by time        200 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_min, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.982 μs … 258.642 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     234.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   234.576 μs ±   1.303 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          █▂
  ▂▂▂▂▃▃▄████▅▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  233 μs           Histogram: frequency by time          243 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On M1, by contrast, I see a small improvement for reductions too.

@gbaraldi (Member Author) commented Jan 2, 2023

I wonder if there is a vectorization difference on the A64FX; maybe SVE doesn't support it? Can you make a for-loop version so that we can analyze a bit further?

@giordano (Contributor) commented Jan 2, 2023

I used

function my_reduce(f, v, init)
    out = init
    @simd for x in v  # @simd allows reassociation so the loop can vectorize
        out = f(out, x)
    end
    return out
end

On A64FX:

julia> a=randn(100_000);

julia> @benchmark my_reduce(max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 2215 samples with 1 evaluation.
 Range (min … max):  2.245 ms … 2.282 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.247 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.249 ms ± 4.297 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃██▇▁
  ▂▃▃▆█████▆▅▄▃▂▁▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂ ▃
  2.25 ms        Histogram: frequency by time       2.26 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark my_reduce(llvm_max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  223.873 μs … 250.233 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     224.373 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.566 μs ±   1.284 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▄                                                         ▁
  ▆▄▅██▇▅▄▁▃▄▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▄▅▇▆▆▇▆▇ █
  224 μs        Histogram: log(frequency) by time        232 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.781 μs … 232.362 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     192.021 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.191 μs ±   1.343 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▁
  ▄██▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  192 μs           Histogram: frequency by time          200 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  232.372 μs … 254.252 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.202 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.465 μs ±   1.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆▅▃█▁
  ▂▇███████▆▅▅▇▅▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  232 μs           Histogram: frequency by time          241 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

my_reduce(llvm_max, ...) is much faster than my_reduce(max, ...), comparable with reduce(llvm_max, ...), but slower than reduce(max, ...).

For comparison, on M1:

julia> @benchmark my_reduce(max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 9845 samples with 1 evaluation.
 Range (min … max):  499.542 μs … 639.291 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     499.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   504.502 μs ±   8.448 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂▁▄▅  ▁▁                      ▁▃▃▁▁▁ ▁▁       ▁▁           ▁
  ████████████▇▆▇▆█▆▇▇▆▇▆▇▇▇▇▇▇█████████████▇▇▇▇▇▇▇██▇▇▅▅▃▅▄▄▄▅ █
  500 μs        Histogram: log(frequency) by time        528 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark my_reduce(llvm_max, $a, typemin(eltype(a)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  62.583 μs … 166.792 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     62.709 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.508 μs ±   2.732 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃         ▄           ▁                                     ▁
  ██▅▇▇▇█▇▇▇▇█▇▇██▅▆▆▆▆▇▆█▆▅▅▄▅▅▅▅▄▄▅▄▅▅▄▄▅▄▄▃▅▅▅▅▅▆▆▆▅▅▅▄▄▅▄▆ █
  62.6 μs       Histogram: log(frequency) by time      74.6 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark reduce(max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  63.750 μs … 95.125 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     63.875 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.527 μs ±  2.195 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄           ▁▁                                             ▁
  ██▆▇▇▇██▇▇▇▇▇██▆▆▅▄▆▆▆▆▅▅▃▄▄▅▆▆▆▆▄▅▆▄▅▅▄▅▄▄▆▄▅▃▅▆▆▆▅▅▅▅▅▅▄▆ █
  63.8 μs      Histogram: log(frequency) by time        76 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  57.541 μs … 169.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.708 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.827 μs ±   3.136 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄        ▅▁  ▁      ▃                                       ▁
  ██▆██████▇██████▇▇▇▇██▇█▆▆▅▆▇▆▆▆▅▅▅▅▅▅▆▅▄▅▅▅▅▄▅▆▆▆▆▆▆▅▆▆▅▄▅▄ █
  57.5 μs       Histogram: log(frequency) by time      69.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@gbaraldi (Member Author) commented Jan 2, 2023

That's doubly weird. It's a shame; the mapreduce machinery is probably hiding something here. Does profiling show a difference somewhere?

@giordano (Contributor) commented Jan 2, 2023

OK, it turns out that reduce(max, a) is faster than reduce(llvm_max, a) because reduce(::typeof(max), a) is somehow internally optimised, even though it calls the same reduction operator as reduce(::typeof(some_other_function), a):

julia> @benchmark reduce(max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4939 samples with 1 evaluation.
 Range (min … max):  189.862 μs …  1.354 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     190.961 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   191.515 μs ± 16.729 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▆█▇  ▄▁▂▁▁                                                 ▁
  █████▇▆█████▆▁▃▁▁▁▁▁▁▃▃▁▁▁▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▅▃▅▅▄▅▆ █
  190 μs        Histogram: log(frequency) by time       208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4638 samples with 1 evaluation.
 Range (min … max):  232.332 μs … 285.223 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   233.619 μs ±   2.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▇▇▆▃                                                       ▂
  ████████▃▃▁▁▁▃▆▇▇▇▃█▇▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▅▅▄▁▆▄▄▆▇▆▆▇ █
  232 μs        Histogram: log(frequency) by time        250 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Base.max(x::T, y::T) where {T<:Union{Float32,Float64}} = llvm_max(x, y)

julia> @benchmark reduce(max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4848 samples with 1 evaluation.
 Range (min … max):  186.571 μs … 240.413 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     187.592 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   187.891 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇▂▅█
  ▇████▄▂▃▃▂▂▂▂▂▁▁▁▂▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▁▂▂▂▂▂▂▂▂ ▃
  187 μs           Histogram: frequency by time          205 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reduce(llvm_max, a) setup=(a=randn(100_000)) evals=1
BenchmarkTools.Trial: 4637 samples with 1 evaluation.
 Range (min … max):  232.372 μs …  1.501 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     233.192 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   234.058 μs ± 18.959 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▇▇▅▁                                                      ▂
  ███████▆▁▁▁▁▅███▆▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▅▆▅▄▅▆▄▆▅▆▅▆▇▅ █
  232 μs        Histogram: log(frequency) by time       251 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

After redefining Base.max (so basically what this PR does), reduce(max, a) should in principle be the same as reduce(llvm_max, a), but performance differs substantially, which confused me.
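
A plausible explanation (my reading, not verified in this thread) is that Base ships a dedicated mapreduce_impl method for op::Union{typeof(max),typeof(min)} in base/reduce.jl, so reduce(max, a) takes that specialized path regardless of how max itself is defined, while reduce(llvm_max, a) goes through the generic machinery. One way to check where dispatch lands:

julia> using InteractiveUtils

julia> a = randn(100_000);

julia> @which reduce(max, a)   # then follow the chain into mapreduce_impl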

At this point I'd say this PR is a net (marginal) improvement on both CPUs.

@gbaraldi (Member Author) commented Jan 2, 2023

Thanks for looking into it!

@giordano (Contributor) commented Jan 3, 2023

Failures on Windows are unrelated (#48101); I'm going to merge this. Thanks Gabriel!
