Changing the default LU? #357

ChrisRackauckas · 2023-08-08T03:49:07Z

This is a thread for investigating changes to the LU defaults, based off of benchmarks like #356 .

(Note: there's a Mac-specific version 3 posts down)

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), MKLLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0= rand(rng, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

lubench.pdf

The justification for RecursiveFactorization.jl still looks very strong from the looks of this.

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 32 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 32

Needs examples on other systems.

The text was updated successfully, but these errors were encountered:

ChrisRackauckas · 2023-08-08T03:49:17Z

OpenBLAS looks drunk.

oscardssmith · 2023-08-08T04:11:54Z

openBLAS is just multithreading way too small. RLFU looks like a good option. Is the (pre)compile time impact reasonable?

ChrisRackauckas · 2023-08-08T04:47:38Z

A Mac version:

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [LUFactorization(), GenericLUFactorization(), RFLUFactorization(), AppleAccelerateLUFactorization(), FastLUFactorization(), SimpleLUFactorization()]
res = [Float64[] for i in 1:length(algs)]

ns = 4:8:500
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, n, n)
    global b = rand(rng, n)
    global u0= rand(rng, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("lubench.png")
savefig("lubench.pdf")

lubench.pdf

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

oscardssmith · 2023-08-08T05:01:28Z

It's interesting that RFLU does poorly on mac. Does it not know about the lower vectorization width?

ViralBShah · 2023-08-08T12:57:29Z

Ideally you should have a tune or plan API, which can do it on a user's system. Applications that care about it can opt into a tuning run and save these preferences. If something like a switch-able BLAS package gets done, this sort of tuning can be even easier.

vpuri3 · 2023-08-08T14:20:50Z

Running the Mac specific version from above comment

 julia +beta --startup-file=no --proj
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta1 (2023-07-25)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(tmp) pkg> st
Status `~/.julia/dev/tmp/Project.toml`
  [13e28ba4] AppleAccelerate v0.4.0
  [6e4b80f9] BenchmarkTools v1.3.2
  [7ed4a6bd] LinearSolve v2.5.0
  [dde4c033] Metal v0.5.0
  [91a5bcdd] Plots v1.38.17
  [3d5dd08c] VectorizationBase v0.21.64

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 8
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_PKG_DEVDIR = /Users/vp/.julia/dev

[lubench.pdf](https://github.com/SciML/LinearSolve.jl/files/12

292316/lubench.pdf)

oscardssmith · 2023-08-08T14:22:30Z

Julia Version 1.11.0-DEV.235
Commit 9f9e989f24 (2023-08-06 04:35 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4,1

Looks like RFLU is really good for small sizes but isn't doing that well once all your threads have non-trivial sized problems.

ChrisRackauckas · 2023-08-08T14:31:44Z

I wonder if that's a trend of Intel vs AMD. Need more data.

nilshg · 2023-08-08T14:38:17Z

Here is:

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

Might make sense to remove the two lowest performing methods here to speed up benchmarks given they probably won't be chosen?

ChrisRackauckas · 2023-08-08T14:40:14Z

yeah maybe though it doesn't change the plot much and confirms at what point the BLAS matters

oscardssmith · 2023-08-08T14:43:18Z

I wonder if that's a trend of Intel vs AMD. Need more data.

I'm pretty sure it's number of threads. Openblas generally is pretty bad at ramping up the number of threads as size increases and often just goes straight from 1 to full multi-threading. As such on CPUs with lots of cores it performs incredibly badly in the region where it should be using 2-4 cores and is instead using 16.

ChrisRackauckas · 2023-08-08T14:45:25Z

But that gives no explanation to what was actually mentioned, which has no OpenBLAS but is RecursiveFactorization vs MKL and where the cutoff is. From the looks so far, I'd say:

On Mac, default to always using accelerate
On Intel, default to RecursiveFactorization cutoff at n=150 switch to MKL
On AMD, default to RecursiveFactorization no cutoff

and never doing OpenBLAS.

ejmeitz · 2023-08-08T14:50:56Z

When I tried to run this the code errored on the 348 x 348 problem size (twice). Not sure what's up with that since I made a clean tmp environment. Could just be me but thought I'd share.

MethodError: no method matching add_bisecting_if_branches!(::Expr, ::Int64, ::Int64, ::Int64, ::Bool)
The applicable method may be too new: running in world age 35110, while current world is 37210.

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores

DaniGlez · 2023-08-08T14:52:20Z

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7800X3D 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

oscardssmith · 2023-08-08T14:55:34Z

that's interesting. There are now two of us that have a fairly noticeable performance bump in the 350 to 380 region for RFLU. Any ideas as to what could be causing it? It looks like we aren't using more threads soon enough maybe?

ejmeitz · 2023-08-08T14:59:05Z

Considering mine crashes in the RFLU at 350, something definitely going on there.

ChrisRackauckas · 2023-08-08T15:01:15Z

@chriselrod

chriselrod · 2023-08-08T15:11:57Z

Multithreaded and singlethreaded are going to look very different.

MKL does very well multithreaded, while OpenBLAS is awful (as has been discussed).

RF does not scale well with multiple threads. That is a known issue, but Yingbo and I never had time to address it.

chriselrod · 2023-08-08T15:12:38Z

@ejmeitz, you aren't doing something weird like not using precompiled modules, are you?

ejmeitz · 2023-08-08T15:16:12Z

When I added the packages it spammed the message below. I started from a clean env so I just kind of ignored the messages, but that is probably the issue. I did run precompile after adding all the packages just to be sure though.

┌ Warning: Module VectorizationBase with build ID fafbfcfd-2196-d9ff-0000-9410f7322d5a is missing from the cache.
│ This may mean VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f] does not support precompilation but is imported by a module that does.

chriselrod · 2023-08-08T15:16:41Z

Nuke your precompile cache and try again.

$ rm -rf ~/.julia/compiled/

ejmeitz · 2023-08-08T15:44:54Z

That fixed it, thanks! RFLU seems better on my machine for longer than some of the others.

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 40 on 80 virtual cores

chriselrod · 2023-08-08T15:46:28Z

julia> versioninfo()
Julia Version 1.11.0-DEV.238
Commit 8b8da91ad7 (2023-08-08 01:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores

ChrisRackauckas · 2023-08-08T15:48:09Z

@ejmeitz your GFLOPs are generally very low, getting stomped by much cheaper CPUs. Maybe it's the clock rate?

chriselrod · 2023-08-08T15:51:22Z

Likely, because of the poor multithreaded scaling.

The big winner here is the Ryzen 7800X3D. It probably wants more multithreading to kick in at a much smaller size.

ejmeitz · 2023-08-08T15:54:08Z

I noticed that too. I also thought it would be the clocks (max turbo is 4 GHz) but it still felt low to me. Probably a combo of poor multithread scaling and the clocks being low. I can run it on 128 core AMD CPU if you'd be curious to see that data.

ChrisRackauckas · 2023-08-08T15:54:54Z

Definitely curious now.

Leticia-maria · 2023-08-08T16:00:31Z

I have run (sorry for the delay, had to install/upgrade some dependencies):

Julia Version 1.10.0-alpha1
Commit f8ad15f7b16 (2023-07-06 10:36 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  LD_LIBRARY_PATH = 
  JULIA_NUM_THREADS = 4

mastrof · 2023-08-08T16:30:52Z

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 16 on 16 virtual cores

This is instead what I get running julia with -t 1

albheim · 2023-08-09T08:10:14Z

Using single julia thread

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Using julia -t auto

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 11 on 8 virtual cores

lgoettgens · 2023-08-09T09:37:14Z

single thread

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

julia -t auto

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores

ViralBShah · 2023-08-09T12:15:14Z

@ChrisRackauckas Where can I find out what each of the factorizations does?

chriselrod · 2023-08-09T12:26:16Z

I think we may want

@static if Sys.ARCH === :x86_64
const libMKL = MKL_jll.libmkl_rt # more convenient name
function mkl_set_num_threads(N::Integer)
    ccall((:MKL_Set_Num_Threads,libMKL), Cvoid, (Int32,), N % Int32)
end
mkl_set_num_threads(Threads.nthreads())
end

for single threaded comparisons.

Single threaded performance where we actually restrict the solves to a single thread would likely be useful for ensemble solves.

ChrisRackauckas · 2023-08-09T12:28:18Z

@ViralBShah just the standard docs https://docs.sciml.ai/LinearSolve/stable/solvers/solvers/. Accelerate isn't in there yet though. I'll PR with Metal.jl as an option too when I land.

ViralBShah · 2023-08-09T12:28:20Z

Note that the way to get/set MKL threads should be through the domain API: JuliaLinearAlgebra/libblastrampoline#74

LBT doesn't do that but should be easy to do it here.

chriselrod · 2023-08-09T14:20:01Z

Note that the way to get/set MKL threads should be through the domain API: JuliaLinearAlgebra/libblastrampoline#74

LBT doesn't do that but should be easy to do it here.

Okay, so we should use something like

using MKL_jll
mkl_blas_set_num_threads(numthreads::Int) =
           Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
           Cuint, (Cint,Cint), numthreads, 1))

Or, more elaborately

using MKL_jll
mkl_set_num_threads(numthreads::Int, domain::Cint = zero(Cint)) =
           Bool(ccall((:MKL_Domain_Set_Num_Threads, MKL_jll.libmkl_rt),
           Cuint, (Cint,Cint), numthreads, domain))


const MKL_DOMAIN_ALL = Cint(0)
const MKL_DOMAIN_BLAS = Cint(1)
const MKL_DOMAIN_FFT = Cint(2)
const MKL_DOMAIN_VML = Cint(3)
const MKL_DOMAIN_PARDISO = Cint(4)

mkl_set_num_blas_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_BLAS)
mkl_set_num_fft_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_FFT)
mkl_set_num_vml_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_VML)
mkl_set_num_pardiso_threads(numthreads) = 
  mkl_set_num_threads(numthreads, MKL_DOMAIN_PARDISO)

chriselrod · 2023-08-09T14:43:18Z

Single threaded on the 10980XE.
Multithreaded: #357 (comment)

Note that RFLU did not benefit from multithreading until like 450x450. It was hurt below that.

ChrisRackauckas · 2023-08-09T16:43:06Z

I setup Metal.jl:

#361

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, Metal
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [AppleAccelerateLUFactorization(), MetalLUFactorization()]
res = [Float32[] for i in 1:length(algs)]

ns = 200:600:15000
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, Float32, n, n)
    global b = rand(rng, Float32, n)
    global u0= rand(rng, Float32, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        GC.gc()
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("metal_large_lubench.png")
savefig("metal_large_lubench.pdf")

metallubench.pdf

metal_large_lubench.pdf

ChrisRackauckas · 2023-08-09T17:07:25Z

Can I get some results of folks doing CUDA offloading with the following script?

using BenchmarkTools, Random, VectorizationBase
using LinearAlgebra, LinearSolve, CUDA, MKL_jll
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

function luflop(m, n = m; innerflop = 2)
    sum(1:min(m, n)) do k
        invflop = 1
        scaleflop = isempty((k + 1):m) ? 0 : sum((k + 1):m)
        updateflop = isempty((k + 1):n) ? 0 :
                     sum((k + 1):n) do j
            isempty((k + 1):m) ? 0 : sum((k + 1):m) do i
                innerflop
            end
        end
        invflop + scaleflop + updateflop
    end
end

algs = [MKLLUFactorization(), CUDAOffloadFactorization()]
res = [Float32[] for i in 1:length(algs)]

ns = 200:400:10000
for i in 1:length(ns)
    n = ns[i]
    @info "$n × $n"
    rng = MersenneTwister(123)
    global A = rand(rng, Float32, n, n)
    global b = rand(rng, Float32, n)
    global u0= rand(rng, Float32, n)
    
    for j in 1:length(algs)
        bt = @belapsed solve(prob, $(algs[j])).u setup=(prob = LinearProblem(copy(A), copy(b); u0 = copy(u0), alias_A=true, alias_b=true))
        push!(res[j], luflop(n) / bt / 1e9)
    end
end

using Plots
__parameterless_type(T) = Base.typename(T).wrapper
parameterless_type(x) = __parameterless_type(typeof(x))
parameterless_type(::Type{T}) where {T} = __parameterless_type(T)

p = plot(ns, res[1]; ylabel = "GFLOPs", xlabel = "N", title = "GFLOPs for NxN LU Factorization", label = string(Symbol(parameterless_type(algs[1]))), legend=:outertopright)
for i in 2:length(res)
    plot!(p, ns, res[i]; label = string(Symbol(parameterless_type(algs[i]))))
end
p

savefig("cudaoffloadlubench.png")
savefig("cudaoffloadlubench.pdf")

joelandman · 2023-08-09T17:23:34Z

julia> algs = [MKLLUFactorization(), CUDAOffloadFactorization()]
ERROR: UndefVarError: CUDAOffloadFactorization not defined
Stacktrace:
[1] top-level scope
@ REPL[8]:1
[2] top-level scope
@ ~/.julia/packages/CUDA/tVtYo/src/initialization.jl:185

(@v1.9) pkg> st CUDA
Status ~/.julia/environments/v1.9/Project.toml
[052768ef] CUDA v4.4.0

Is this CUDA.jl from the repo tip?

ChrisRackauckas · 2023-08-09T17:24:24Z

It should load when you do using CUDA because it's an extension library.

christiangnrd · 2023-08-09T17:31:28Z

@joelandman @ChrisRackauckas There's a capitalization typo in the benchmark it should be CudaOffloadFactorization.

chriselrod · 2023-08-09T17:36:32Z

[ Info: 200 × 200
ERROR: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})

Closest candidates are:
  getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
   @ LinearSolveMKLExt ~/.julia/packages/LinearSolve/Tcmzb/ext/LinearSolveMKLExt.jl:13

carstenbauer · 2023-08-09T17:43:06Z

I can run the snippet on A100 and A40. However, I get

ERROR: LoadError: UndefVarError: `MKLLUFactorization` not defined
Stacktrace:
 [1] top-level scope
   @ /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21
 [2] include(fname::String)
   @ Base.MainInclude ./client.jl:478
 [3] top-level scope
   @ REPL[1]:1
in expression starting at /scratch/pc2-mitarbeiter/bauerc/playground/linearsolvetest/script.jl:21

Update: With LinearSolve#main I get the same error as @chriselrod above:

julia> include("script.jl")
[ Info: 200 × 200
ERROR: LoadError: MethodError: no method matching getrf!(::Matrix{Float32}; ipiv::Vector{Int64}, info::Base.RefValue{Int64})

Closest candidates are:
  getrf!(::AbstractMatrix{<:Float64}; ipiv, info, check)
   @ LinearSolveMKLExt /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:13

Stacktrace:
  [1] #solve!#2
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:45 [inlined]
  [2] solve!
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/ext/LinearSolveMKLExt.jl:39 [inlined]
  [3] #solve!#6
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:197 [inlined]
  [4] solve!
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:196 [inlined]
  [5] #solve#5
    @ /scratch/pc2-mitarbeiter/bauerc/.julia/packages/LinearSolve/KDj8F/src/common.jl:193 [inlined]
  [6] solve
[...]

joelandman · 2023-08-09T17:49:46Z

I threw together a quick patch for LinearSolverMKLExt.jl to accomodate the Float32 version. @ChrisRackauckas please let me know if you want a PR or a patch for it. Results incoming (running now)

ChrisRackauckas · 2023-08-09T17:59:16Z

I put an MKL 32-bit patch into the MKL PR #361. I noticed that it's not using the MKL backsolve so that could potentially make that a bit faster, but it shouldn't effect the CUDA cutoff point.

joelandman · 2023-08-09T18:03:04Z

cudaoffloadlubench.pdf

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1621 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+

joelandman · 2023-08-09T18:03:48Z

Same zen2 laptop

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
Threads: 16 on 16 virtual cores
Environment:
LD_LIBRARY_PATH =
JULIA_HOME = /home/joe/local

carstenbauer · 2023-08-09T18:14:49Z

Julia 1.9.2 (1 Julia thread)

A40 + Intel(R) Xeon(R) Gold 6148F:

A100 + AMD EPYC 7742:

tylerjthomas9 · 2023-08-09T18:22:02Z

RTX 3090 + 5950x

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores

NVIDIA-SMI 525.125.06   
Driver Version: 525.125.06   
CUDA Version: 12.0
NVIDIA GeForce RTX 3090 (420W)

A6000 ADA + EPYC 7713

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 256 × AMD EPYC 7713 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 256 virtual cores

NVIDIA-SMI 535.86.05              
Driver Version: 535.86.05    
CUDA Version: 12.2 
NVIDIA RTX 6000 Ada (300W)

zygi · 2023-08-09T18:40:50Z

i9-13900K, RTX 4090

Versioninfo:

Details

``` julia> versioninfo() Julia Version 1.9.2 Commit e4ee485e909 (2023-07-05 09:39 UTC) Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-14.0.6 (ORCJIT, goldmont) Threads: 1 on 32 virtual cores ``` ``` julia> CUDA.versioninfo() CUDA runtime 12.1, artifact installation CUDA driver 12.2 NVIDIA driver 535.86.5

CUDA libraries:

CUBLAS: 12.1.3
CURAND: 10.3.2
CUFFT: 11.0.2
CUSOLVER: 11.4.5
CUSPARSE: 12.1.0
CUPTI: 18.0.0
NVML: 12.0.0+535.86.5

Julia packages:

CUDA: 4.4.0
CUDA_Driver_jll: 0.5.0+1
CUDA_Runtime_jll: 0.6.0+0

Toolchain:

Julia: 1.9.2
LLVM: 14.0.6
PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
0: NVIDIA GeForce RTX 4090 (sm_89, 18.567 GiB / 23.988 GiB available)



</p>
</details>

christiangnrd · 2023-08-09T18:44:17Z

All running julia 1.9.2

RTX 3060 + 3700X

M2 Max 30-core GPU

chriselrod · 2023-08-09T21:14:00Z

I was hoping that these benchmarks would show that we should drop RecursiveFactorization from the defaults now that we have MKL, but they just don't show that.

Another 8% improvement in RF coming up:
JuliaLinearAlgebra/RecursiveFactorization.jl#84
The butterfly may help even more, when we get around to it.

ChrisRackauckas · 2023-08-10T08:44:45Z

It looks like GPU offloading doesn't make sense once things are using MKL. Cutoff is >1000

chriselrod · 2023-08-11T03:15:20Z

Accidentally forgot to do --startup=no.
By the looks of it, I'd need a 4090 to beat this CPU.

mikeingold · 2023-09-26T01:37:48Z

For what it's worth, a similar system to the given example but a generation older and apparently running fewer threads (16 vs 32).

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 16 on 32 virtual cores

Environment:
  JULIA_NUM_THREADS = 16

ChrisRackauckas · 2023-10-26T07:24:13Z

Thanks everyone, the defaults take these into account. Now MKL is defaulted to in many scenarios (along with AppleAccelerate on macs), with a switch at 200 which seems to be a roughly optimal spot to go from RFLU to MKL

ChrisRackauckas mentioned this issue Aug 8, 2023

Sparse LU benchmarking #359

Closed

ChrisRackauckas closed this as completed Oct 26, 2023

chriselrod mentioned this issue Jan 10, 2024

LoopVectorization.jl causing segfaults on 1.11 JuliaSIMD/LoopVectorization.jl#525

Closed

Changing the default LU? #357

Changing the default LU? #357

Comments

ChrisRackauckas commented Aug 8, 2023 • edited Loading

ChrisRackauckas commented Aug 8, 2023

oscardssmith commented Aug 8, 2023

ChrisRackauckas commented Aug 8, 2023 • edited Loading

oscardssmith commented Aug 8, 2023

ViralBShah commented Aug 8, 2023 • edited Loading

vpuri3 commented Aug 8, 2023

oscardssmith commented Aug 8, 2023 • edited Loading

ChrisRackauckas commented Aug 8, 2023

nilshg commented Aug 8, 2023

ChrisRackauckas commented Aug 8, 2023

oscardssmith commented Aug 8, 2023

ChrisRackauckas commented Aug 8, 2023

ejmeitz commented Aug 8, 2023

DaniGlez commented Aug 8, 2023

oscardssmith commented Aug 8, 2023

ejmeitz commented Aug 8, 2023 • edited Loading

ChrisRackauckas commented Aug 8, 2023

chriselrod commented Aug 8, 2023

chriselrod commented Aug 8, 2023

ejmeitz commented Aug 8, 2023

chriselrod commented Aug 8, 2023 • edited Loading

ejmeitz commented Aug 8, 2023 • edited Loading

chriselrod commented Aug 8, 2023 • edited Loading

ChrisRackauckas commented Aug 8, 2023

chriselrod commented Aug 8, 2023 • edited Loading

ejmeitz commented Aug 8, 2023

ChrisRackauckas commented Aug 8, 2023

Leticia-maria commented Aug 8, 2023

mastrof commented Aug 8, 2023

albheim commented Aug 9, 2023

lgoettgens commented Aug 9, 2023

ViralBShah commented Aug 9, 2023

chriselrod commented Aug 9, 2023

ChrisRackauckas commented Aug 9, 2023

ViralBShah commented Aug 9, 2023

chriselrod commented Aug 9, 2023 • edited Loading

chriselrod commented Aug 9, 2023 • edited Loading

ChrisRackauckas commented Aug 9, 2023 • edited Loading

ChrisRackauckas commented Aug 9, 2023

joelandman commented Aug 9, 2023

ChrisRackauckas commented Aug 9, 2023

christiangnrd commented Aug 9, 2023

chriselrod commented Aug 9, 2023

carstenbauer commented Aug 9, 2023 • edited Loading

joelandman commented Aug 9, 2023

ChrisRackauckas commented Aug 9, 2023

joelandman commented Aug 9, 2023

joelandman commented Aug 9, 2023

carstenbauer commented Aug 9, 2023 • edited Loading

tylerjthomas9 commented Aug 9, 2023

RTX 3090 + 5950x

A6000 ADA + EPYC 7713

zygi commented Aug 9, 2023

i9-13900K, RTX 4090

christiangnrd commented Aug 9, 2023

RTX 3060 + 3700X

M2 Max 30-core GPU

chriselrod commented Aug 9, 2023

ChrisRackauckas commented Aug 10, 2023

chriselrod commented Aug 11, 2023 • edited Loading

mikeingold commented Sep 26, 2023

ChrisRackauckas commented Oct 26, 2023

ChrisRackauckas commented Aug 8, 2023 •

edited

Loading

ChrisRackauckas commented Aug 8, 2023 •

edited

Loading

ViralBShah commented Aug 8, 2023 •

edited

Loading

oscardssmith commented Aug 8, 2023 •

edited

Loading

ejmeitz commented Aug 8, 2023 •

edited

Loading

chriselrod commented Aug 8, 2023 •

edited

Loading

ejmeitz commented Aug 8, 2023 •

edited

Loading

chriselrod commented Aug 8, 2023 •

edited

Loading

chriselrod commented Aug 8, 2023 •

edited

Loading

chriselrod commented Aug 9, 2023 •

edited

Loading

chriselrod commented Aug 9, 2023 •

edited

Loading

ChrisRackauckas commented Aug 9, 2023 •

edited

Loading

carstenbauer commented Aug 9, 2023 •

edited

Loading

carstenbauer commented Aug 9, 2023 •

edited

Loading

chriselrod commented Aug 11, 2023 •

edited

Loading