Krylov.jl optimizations for ODE usage #1563
@chriselrod we can continue the discussion from #1551 (comment) here. The example above highlights the comparison vs. Sundials for each of the cases. The ILU case is the one that is back- and forward-substitution limited. The Krylov.jl without preconditioners case is the one that is a bit BLAS fishy. As for the AMG case, I don't know why that one is so slow without Jacobi smoothing.
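For context, a rough sketch of how the ILU and AMG-with-Jacobi-smoothing preconditioners can be supplied through the precs keyword (API as described in the SciML preconditioner docs; the drop tolerance, the smoother setup, and the assumption of a sparse Jacobian prototype are illustrative, not the exact settings benchmarked here):

using OrdinaryDiffEq, LinearSolve, IncompleteLU, AlgebraicMultigrid

# ILU left preconditioner, rebuilt whenever the W matrix changes
function incompletelu(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        Pl = ilu(convert(AbstractMatrix, W), τ = 50.0)  # drop tolerance is problem dependent
    else
        Pl = Plprev
    end
    Pl, nothing
end

# Ruge-Stuben AMG with Jacobi smoothing, the variant discussed above
function amg_jacobi(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        A = convert(AbstractMatrix, W)
        Pl = aspreconditioner(ruge_stuben(A,
                 presmoother  = AlgebraicMultigrid.Jacobi(rand(size(A, 1))),
                 postsmoother = AlgebraicMultigrid.Jacobi(rand(size(A, 1)))))
    else
        Pl = Plprev
    end
    Pl, nothing
end

# assumes the problem was defined with a sparse jac_prototype so W converts to a SparseMatrixCSC
solve(prob_ode_brusselator_2d,
      Rodas4(linsolve = KrylovJL_GMRES(), precs = incompletelu, concrete_jac = true),
      save_everystep = false)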
No, it's like an order of magnitude faster than the IterativeSolvers.jl one. IterativeSolvers.jl isn't even in the running.
To optimize the performance of Krylov methods, you could use …
#1551 (comment) has a bunch of examples. IterativeSolvers.jl is just aggressively bad here and I haven't had time to look into it, but it was far enough away that I wasn't going to spend any more time with it. Krylov.jl is much better, but I did see a few things: 1/3 of the time was in the …
# Assumes OrdinaryDiffEq, LinearSolve, and a 2D Brusselator ODEProblem
# (e.g. prob_ode_brusselator_2d from the problem library) are already loaded.
@time solve(prob_ode_brusselator_2d,Rosenbrock23(linsolve=KrylovJL_GMRES()),save_everystep=false);
# 0.891711 seconds (230.15 k allocations: 184.457 MiB, 5.10% gc time)
@time solve(prob_ode_brusselator_2d,Rosenbrock23(linsolve=IterativeSolversJL_GMRES()),save_everystep=false);
# 6.231675 seconds (1.37 M allocations: 206.219 MiB, 0.28% gc time)
@time solve(prob_ode_brusselator_2d,Rodas4(linsolve=IterativeSolversJL_GMRES()),save_everystep=false);
# 8.095367 seconds (1.82 M allocations: 296.098 MiB, 0.38% gc time)
@time solve(prob_ode_brusselator_2d,Rodas4(linsolve=KrylovJL_GMRES()),save_everystep=false);
# 1.215097 seconds (317.01 k allocations: 152.230 MiB, 3.26% gc time)
I will do some tests with a pure Julia implementation.
Specifically, this is the issue that can really hurt users of default Julia installations: JuliaLang/julia#33409
You opened the issue in 2019 and it's still not fixed?! 😞
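Until it is fixed, a simple mitigation sketch is to cap the BLAS thread count by hand (halving the hardware thread count is an assumption, roughly the physical core count on the machines benchmarked below):

using LinearAlgebra
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)  # cap the OpenBLAS thread count
BLAS.get_num_threads()                     # check what is actually in use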
I did some benchmarks with …
Code used for the benchmarks: …
Do we have an efficient Julia implementation of dot?
function dot_julia(x, y)
    s = zero(Base.promote_eltype(x, y))
    @fastmath for i in eachindex(x, y)
        @inbounds s += x[i]' * y[i]
    end
    return s
end
This is fairly good, but will probably fall behind for large sizes vs. multithreaded BLAS on x86 CPUs, because they tend to get better total memory bandwidth when more than one core is active.
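One way to get that multithreading in pure Julia is per-thread partial sums; a rough sketch (the chunking scheme here is illustrative, and the benchmark script later in this issue uses LoopVectorization's @tturbo instead):

function dot_threads(x, y)
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(x, y), cld(length(x), nt)))
    partials = zeros(Base.promote_eltype(x, y), length(chunks))
    Threads.@threads for c in 1:length(chunks)
        s = zero(eltype(partials))
        @inbounds @fastmath for i in chunks[c]
            s += x[i]' * y[i]
        end
        partials[c] = s
    end
    return sum(partials)
end

dot_threads(rand(10^6), rand(10^6))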
It definitely depends on the CPU. The first table is absolute times in seconds (each entry times a loop of 1000 dot calls, as in the script below); the second table gives each row relative to its fastest entry.
┌───────────┬─────────────────────┬─────────────────────┬───────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │
├───────────┼─────────────────────┼─────────────────────┼───────────┤
│ 10.0 │ 7.95e-5 │ 8.2e-5 │ 1.1e-5 │
│ 100.0 │ 6.31e-5 │ 6.22e-5 │ 2.06e-5 │
│ 1000.0 │ 0.0001188 │ 0.0001227 │ 0.0001226 │
│ 10000.0 │ 0.0011194 │ 0.001119 │ 0.0007435 │
│ 100000.0 │ 0.0421721 │ 0.0706043 │ 0.0024566 │
│ 1.0e6 │ 0.0681789 │ 0.0977464 │ 0.0242952 │
│ 1.0e7 │ 3.81686 │ 3.74455 │ 3.88031 │
└───────────┴─────────────────────┴─────────────────────┴───────────┘
┌───────────┬─────────────────────┬─────────────────────┬─────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │
├───────────┼─────────────────────┼─────────────────────┼─────────┤
│ 10.0 │ 7.22727 │ 7.45455 │ 1.0 │
│ 100.0 │ 3.06311 │ 3.01942 │ 1.0 │
│ 1000.0 │ 1.0 │ 1.03283 │ 1.03199 │
│ 10000.0 │ 1.50558 │ 1.50504 │ 1.0 │
│ 100000.0 │ 17.1669 │ 28.7407 │ 1.0 │
│ 1.0e6 │ 2.80627 │ 4.02328 │ 1.0 │
│ 1.0e7 │ 1.01931 │ 1.0 │ 1.03626 │
└───────────┴─────────────────────┴─────────────────────┴─────────┘
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Environment:
JULIA_EDITOR = "C:\Users\accou\AppData\Local\atom\app-1.58.0\atom.exe" -a
JULIA_NUM_THREADS = 32
The same benchmark with the pure Julia dot added as a column:
┌───────────┬─────────────────────┬─────────────────────┬───────────┬───────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │ Julia │
├───────────┼─────────────────────┼─────────────────────┼───────────┼───────────┤
│ 10.0 │ 7.95e-5 │ 8.2e-5 │ 1.1e-5 │ 3.3125e-6 │
│ 100.0 │ 6.31e-5 │ 6.22e-5 │ 2.06e-5 │ 3.45e-5 │
│ 1000.0 │ 0.0001188 │ 0.0001227 │ 0.0001226 │ 0.0003104 │
│ 10000.0 │ 0.0011194 │ 0.001119 │ 0.0007435 │ 0.003095 │
│ 100000.0 │ 0.0421721 │ 0.0706043 │ 0.0024566 │ 0.0313813 │
│ 1.0e6 │ 0.0681789 │ 0.0977464 │ 0.0242952 │ 0.315843 │
│ 1.0e7 │ 3.81686 │ 3.74455 │ 3.88031 │ 3.15861 │
└───────────┴─────────────────────┴─────────────────────┴───────────┴───────────┘
┌───────────┬─────────────────────┬─────────────────────┬─────────┬─────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │ Julia │
├───────────┼─────────────────────┼─────────────────────┼─────────┼─────────┤
│ 10.0 │ 24.0 │ 24.7547 │ 3.32075 │ 1.0 │
│ 100.0 │ 3.06311 │ 3.01942 │ 1.0 │ 1.67476 │
│ 1000.0 │ 1.0 │ 1.03283 │ 1.03199 │ 2.61279 │
│ 10000.0 │ 1.50558 │ 1.50504 │ 1.0 │ 4.16274 │
│ 100000.0 │ 17.1669 │ 28.7407 │ 1.0 │ 12.7743 │
│ 1.0e6 │ 2.80627 │ 4.02328 │ 1.0 │ 13.0002 │
│ 1.0e7 │ 1.2084 │ 1.18551 │ 1.22849 │ 1.0 │
└───────────┴─────────────────────┴─────────────────────┴─────────┴─────────┘
I did other benchmarks yesterday with complex numbers for …
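For reference, the complex-valued analogue goes through the conjugating BLAS kernel dotc; a minimal timing sketch (the size and the plain dot comparison are illustrative), noting that @tturbo does not natively support complex element types:

using BenchmarkTools, LinearAlgebra

n = 10_000
x = rand(ComplexF64, n)
y = rand(ComplexF64, n)

@benchmark BLAS.dotc($n, $x, 1, $y, 1)  # conjugating dot product, i.e. x' * y
@benchmark dot($x, $y)                  # LinearAlgebra.dot dispatches to the same BLAS routine here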
Updated script with some fixes:
using BenchmarkTools, LinearAlgebra, LoopVectorization
# `@noinline`, otherwise the compiler optimizes it away
@noinline function dot_julia(x, y)
    s = zero(Base.promote_eltype(x, y))
    # I thought Julia's bounds-check elimination pass would be able
    # to remove the bounds checks because of the `eachindex`, but it
    # does not here, hence the explicit `@inbounds`.
    @fastmath for i in eachindex(x, y)
        @inbounds s += x[i]' * y[i]
    end
    return s
end
function dot_tturbo(x, y)
    s = zero(Base.promote_eltype(x, y))
    @tturbo for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end
krylov_dot(n :: Integer, x :: Vector{T}, dx :: Integer, y :: Vector{T}, dy :: Integer) where T <: BLAS.BlasReal = BLAS.dot(n, x, dx, y, dy)
p = 7
time_openblas1 = zeros(p);
time_openblas2 = zeros(p);
time_mkl = zeros(p);
time_julia = zeros(p);
time_tturbo = zeros(p);
function belapsed_time(br)
    mintime = BenchmarkTools.time(minimum(br))
    allocs = BenchmarkTools.allocs(br)
    println(" ", BenchmarkTools.prettytime(mintime),
            " (", allocs, " allocation", allocs == 1 ? "" : "s", ": ",
            BenchmarkTools.prettymemory(BenchmarkTools.memory(br)), ")")
    return mintime
end
for i = 1:p
    n = 10^i
    println("Dimension: ", n)
    T = Float64
    x = rand(T, n)
    y = rand(T, n)
    dx = 1
    dy = 1
    b = @benchmark for k in 1:1000 krylov_dot($n, $x, $dx, $y, $dy) end
    time_openblas1[i] = belapsed_time(b)
end
NMAX = Sys.CPU_THREADS
N = Int(NMAX / 2)
BLAS.set_num_threads(N)
for i = 1:p
    n = 10^i
    println("Dimension: ", n)
    T = Float64
    x = rand(T, n)
    y = rand(T, n)
    dx = 1
    dy = 1
    b = @benchmark for k in 1:1000 krylov_dot($n, $x, $dx, $y, $dy) end
    time_openblas2[i] = belapsed_time(b)
end
using MKL
for i = 1:p
    n = 10^i
    println("Dimension: ", n)
    T = Float64
    x = rand(T, n)
    y = rand(T, n)
    dx = 1
    dy = 1
    b = @benchmark for k in 1:1000 krylov_dot($n, $x, $dx, $y, $dy) end
    time_mkl[i] = belapsed_time(b)
end
for i = 1:p
    n = 10^i
    println("Dimension: ", n)
    T = Float64
    x = rand(T, n)
    y = rand(T, n)
    dx = 1
    dy = 1
    b = @benchmark for k in 1:1000 dot_julia($x, $y) end
    time_julia[i] = belapsed_time(b)
end
for i = 1:p
    n = 10^i
    println("Dimension: ", n)
    T = Float64
    x = rand(T, n)
    y = rand(T, n)
    dx = 1
    dy = 1
    b = @benchmark for k in 1:1000 dot_tturbo($x, $y) end
    time_tturbo[i] = belapsed_time(b)
end
using PrettyTables
dimension = [10^i for i=1:p];
time = hcat(dimension, time_openblas1, time_openblas2, time_mkl, time_julia, time_tturbo)
pretty_table(time ; header = ["Dimension", "OpenBLAS $NMAX threads", "OpenBLAS $N threads", "MKL", "Julia", "@tturbo"])
for i = 1:p
    v = @view(time[i, 2:end])
    v ./= minimum(v)
end
pretty_table(time ; header = ["Dimension", "OpenBLAS $NMAX threads", "OpenBLAS $N threads", "MKL", "Julia", "@tturbo"])
versioninfo()

I get (times are in nanoseconds for the 1000-call loop; the second table is relative to the fastest entry in each row):

┌───────────┬─────────────────────┬─────────────────────┬───────────┬───────────┬───────────┐
│ Dimension │ OpenBLAS 36 threads │ OpenBLAS 18 threads │ MKL │ Julia │ @tturbo │
├───────────┼─────────────────────┼─────────────────────┼───────────┼───────────┼───────────┤
│ 10.0 │ 8624.33 │ 8708.33 │ 14244.0 │ 8806.0 │ 9252.0 │
│ 100.0 │ 12004.0 │ 11990.0 │ 15962.0 │ 12749.0 │ 12477.0 │
│ 1000.0 │ 55191.0 │ 55050.0 │ 54769.0 │ 55958.0 │ 50141.0 │
│ 10000.0 │ 905247.0 │ 907152.0 │ 1.33786e6 │ 916908.0 │ 921683.0 │
│ 100000.0 │ 4.74124e6 │ 4.11024e6 │ 2.74366e6 │ 5.47155e7 │ 3.47289e6 │
│ 1.0e6 │ 9.28879e7 │ 1.85967e7 │ 1.6299e7 │ 6.25422e8 │ 6.52952e7 │
│ 1.0e7 │ 2.13776e9 │ 2.08145e9 │ 2.08633e9 │ 1.2057e10 │ 2.08374e9 │
└───────────┴─────────────────────┴─────────────────────┴───────────┴───────────┴───────────┘
┌───────────┬─────────────────────┬─────────────────────┬─────────┬─────────┬─────────┐
│ Dimension │ OpenBLAS 36 threads │ OpenBLAS 18 threads │ MKL │ Julia │ @tturbo │
├───────────┼─────────────────────┼─────────────────────┼─────────┼─────────┼─────────┤
│ 10.0 │ 1.0 │ 1.00974 │ 1.65161 │ 1.02106 │ 1.07278 │
│ 100.0 │ 1.00117 │ 1.0 │ 1.33128 │ 1.0633 │ 1.04062 │
│ 1000.0 │ 1.10072 │ 1.0979 │ 1.0923 │ 1.11601 │ 1.0 │
│ 10000.0 │ 1.0 │ 1.0021 │ 1.4779 │ 1.01288 │ 1.01816 │
│ 100000.0 │ 1.72807 │ 1.49809 │ 1.0 │ 19.9425 │ 1.26579 │
│ 1.0e6 │ 5.69899 │ 1.14097 │ 1.0 │ 38.3718 │ 4.00608 │
│ 1.0e7 │ 1.02705 │ 1.0 │ 1.00235 │ 5.79261 │ 1.0011 │
└───────────┴─────────────────────┴─────────────────────┴─────────┴─────────┴─────────┘
julia> versioninfo()
Julia Version 1.7.2-pre.0
Commit 3f024fd0ab* (2021-12-23 18:27 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
julia> using CPUSummary
julia> CPUSummary.cache_size(Val(2)) * CPUSummary.num_l2cache() / sizeof(Float64)
2.359296e6
Just under 2.36 million Float64 values fit in the combined L2 caches. Instead, I could make LV consider this in deciding how many cores to use... I should probably remove the loops in the above script.
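For reference, dropping the for k in 1:1000 wrappers gives per-call numbers directly, since BenchmarkTools already repeats each measurement; a sketch using dot_julia from the script above:

x = rand(Float64, 10^5)
y = rand(Float64, 10^5)
b = @benchmark dot_julia($x, $y)
BenchmarkTools.time(minimum(b))  # minimum time per call, in nanoseconds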
@amontoison what CPU are you using? Maybe part of it is that OpenBLAS is just exceedingly awful on Ryzen, while MKL is just slacking on non-Intel chips.
┌───────────┬─────────────────────┬─────────────────────┬───────────┬───────────┬───────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │ Julia │ @tturbo │
├───────────┼─────────────────────┼─────────────────────┼───────────┼───────────┼───────────┤
│ 10.0 │ 75100.0 │ 80600.0 │ 10800.0 │ 7050.0 │ 8733.33 │
│ 100.0 │ 68200.0 │ 67300.0 │ 20200.0 │ 10200.0 │ 13000.0 │
│ 1000.0 │ 117800.0 │ 121400.0 │ 124800.0 │ 60100.0 │ 60200.0 │
│ 10000.0 │ 1.1182e6 │ 1.1215e6 │ 749100.0 │ 1.0459e6 │ 642200.0 │
│ 100000.0 │ 4.62663e7 │ 9.06275e7 │ 2.5587e6 │ 1.21059e7 │ 2.0268e6 │
│ 1.0e6 │ 7.3046e7 │ 1.04869e8 │ 2.50255e7 │ 1.23583e8 │ 2.11222e7 │
│ 1.0e7 │ 3.53264e9 │ 3.69242e9 │ 3.88145e9 │ 4.29649e9 │ 3.70713e9 │
└───────────┴─────────────────────┴─────────────────────┴───────────┴───────────┴───────────┘
┌───────────┬─────────────────────┬─────────────────────┬─────────┬─────────┬─────────┐
│ Dimension │ OpenBLAS 32 threads │ OpenBLAS 16 threads │ MKL │ Julia │ @tturbo │
├───────────┼─────────────────────┼─────────────────────┼─────────┼─────────┼─────────┤
│ 10.0 │ 10.6525 │ 11.4326 │ 1.53191 │ 1.0 │ 1.23877 │
│ 100.0 │ 6.68627 │ 6.59804 │ 1.98039 │ 1.0 │ 1.27451 │
│ 1000.0 │ 1.96007 │ 2.01997 │ 2.07654 │ 1.0 │ 1.00166 │
│ 10000.0 │ 1.7412 │ 1.74634 │ 1.16646 │ 1.62862 │ 1.0 │
│ 100000.0 │ 22.8273 │ 44.7146 │ 1.26243 │ 5.97291 │ 1.0 │
│ 1.0e6 │ 3.45826 │ 4.96486 │ 1.1848 │ 5.85086 │ 1.0 │
│ 1.0e7 │ 1.0 │ 1.04523 │ 1.09874 │ 1.21623 │ 1.04939 │
└───────────┴─────────────────────┴─────────────────────┴─────────┴─────────┴─────────┘
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Environment:
JULIA_EDITOR = "C:\Users\accou\AppData\Local\atom\app-1.58.0\atom.exe" -a
JULIA_NUM_THREADS = 32
If you were on Julia 1.6, this would be the case. Julia 1.6.5 used OpenBLAS 0.3.10. If you don't mind building Julia from source, you could try something like this in your Make.user:
USE_BINARYBUILDER_OPENBLAS=0
OPENBLAS_TARGET_ARCH=HASWELL
to see if pretending to be some other CPU makes OpenBLAS better.
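A quicker check than rebuilding, on Julia 1.7's libblastrampoline, is to look at which BLAS build is actually loaded (and confirm that using MKL swapped it in):

using LinearAlgebra
BLAS.get_config()       # lists the loaded LBT backends, e.g. libopenblas64_ or MKL
BLAS.get_num_threads()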
@ChrisRackauckas I have an Intel i5-6200U (Skylake architecture)
The SpMVs of SuiteSparse:GraphBLAS can be 3+ times faster than MKLSparse (and probably even better on Ryzen, although that's untested). What do I need to do to use the SpMV from SuiteSparseGraphBLAS.jl in Krylov? I'd prefer not to commit type piracy like MKLSparse.jl does unless I have to.
We could change all …
@Wimmerer …
If you open a PR, I can easily use our …
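For reference, a rough sketch of the SpMV in question, assuming SuiteSparseGraphBLAS.jl's GBMatrix can be built from a SparseMatrixCSC and that it overloads mul! (both assumptions, not checked against the package version discussed here):

using SparseArrays, LinearAlgebra, SuiteSparseGraphBLAS

A   = sprand(10_000, 10_000, 1e-3)
Agb = GBMatrix(A)        # copy into the GraphBLAS storage format
x   = rand(10_000)
y   = similar(x)
mul!(y, Agb, x)          # SpMV executed by SuiteSparse:GraphBLAS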
Can you print the # of Krylov iterations? I'd like to make sure all the solvers are doing the same # of iterations.
#1571 is about OrdinaryDiffEq passing the verbose argument to LinearSolve to print the Krylov solve stats.
You can check sol.destats at the end to see the number of …
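For example (destats was the field name at the time; counter names such as nf, njacs, and nsolve come from DiffEqBase's DEStats and may differ across versions):

sol = solve(prob_ode_brusselator_2d, Rodas4(linsolve = KrylovJL_GMRES()), save_everystep = false)
sol.destats   # counts of f calls (nf), Jacobians (njacs), linear solves (nsolve), ...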
MWE of preconditioned Newton-Krylov solves with OrdinaryDiffEq and Sundials: …
Sundials.jl is still slightly ahead of OrdinaryDiffEq here, but within 2x once preconditioners are applied (and preconditioners are a hell of a lot easier to define in OrdinaryDiffEq, so in some sense it's ahead). The non-preconditioned case still shows a ~3.5x difference though, which points to some missing optimizations in Krylov.jl, though even the preconditioned path could use some optimizations from what I can tell.
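For comparison, the non-preconditioned Sundials baseline referenced here is along these lines (a sketch; the actual MWE and the preconditioned Sundials setup, which goes through CVODE's preconditioner setup/solve callbacks, are not reproduced):

using Sundials
@time solve(prob_ode_brusselator_2d, CVODE_BDF(linear_solver = :GMRES), save_everystep = false);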