Append 64_ suffix to all F77 exported routines #4463

Conversation
Append 64_ suffix to all F77 exported routines.

Resolves JuliaLinearAlgebra/libblastrampoline#36.

Currently only the level-1/2/3 S/D/C/Z routines are exported with the `64_` suffix, while libblastrampoline identifies the BLAS integer suffix by probing `isamax`. This causes the issue above. The patch here appends the suffix to all F77-exported routines, which should resolve the issue and get lbt + libblis working together.
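To illustrate what goes wrong, here is a minimal sketch (not LBT's actual code) of probing a BLAS library for the ILP64-suffixed symbol, using only standard Libdl calls; the library path is an assumption:

```julia
using Libdl

# Hypothetical illustration of the suffix probe: LBT detects the integer
# convention by looking up a known routine such as `isamax`. Before this
# patch only the S/D/C/Z-prefixed routines carried the `64_` suffix, so
# `isamax_64_` was missing from libblis and the detection failed.
lib = dlopen("./libblis.so")        # path is an assumption
for sym in (:isamax_, :isamax_64_)
    ptr = dlsym_e(lib, sym)         # returns C_NULL if the symbol is absent
    println(sym, " => ", ptr == C_NULL ? "missing" : "exported")
end
```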
---

Nice:

```
julia> using LinearAlgebra

julia> peakflops(5000)
1.606979535336095e11

julia> BLAS.lbt_forward("./libblis.so", clear=true)
155

julia> peakflops(5000)
3.5732036026861916e10
```

Edit: I realised after posting that BLIS is actually slower here, I didn't notice the different order of magnitude 😬 (≈161 GFLOPS with OpenBLAS vs ≈36 GFLOPS with BLIS). Also for other operations OpenBLAS is faster:

```
julia> using BenchmarkTools

julia> LinearAlgebra.__init__()

julia> @benchmark BLAS.axpy!(a, x, y) setup=(T=Float32; N=Int(1e6); a=randn(T); x=randn(T, N); y=randn(T, N)) evals=1
BenchmarkTools.Trial: 208 samples with 1 evaluation.
 Range (min … max):  123.510 μs …   4.651 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.543 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   262.716 μs ± 386.684 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█▃▁
  ████▆▄▃▃▄▃▂▄▃▄▃▃▃▄▂▂▃▁▁▂▁▁▁▂▃▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  124 μs          Histogram: frequency by time          1.15 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> BLAS.lbt_forward("./libblis.so", clear=true)
155

julia> @benchmark BLAS.axpy!(a, x, y) setup=(T=Float32; N=Int(1e6); a=randn(T); x=randn(T, N); y=randn(T, N)) evals=1
BenchmarkTools.Trial: 412 samples with 1 evaluation.
 Range (min … max):  330.814 μs … 761.048 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     484.763 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   476.162 μs ±  86.159 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▂ ▁          ▇▆▅▆ ▄  ▁▁   ▁▁
  ████▇█▅▆▄▃▃▁▄▃▃▃▁▃▄▅▆▅█████▆▆██████████▅▅▆▅▅▄▃▁▁▃▃▃▃▁▁▃▁▁▃▁▁▃ ▄
  331 μs          Histogram: frequency by time           700 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Does BLIS do runtime detection of features on all architectures? In particular I'm interested in SVE for A64FX; I saw you worked on that.
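Aside: one way to double-check which backend LBT is forwarding to after an `lbt_forward` call is `BLAS.lbt_get_config()`, available since Julia 1.7. A small sketch; the printed configuration depends on the local setup, so output is omitted:

```julia
julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libblis.so", clear=true);  # forward symbols to BLIS

julia> BLAS.lbt_get_config()  # lists the libraries LBT currently forwards to
```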
---

@giordano Thanks for the info. Right now BLIS has no specialized optimizations for level-2 BLAS operations; I guess that's where the slowdown comes from. SVE is not compiled in for now due to …
---

Ok. We do have support for multiple microarchitectures, but we still need to flesh out some details, and I need to fix some compiler flags for aarch64. With JuliaLang/julia#44194 we'll eventually be able to target A64FX, too. While we're here: do you happen to know whether A64FX requires AES? 🙂
---

Wait, would BLIS build a "fat" library for all the targets into a single file, like OpenBLAS does? Because in that case it's ok to disable the check for …
---

I'm afraid I do not know about this.

Exactly. Those asm compiled with …
---

Ok, then you can add …
---

Excellent! Do I need to somehow stick to GCC 8 for max compatibility? Or can I push GCC to 10 for …? Both approaches would work for SVE processors, though.
---

The main compatibility concern we usually have is when compiling C++ code, which would end up requiring a too-new libstdc++ at runtime. However, I don't see symbols tagged with GLIBCXX in libblis:

```
% nm libblis.so | grep GLIBCXX
%
```

so I think it should be ok to use GCC 10 for this. We also have GCC 11.
---

BLIS uses C only. Upgrading to GCC 10 would save me some source-screening work, then. Thanks.
---

Seen when compiling for aarch64-apple 😉. Nice work!