Incorrect result with cblas_dgemv vs reference netlib and other libraries #4324
Comments
Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV BLAS kernel.
At first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building for TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result.
It is one bit of precision off, a very normal occurrence when computing in a different order.
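(As a standalone aside, not code from this thread: when large terms cancel, evaluation order alone decides whether a tiny residual survives or rounds away to an exact 0.0.)

```cpp
#include <cstdio>

int main() {
    double a = 1.0, b = 1.0e-16, c = -1.0;
    double order1 = (a + b) + c;  // 1.0 + 1e-16 rounds back to 1.0, so this is exactly 0.0
    double order2 = (a + c) + b;  // the cancellation happens first, so this is 1e-16
    std::printf("order1 = %.17g, order2 = %.17g\n", order1, order2);
    return 0;
}
```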
I don't think that's right. If you do:

```cpp
auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES" : "NO");
```

then it will return YES.
You abuse machine rounding precision 32 times (or 16 with FMA), yet discount only 5 bits in your check. It is not some magic symbolic computation soup that gives an accurate polynomial result each time.
I do not expect an identical result, but this result is exactly the starting value of c_blas[15].
It rounds to output precision to store in a register after every 1 or 2 FLOPs, 50% up and 50% down, and so the lottery continues until the end of the computation. Yes, workload splitting affects the result.
Just a guess that Intel uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as the input grows. OpenBLAS always uses vector code and switches to all CPUs at one point.
As far as I can tell, all terms cancel with the "right" evaluation order and y[15] evaluates to "exact" zero within the limits of precision. As this is added to what the c_blas array initially contained (the "beta times y" of the GEMV equation, beta being one in your case), you see no change. There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in test suites during the almost ten years this code has been in place).
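In other words (a sketch of that argument, not OpenBLAS's actual kernel code, assuming row-major storage and the reporter's alpha = beta = 1):

```cpp
// dgemv computes y := alpha*A*x + beta*y, so whatever the row-15 dot product
// rounds to is *added* to the value already stored in y[15].
void naive_dgemv_row(const double* A, const double* x, double* y,
                     int n, int row, double alpha, double beta) {
    double dot = 0.0;
    for (int j = 0; j < n; ++j)
        dot += A[row * n + j] * x[j];      // if every term cancels, dot == 0.0 exactly
    y[row] = alpha * dot + beta * y[row];  // with beta = 1, y[row] then keeps its old value
}
```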
I can't understand this conclusion, given that I can reproduce this with core types that don't use AVX2. This is on Ubuntu 22.04, and I'm testing with:

OPENBLAS_CORETYPE=Nehalem (No AVX2)
OPENBLAS_CORETYPE=SandyBridge (No AVX2)
OPENBLAS_CORETYPE=Haswell
OPENBLAS_CORETYPE=SkylakeX (AVX512)

In my testing, I'm seeing that neither AVX2 nor AVX matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that's well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX512, which gives a different result for the last value. Please help me understand how the AVX2 microkernel is the issue here.
It will fall back to older compute kernels if you do not have AVX2 in CPUID.
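For what it's worth, one way to see which core type a DYNAMIC_ARCH build actually selected (or fell back to) is to ask the library itself; a small check along these lines, using the openblas_get_config/openblas_get_corename helpers exported by OpenBLAS, can be run under each OPENBLAS_CORETYPE setting:

```cpp
// Prints the build configuration and the core type OpenBLAS ended up using.
#include <cstdio>

extern "C" {
char* openblas_get_config(void);    // exported by OpenBLAS
char* openblas_get_corename(void);  // name of the selected core/kernel set
}

int main() {
    std::printf("config: %s\ncore:   %s\n",
                openblas_get_config(), openblas_get_corename());
    return 0;
}
```

Running it as, e.g., `OPENBLAS_CORETYPE=Haswell ./a.out` shows which core type the library actually ended up using.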
The issue here is actually really simple: OpenBLAS's gemv isn't using FMA, which would also be faster. I have some Julia code demonstrating this.
I don't think it is that simple, unless you meant to write "the reference BLAS isn't using FMA", which is trivially true. The OpenBLAS GEMV kernels for the CPUs mentioned here all use FMA instructions; the question is whether they could/should be rewritten to minimize the difference seen in this particular case.
Hmm. If the OpenBLAS GEMV is using FMA, what order is it running in to match the results of the naive loop without FMA? I saw that the results matched the obvious algorithm and assumed from there.
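To make the distinction concrete (an illustration of single versus double rounding, not OpenBLAS code; compile with -ffp-contract=off so the compiler does not itself fuse the first expression):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // a*a = 1 + 2^-26 + 2^-54 exactly; the 2^-54 tail does not fit in a double.
    double a = 1.0 + std::ldexp(1.0, -27);  // 1 + 2^-27, exactly representable
    double c = 1.0 + std::ldexp(1.0, -26);  // the value a*a rounds to

    double plain = a * a - c;           // product rounded first -> exactly 0.0
    double fused = std::fma(a, a, -c);  // single rounding -> 2^-54, the true residual
    std::printf("plain = %.17g\nfused = %.17g\n", plain, fused);
    return 0;
}
```

A handful of such roundings across a 16-term dot product is the same order of effect being discussed here.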
We recently switched to testing OpenBLAS on a project and are noticing some test case failures due to a matrix multiplication operation returning an incorrect result.

This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs:
Reproduction
I have attached a minimal reproducible example (in C++) showing the problem.
Reproduction Code
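The attached code is not reproduced above, so here is a minimal sketch of an equivalent reproduction. The layout assumed for reproduction.txt (raw doubles, row-major 16x16 A followed by the 16-element b), the build command in the comment, and the alpha = beta = 1 call are assumptions based on the description in this thread, not the reporter's exact code:

```cpp
// Plausible build: g++ -O2 repro.cpp -lopenblas -o repro
#include <cblas.h>
#include <cstdio>

int main() {
    constexpr int n = 16;
    double A[n * n], b[n];

    // Assumed layout of reproduction.txt: 256 doubles of A, then 16 doubles of b.
    std::FILE* f = std::fopen("reproduction.txt", "rb");
    if (!f || std::fread(A, sizeof(double), n * n, f) != static_cast<std::size_t>(n * n)
           || std::fread(b, sizeof(double), n, f) != static_cast<std::size_t>(n)) {
        std::fprintf(stderr, "failed to read reproduction.txt\n");
        return 1;
    }
    std::fclose(f);

    // Naive reference: c_naive = A * b, accumulated left to right.
    double c_naive[n] = {};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            c_naive[i] += A[i * n + j] * b[j];

    // BLAS: c_blas = 1.0 * A * b + 1.0 * c_blas, with c_blas starting at zero.
    double c_blas[n] = {};
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, b, 1, 1.0, c_blas, 1);

    for (int i = 0; i < n; ++i)
        std::printf("%2d: naive % .17e   blas % .17e\n", i, c_naive[i], c_blas[i]);
    return 0;
}
```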
Compile this code with:
And observe the following output:
OpenBLAS result
It is worth noting that only the last value is different outside of acceptable numerical precision, and that every other value passes within 1e-16. Furthermore, a value of exactly 0.0 is, in itself, suspicious, as there's no real circumstance the value could be that.

Change the compile command to:
and observe this result:
netlib result
Here is the binary file that contains a 16x16 matrix and a 16x1 vector:
(NOTE: this is a binary data file; the extension was changed to make GitHub happy.)
reproduction.txt
Other Notes
We have done extensive testing in other BLAS-like environments, which get a result close to the expected -4e-16 and pass our test. Both MATLAB (2023a) and numpy (1.26 w/ MKL) return a result very close to what we expect, and pass our test. And, obviously, our naive matrix multiplication in the reproduction code gives a result that passes as well.

The matrix in question is not overly ill-conditioned; it has a condition number of ~10.