Processor requirements for LAPACK #575
Comments
The underflow itself is not the true problem. After underflow, the algorithm switches to CABS1, which is less prone to underflow. The problem this creates is that TEMP will not be exactly unitary, leading to roundoff in Z. A possible solution is to prescale using CABS1 and then correct using ABS (because of the first scaling, ABS should no longer overflow). (I don't get the underflow on my machine, so I can't test it for you.)
I think the tests are definitely designed to succeed for the set of popular FORTRAN compilers, because that is simply how they are run. Predicting under/overflow is incredibly hard. At least in my case, these subroutines are designed by simply testing them (using the popular compilers) thoroughly and fixing any over/underflow we find. |
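The prescaling idea above can be sketched in Python, whose floats are IEEE 64-bit doubles with no extended-precision help. This is only an illustration under assumptions, not the actual ZLAHQR code or the fix that was merged: the helper names `cabs1`, `naive_abs` and `normalize` are hypothetical, and `naive_abs` stands in for an ABS that squares its arguments in 64-bit registers.

```python
import math

def cabs1(z):
    """|Re z| + |Im z|, the cheap magnitude LAPACK calls CABS1."""
    return abs(z.real) + abs(z.imag)

def naive_abs(z):
    """Magnitude via direct squaring; the squares can underflow or overflow."""
    return math.sqrt(z.real * z.real + z.imag * z.imag)

def normalize(z):
    """Return (u, r) with z = r * u and |u| = 1 up to rounding.

    Step 1: prescale by CABS1 so both components have magnitude at most 1.
    Step 2: correct with the naive ABS, which is now safe because the
            prescaled value has modulus between 1/sqrt(2) and 1.
    Assumes z is finite and cabs1(z) itself does not overflow.
    """
    s = cabs1(z)
    if s == 0.0:
        return z, 0.0
    w = complex(z.real / s, z.imag / s)
    r = naive_abs(w)
    return complex(w.real / r, w.imag / r), s * r

z = complex(3e-200, 4e-200)
u, r = normalize(z)
```

For this `z`, `naive_abs(z)` returns zero because both squares underflow, while `normalize(z)` still yields a unit-modulus `u` and a usable magnitude `r`.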
Thank you! This is very helpful.
... this iteration completes successfully (even when using SSE registers for ABS()).
The test suite is of tremendous help! My rough estimation is that far less than 1% of the tests are affected by this or similar overflow issues (when using our compiler). Making the tests even more robust against under-/overflow could help to bring LAPACK to more platforms. Our (failed) attempt above is just one example, which clearly shows that we would hardly be able to come up with a fix on our side, though. Before opening multiple related issues I would like to start a discussion on whether or not there is interest in such a journey and what would be a good approach. |
Thanks for the improvement @hokb and @thijssteel! Should I write a PR with the modifications or are you willing to do that, @hokb? |
Given my limited experience with the project I would appreciate your effort and the chance to take your PR as a guideline for potential future PRs from us... (if that's ok?) |
Hi @hokb,
I am not sure anything is specified anywhere.
Bold statement: If all computations are done using IEEE 64-bit arithmetic, then LAPACK should work. LAPACK does not expect 80-bit registers to come help its computation at any time. The algorithms are designed with 64-bit arithmetic in mind. Now, as mentioned by @thijssteel, LAPACK is tested with various compilers/architectures, and these compilers/architectures use 80-bit registers at times. So we might think our algorithms only need 64 bits all along when, in effect, they do require 80-bit registers. We have not done anything systematic to go after these issues. In general, we are happy enough when the algorithms pass the test suite, and, if there is some help from 80-bit registers, so be it.
Oh my. 1%? That is a scary large number. The tests exercise the underflow and overflow regions heavily, so they can be expected to trigger this issue much more often than users' code, but still.
Portability to more platforms is one interest indeed. Another interest is extended precision with packages such as GMP where, as I understand, the precision is fixed throughout the computation. (So for example you are 256-bit throughout, and there is not a 300-bit register to come help you.)
Yes. We are interested. We can only do so much though. And we have a lot on our plates. So maybe we take this one issue at a time, and we see how far we go. In any case, posting issues on GitHub is always a good idea. It raises awareness of the problem, and it helps gather help and ideas to fix the issues. I am happy we go down this path, but I would recommend taking it easy. Maybe, for gfortran, we should compile with the flags |
Sure! Please, see #577. |
@weslleyspereira Awesome! I am still checking whether the same applies to CLAHQR. Will post my result ASAP (tomorrow) |
Hello @langou !
Nice! I suppose, by 'work' we mean: when fed with data 'in a certain range' it will not overflow due to a given register size?
Sounds very reasonable!
Well, likely it is 'far less' than that ;)
Sounds interesting, but I cannot comment on this, since I lack experience with such fixed-precision attempts.
I am still unsure what a good general approach would be. Bear with me if my understanding is too naive. But doesn't over-/underflow always depend on both input data and algorithm? So instead of flooding the code with new conditionals that test for over-/underflow, plus new code to recover from it, we might instead decrease the 'allowed range' for the input data? I don't have the necessary insight into the effort required for either approach, though. So I cannot judge what would be more feasible.
Good. We will file issues as we go. I understand that it will be a challenge to come up with a fix without being able to reproduce an underflow. So, what information can we provide to make the issue more clear? Does the path down to the concrete underflow help? I.e.: providing iteration counts, current values of locals together with file names etc.?
Same thing here! :) |
One outcome of #577 is that LAPACK relies on the FORTRAN compiler to implement reasonably robust (under- / overflow) complex division and ABS(). I wonder if we should start maintaining a document, collecting such and similar requirements? They will be equally important and useful for anyone wanting to use LAPACK with other / new compilers, for compiler builders and in order to transfer parts or all of LAPACK algorithms to other languages? |
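To make the requirement on ABS() concrete, here is a small Python sketch of what "reasonably robust" could mean for the complex magnitude. It is an illustration only: `naive_abs` and `scaled_abs` are hypothetical names, and the scaling shown is one textbook technique, not necessarily what any particular Fortran compiler emits.

```python
import math

def naive_abs(re, im):
    """sqrt(re^2 + im^2) computed directly in 64-bit arithmetic.

    The squares leave the double range long before the result does.
    """
    return math.sqrt(re * re + im * im)

def scaled_abs(re, im):
    """Magnitude with the larger component factored out.

    The ratio small / big is at most 1, so its square can neither
    overflow nor produce a harmful underflow. Assumes finite inputs.
    """
    a, b = abs(re), abs(im)
    big, small = max(a, b), min(a, b)
    if big == 0.0:
        return 0.0
    ratio = small / big
    return big * math.sqrt(1.0 + ratio * ratio)

tiny = (naive_abs(3e-200, 4e-200), scaled_abs(3e-200, 4e-200))
huge = (naive_abs(3e200, 4e200), scaled_abs(3e200, 4e200))
```

In both pairs the true magnitude is well inside the double range, yet only the scaled variant returns it; the naive variant returns zero for the tiny input and infinity for the huge one.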
Sure! It will be good to have this information well documented. To begin with, I spent some time tracking (maybe) all divisions in the files
I found a total of 53 files. See the attached file: complexDivisionFound.code-search
|
Yes, it should when using GCC but this flag should also be set by default on x86-64. The documentation excerpt below is for GCC 11 but much older GCC versions should exhibit the same behavior. Using the GNU Compiler Collection (GCC): 3.19.59 x86 Options
|
edit: resolved #577 (comment) |
I tried |
@hokb, could you reproduce the overflow issues you mentioned in #575 (comment) with GCC using SSE flags? Can you help me with that? |
@weslleyspereira I haven't even tried GCC. All I have access to / a running setup for is ifort on Windows. It would take some days for me to get GCC up and running via cygwin to test (especially from my current holiday hotel room ... :| ) Let me know if you need me to take this challenge, though ! |
I don't use Windows, but I do have it here. I will start by testing LAPACK with ifort on my Ubuntu and see what happens. Enjoy the holiday! |
I am coming back to this issue. Sorry for the delay. @hokb, did you try to compile/test with ifort using the flag '-xSSE2' ? Using the godbolt website, I noticed that:
I ran the LAPACK tests with ifort on my machine. All tests pass using:
|
thanks for the follow-up! It is on my TODO list. Will try & report ASAP (likely after Aug-21). |
Some further investigation on the routines that use the intrinsic Fortran ABS operation for complex numbers: I tracked
For that, I used two REGEX expressions in the Visual Studio Code:
* Note that, potentially, we would have 60 more single-precision routines that use the intrinsic complex ABS. |
@weslleyspereira thanks for your patience. I tried to get back to this today. My attempt to test ifort with /xSSE2 flag was unsuccessful:
Also, in the list of supported options I was not able to find any flag that forces ifort to always use SSE / AVX registers (?). Do you (or somebody reading) know which option I could try? IMO the most promising option would be /Qaxcode. According to the docs, it (emphasis is mine) ...
It looks reasonable that such a flag could only produce alternative code paths, so that general compatibility would not be compromised? Compiling the LAPACK lib and the tests with /QaxAVX yields no errors. All tests are passing, too, while running significantly slower than without /QaxAVX. I am still wondering how we could get value out of this. The goal (I guess) would be to ensure that the tests never require more robustness than what is provided by the implementation of complex division and complex magnitude (and potentially other operations), regardless of whether their implementation is straightforward, using x87 registers, or uses more sophisticated algorithms on SSE registers. But I don't see a reliable way to make sure it actually prefers SSE over x87, always... ? |
Hi @hokb, I think I can reproduce some precision issues for complex data. I tried ifort (IFORT) 2021.3.0 20210609 on my Ubuntu 18.04.5 LTS with the flags From the ifort documentation:
I think the flag So, by default, I think we should expect some compilers will use x87 instructions to improve precision. |
is not available on Windows. With the flag
my Release/x64 build looks good for non-complex types, as expected:
In xeigtstc < nep it got stuck for more than 1h with full core activity. I have tried to build with debug info but this would make all tests fail here. So, unfortunately, I have no more details about the specific place of failure.
It looks like the default is -no-complex-limited-range: by default the compiler inserts robust implementations at the price of a few more instructions. Since the LAPACK tests use more than a "limited range" of precision for testing, it might be a useful recommendation for LAPACK users not to use -complex-limited-range with LAPACK (or its tests)? Naturally, it would be very hard (or impossible within reasonable effort) to define an exact range for floating-point values / precision and to guarantee that all functions give reasonable results within that range. What we are dealing with here is trying to push the allowed range to the maximum possible, right? Still, without even knowing any exact number. What the tests do is mark a lower limit of precision / value range within which the LAPACK functions do what we expect, given that the underlying technologies (compiler, processor(s)) don't deviate too much from the 'common setup'. It all remains pretty fuzzy. But my impression from what I have learned over the past few months is that this inaccuracy is inevitable for the time being, mostly because there is no specification like IEEE 754 that covers complex numbers. |
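The trade-off described here can be illustrated with complex division. The sketch below compares the textbook formula, which is presumably the kind of basic algebraic expansion a limited-range option permits, with Smith's algorithm, one well-known more robust alternative. Both function names are hypothetical, and nothing here claims to reproduce what ifort or gfortran actually generate.

```python
import math

def naive_div(a, b, c, d):
    """(a + bi) / (c + di) by the textbook formula.

    The denominator c*c + d*d overflows or underflows even when the
    quotient itself is perfectly representable.
    """
    den = c * c + d * d
    return (a * c + b * d) / den, (b * c - a * d) / den

def smith_div(a, b, c, d):
    """(a + bi) / (c + di) using Smith's algorithm.

    Dividing through by the larger of |c| and |d| keeps the
    intermediate ratio at most 1 in magnitude. Assumes finite inputs
    and a nonzero divisor.
    """
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return (a + b * r) / den, (b - a * r) / den
    r = c / d
    den = c * r + d
    return (a * r + b) / den, (b * r - a) / den

# z / z for a large z: the exact answer is 1 + 0i.
naive_q = naive_div(1e200, 1e200, 1e200, 1e200)
smith_q = smith_div(1e200, 1e200, 1e200, 1e200)
```

The few extra operations in `smith_div` (a comparison and one more division) are the price of robustness mentioned above; here `naive_q` degenerates to NaN while `smith_q` is the expected quotient. For very small inputs the naive denominator underflows to zero instead, which in Python surfaces as a `ZeroDivisionError`.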
Hi.
Just to clarify here: I meant that I could compile LAPACK with flag
I agree.
I think we also want to test the precision and robustness of each routine separately. For example, test that a given routine does not overflow/underflow when the final output is in the floating-point representable range. We recently included safe scaling, e.g., #527, #514 and #594, which improves precision and enlarges the range of allowed values for a correct output. Since the issue we are facing is more related to the processor than to LAPACK itself, I prepared 2 programs to test the intrinsic Fortran complex division and ABS: |
Great! A very helpful test, @weslleyspereira ! I just ran the zdiv and zabs tests on my Windows machine, using ifort 2021, Release mode, x64, with varying flags: /Od vs. /O3 and w/o '/Qcomplex_limited_range':
Without optimizations either the default implementations are robust enough or x87 registers come to help again (likely). This test nicely demonstrates the trade-off between precision and performance. It may also help locate the borderline where floating-point precision starts to cause trouble for common setups. |
I improved the tests a bit. They now cover ranges of values instead of only the extremes. |
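The Fortran test programs themselves are not reproduced in this thread. As a rough idea of what a range-covering probe can look like, here is a Python sketch that sweeps powers of two and checks the complex magnitude and division supplied by the caller. Everything in it is an assumption for illustration: the function name `probe`, the choice of test values z = 2**e * (1 + i), and the tolerance.

```python
import math

def probe(abs_fn, div_fn, exponents):
    """Return the exponents e at which z = 2**e * (1 + i) is mishandled.

    abs_fn(z) should return |z| = sqrt(2) * 2**e and div_fn(z, z)
    should return 1 + 0i. Exceptions raised by either count as failures.
    """
    failures = []
    for e in exponents:
        x = math.ldexp(1.0, e)           # exactly 2**e
        z = complex(x, x)
        expected = math.sqrt(2.0) * x    # representable for |e| <= 1022
        try:
            mag_ok = math.isclose(abs_fn(z), expected, rel_tol=1e-14)
            q = div_fn(z, z)
            div_ok = math.isclose(q.real, 1.0, rel_tol=1e-14) and abs(q.imag) <= 1e-14
        except (OverflowError, ZeroDivisionError):
            mag_ok = div_ok = False
        if not (mag_ok and div_ok):
            failures.append(e)
    return failures

# Probe the runtime's own complex abs and division over the normal range.
builtin_failures = probe(abs, lambda a, b: a / b, range(-1022, 1023))
```

Passing a deliberately simple magnitude or division function instead of the built-ins shows at which exponents that implementation starts to break, which is the borderline discussed above.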
Some tests on my Ubuntu 18.04.5 LTS
Test programs:
Compilers:
gfortran
Here I just changed the optimization flags -O0 and -O3.
ABS
Complex division
ifort
I played with the options:
ABS
Complex division
My conclusions
|
|
I think we may close this issue. We do require that some compiler intrinsic operations are robust enough so we can build some algorithms. #623 introduced some tests that evidence what we expect from the compilers. We may create other tests like those in future. I will close the issue. Let me know if there is still something missing from this issue. Thanks! |
We are working on a translation of LAPACK to .NET. We wrote a FORTRAN compiler which successfully translates all of LAPACK, including all tests. On real data types (almost) all tests pass. On complex data we are seeing a few precision issues, still.
Example:
XEIGTSTZ < zec.in - fails due to underflow in ZLAHQR.
Steps to reproduce: ZGET37 -> knt == 31, ZHSEQR -> ZLAHQR -> at the end of the second QR step (ITS == 2) the following code causes underflow (on certain registers, see below)
Our compiler targets the .NET CLR. Its JIT decides to use SSE registers for ABS(TEMP), which leads to underflow in the intermediate calculation of the magnitude. Ifort (as another example) uses x87 floating-point registers in this situation and hence does not underflow (because of their larger width: 80 bits). I am trying to get a clear(er) picture of what precision / number range LAPACK requires from the compiler / processor at runtime.
Are all tests for double precision designed to require at least 64-bit registers? Or are they designed to succeed for the set of popular FORTRAN compilers available today? (In the first case, the above issue (and similar others) may require attention. Should I file an issue for them?)
I looked for some specification but couldn't find it yet. Any link would also be appreciated. Thanks in advance!