Fixes underflow reported by @hokb in PR #575 thanks to @thijssteel. #577
Codecov Report
@@ Coverage Diff @@
## master #577 +/- ##
=======================================
Coverage 82.37% 82.37%
=======================================
Files 1894 1894
Lines 190679 190679
=======================================
Hits 157065 157065
Misses 33614 33614
In the case of ZLAHQR test ZEC runs fine with these changes on ifort. But here, it does not even enter lines 544 ... 548. On our compiler (where RTEMP underflows) it does recover successfully and completes this test iteration within zget37 (knt == 31). However, in subsequent iterations (knt == 49) the result from recovering from underflow in ZLAHQR causes subsequent steps in zget37 to fail (when computing condition numbers -> vmax becomes greater than the threshold). So, maybe the recovering is not yet sufficient or there is some other problem somewhere.
CLAHQR is also fun. Here, and when using 32 bit registers for ABS() it only partially underflows (only the real part when doing REAL(TEMP) * REAL(TEMP)). Hence, the outcome of ABS is wrong - but not RZERO. So, execution does not even enter line 543. We could test for both, the real part and the imaginary part individually. But somehow this path feels not optimal to me...
Can you please indicate in which line the test fails? Is it inside ZTRSNA?
Yeah... I don't feel like it is optimal either. Another solution to this problem would be to use the BLAS routines SCNRM2 and DZNRM2 instead of ABS. These BLAS routines are guaranteed to not overflow or underflow. I think that, if we use them, we won't need to use the workaround proposed by @thijssteel in #575 (comment). Can you test CLAHQR with
RTEMP = SCNRM2( 1, TEMP, 1 )
H( I, I-1 ) = RTEMP
TEMP = TEMP / RTEMP
?
It returns from ZTRSNA in line 408 with INFO = 0. But the comparison in line 416 fails for I = 4 (this test iteration is n = 5).
Bam! This makes all of CEC on XEIGTSTC pass. Genius!
*
*        Ensure that H(I,I-1) is real.
*
         TEMP = H( I, I-1 )
         IF( AIMAG( TEMP ).NE.RZERO ) THEN
*           OLD: RTEMP = ABS( TEMP )
            RTEMP = SCNRM2(1, TEMP, 1)
            H( I, I-1 ) = RTEMP
            TEMP = TEMP / RTEMP
            IF( I2.GT.I )
Tried DZNRM2 in ZLAHQR. But ZEC tests still fail at the same spot as before. It might be an unrelated issue, though. I figured out that s and stmp contain NaN in elements 4 and 5... I will try to track it down and report back.
Maybe instead of
RTEMP = SCNRM2(1, TEMP, 1)
something like
RTEMP = SLAPY2( REAL(TEMP), AIMAG(TEMP) )
Ditto for D/Z. Note that SLAPY2 should not be better in accuracy than SCNRM2. The suggestion is just a matter of preference. For a complex vector of length 1, I would prefer to call SLAPY2 rather than SCNRM2. To be fair, I would prefer to call the FORTRAN intrinsic ABS. It is frustrating that the FORTRAN intrinsic ABS seems to be the problem and is unnecessarily underflowing.
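To make the difference between a plain ABS and a scaled norm concrete, here is a minimal Python sketch in IEEE 754 double precision. It is not the LAPACK source of SLAPY2 or the BLAS source of SCNRM2; `naive_abs` and `scaled_abs` are hypothetical names, and the sketch omits the NaN and overflow handling a library routine would need. It only illustrates the scaling idea: divide by the larger component before squaring.

```python
import math

def naive_abs(re, im):
    # Textbook modulus: each part is squared first, so tiny inputs
    # can underflow to zero before the square root is taken.
    return math.sqrt(re * re + im * im)

def scaled_abs(re, im):
    # Scale by the larger magnitude so that only a ratio <= 1 is squared.
    big = max(abs(re), abs(im))
    small = min(abs(re), abs(im))
    if big == 0.0:
        return 0.0
    ratio = small / big
    return big * math.sqrt(1.0 + ratio * ratio)

re, im = 3e-200, 4e-200
print(naive_abs(re, im))   # both squares underflow to zero
print(scaled_abs(re, im))  # close to 5e-200
print(math.hypot(re, im))  # standard-library reference value
```

With pure 64-bit arithmetic the naive version returns zero for these inputs, while the scaled version agrees with `math.hypot`. A compiler that evaluates the intermediate squares in wider registers would not show the underflow.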
Thank you! It will be very helpful if you can track down the bug.
The problem here is a new issue, caused by ZTRSV overflowing in complex division in line 302:
IF (NOUNIT) TEMP = TEMP/DCONJG(A(J,J)) ! <-- gives -Infinity
TEMP is: (-3.0390314364021117e-160, 3.1073592763072852e-160)
Note, we have been very careful to implement complex division in exactly the same (algorithmic) way as is used in ifort. (It is actually the most straightforward way.) So I am pretty sure that this would overflow in FORTRAN as well, if the same numbers were provided and the same register size used.
I will give this a try and report back (tomorrow).
As stated above, I am struggling to clearly identify the problem. Neither ABS nor the processor does anything wrong. But the only (straightforward) way to make everything succeed seems to be to adopt exactly what [name of your popular FORTRAN compiler] does (which is not our goal).
I would suggest that you try using ZLADIV( X, Y ), which computes X/Y without overflowing. But ZLADIV is from LAPACK, and ZTRSV is from BLAS, and BLAS shouldn't rely on LAPACK. Maybe you would be interested in comparing your implementation of complex division with the one in ZLADIV (which relies on DLADIV).
What about, in ZLAHQR, using ZLATRS instead of ZTRSV? (ZLATRS uses ZLADIV and does scaling to prevent overflow.)
This sounds tempting. There are multiple complex divisions in ZTRSV which potentially overflow. But ZTRSV is called by ZLATRS... Replacing our simple complex division with ZLADIV as suggested by @weslleyspereira does remove the overflow. Would this be an acceptable solution for you? To use ZLADIV in ZTRSV? (I will prepare a PR)
Good. Thanks for checking this.
This is correct; however, ZLATRS is far more than a simple call to ZTRSV. In short, ZLATRS calls ZTRSV when it thinks it is overflow-safe to do so. Otherwise it does the operations using ZLADIV. Since there is a significant performance difference between ZTRSV and ZLADIV, the goal is to use ZTRSV whenever possible. However, I need to reread the code to know what ZLATRS thinks is safe. I think its range of safe is much wider than what is working on .NET. So indeed ZLATRS might not be helpful. It would be good to know whether vanilla ZLATRS does the trick or not, though. I assume vanilla ZLATRS (instead of ZTRSV) is not working.
To repeat: if, using .NET, we have Z1 = ( A1=1.0e-160, B1=1.0e-160 ) and Z2 = ( A2=1.0e-160, B2=1.0e-160 ) and then we do the operation Z1 / Z2, then we have an overflow. Is this correct? Is the complex division algorithm explicitly forming ( A2**2 + B2**2 ) by using a formula such as ( A1 + i*B1 ) * ( A2 - i*B2 ) / ( A2**2 + B2**2 )? (And then yes, A2**2 + B2**2 underflows to 0, and then division by 0.) I would like to understand better what should be expected from a compiler here. I was naively expecting Z1 / Z2 to be correctly computed as ( 1.0e+00, 0.0e+00 ). @hokb: Can you give us some "simple" overflow/underflow that you see? I was naively expecting all these computations to be done correctly.
I would wait a little.
It of course depends on how complex division is implemented. The naive formula (as used by ifort btw) does overflow on 64 bit registers. There is only one 'official' .NET Complex datatype. This uses Smith's formula and does not overflow, even when using SSE registers. Same thing with our (ILNumerics) complex and fcomplex datatypes. They use a similar algorithm which does not overflow. But for our FORTRAN compiler we try to mimic the behavior of common FORTRAN code as closely as possible. Hence, we implemented the naive formula (as in the following paragraph).
Either that or overflow happens in the division 1.0 / 1.99997773436537E-320.
This is exactly what I am trying to understand, too! I spent some time stepping through machine instructions generated by ifort. And I was surprised to see that it uses the naive (prone to overflow) formula for complex division. On the other hand, it does not overflow on above example. But this is only due to the 80 bit register used ...
Math.Sqrt(Z1.real * Z1.real + Z1.imag * Z1.imag), using SSE (64 bit) for * gives: 2.8284113805211334e-160
=> gives: Positive Infinity
=> naive formula for Z1 / Z2, using SSE registers for *, gives: (∞+ NaN)
=> naive formula, divisor becomes Infinity, gives (NaN+ NaN).
We could commit to this expectation and use more robust algorithms. The only thing which puzzles me is to see that in the case of ifort (at these specific places) robustness seems to be realized by the use of larger registers. Since, as you know, .NET uses a JIT we cannot guarantee that eventually a specific register is used. The only guarantee is IEEE754. (I realize that this sounds like an argument for using more robust algorithms on our side...)
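The contrast between the naive formula and Smith's formula discussed in this comment can be sketched in Python, whose floats are IEEE 754 doubles on common platforms. This is an illustration under stated assumptions, not the code used by ifort, .NET, ILNumerics, or ZLADIV. The names `ieee_div`, `naive_div`, and `smith_div` are hypothetical, and `ieee_div` exists only because Python raises an exception on division by zero where the hardware would return Inf or NaN.

```python
import math

def ieee_div(num, den):
    # Emulate hardware division: x/0 yields inf or nan instead of raising.
    if den != 0.0:
        return num / den
    if num == 0.0 or math.isnan(num):
        return math.nan
    return math.copysign(math.inf, num) * math.copysign(1.0, den)

def naive_div(a, b, c, d):
    # (a + ib) / (c + id) with the denominator c*c + d*d formed explicitly.
    den = c * c + d * d
    return ieee_div(a * c + b * d, den), ieee_div(b * c - a * d, den)

def smith_div(a, b, c, d):
    # Smith's formula: divide by the larger of |c|, |d| first, so the
    # components of the divisor are never squared.
    if abs(c) >= abs(d):
        r = ieee_div(d, c)
        den = c + d * r
        return ieee_div(a + b * r, den), ieee_div(b - a * r, den)
    r = c / d
    den = c * r + d
    return ieee_div(a * r + b, den), ieee_div(b * r - a, den)

z = 1e-200
print(naive_div(z, z, z, z))  # squares underflow, denominator becomes 0
print(smith_div(z, z, z, z))  # (1.0, 0.0)
```

For Z1 = Z2 = (1e-200, 1e-200) the naive version ends up dividing zero by zero in this sketch, whereas Smith's formula returns the expected quotient. The exact Inf/NaN pattern of a naive implementation depends on the order of operations it uses, so the first line need not match the values quoted in the thread.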
ok
(1) I am sorry, I made a mistake. I should have written 1e-200 and not 1e-160. 1e-160 will go in the denormalized region. You can see that the results for
So we do lose quite some precision by working with denormalized numbers. (As expected.) Anyhow. Let me ask again. "For Z1 = ( A1=1.0e-200, B1=1.0e-200 ), is ABS( Z1 ) = 0.0e+00?"
(2) I do see how 80-bit can help here. Sure. 15 bits in the exponent for 80-bit registers (instead of 11 bits for 64-bit) will make quite a difference. With 15 bits in the exponent, the highest number is about 1e+4932 and the smallest positive about 1e-4931. Sure. That'll help with overflow and underflow. No need to use an "avoid unnecessary underflow/overflow" algorithm :). Most recent Intel machines do not have 80-bit registers. Do they? I think 80-bit register is a thing of the past. Are you sure that the improved accuracy that you see with ifort is coming from ifort using 80-bit registers, as opposed to ifort using a more stable algorithm? I am sorry. I thought 80-bit registers were a thing of the past.
(3) I need to review quite a few things here. What does ZLADIV do? What does ZLATRS do? When is ZLATRS used in LAPACK (instead of ZTRSV)? What does the Fortran standard say for complex division and ABS()? What do Fortran compilers do for complex division and ABS()? How much of a slowdown is it to use ZLATRS (as opposed to ZTRSV)? I am not up to speed on these topics.
Overall, I think .NET should use more robust algorithms for complex division and complex ABS(). I do not think ZLATRS is meant to cover this usage. So while ZLATRS might help, I do not think this will be bulletproof. In addition, I think the slowdown in going from ZTRSV to ZLATRS is too much. I do not think that changing ZTRSV to use ZLADIV() is a good idea. I think it is OK that ZTRSV relies on complex division to not unnecessarily underflow / overflow. If we go down that road, we would rewrite all complex divisions in LAPACK and all ABS() with our own functions, and I do not think we want to go down that road. I am open to discussion though. We can speak more.
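Point (1) can be checked with a few lines of Python, assuming its floats are IEEE 754 doubles (true on common platforms). This is a sketch of plain 64-bit behavior and says nothing about what a particular Fortran compiler or the .NET JIT emits. It shows that the square of 1e-160 lands in the denormalized range, that the square of 1e-200 is exactly zero, and that the reciprocal of a denormalized sum of squares overflows.

```python
import math
import sys

sq_160 = 1e-160 * 1e-160   # about 1e-320: representable only as a denormal
sq_200 = 1e-200 * 1e-200   # about 1e-400: below the smallest denormal

print(0.0 < sq_160 < sys.float_info.min)  # True: denormalized, not zero
print(sq_200 == 0.0)                      # True: underflows completely

# Naive ABS of (1e-160, 1e-160): nonzero, but with reduced precision.
approx = math.sqrt(sq_160 + sq_160)
exact = math.sqrt(2.0) * 1e-160
rel_err = abs(approx - exact) / exact
print(rel_err)  # far larger than double-precision rounding error

# Taking the reciprocal of the denormalized sum of squares overflows.
recip = 1.0 / (sq_160 + sq_160)
print(math.isinf(recip))  # True
```

So in 64-bit arithmetic the naive ABS of (1e-200, 1e-200) is zero, while at 1e-160 the result is only inaccurate. The overflow appears as soon as something divides by the denormalized intermediate.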
Naive attempt: 0.0e+0.
Right, a valid and important point!
It looks like ifort does still make use of X87 instructions. (Haven't tried with their latest FORTRAN compiler, which is released within Intel's oneAPI, though.)
These are all very valuable statements. My take-away is that we will have to implement the more robust version of basic complex number handling. I was not sure about that and tried to identify the best place to put this responsibility. It is very reasonable now.
Closes #575.
This PR applies the solution from #575 (comment) to CLAHQR and ZLAHQR.
Thanks to @thijssteel!