Fix segfault xeigtstz 335 #492
…sive flags from the build of zchkee.o
|
I think there's a small regex-typo in #335 (comment). Should be
to cover all of |
I see this as a bit of a bandaid that might cause problems later, when we inevitably forget why that special flag was there. I think we should rewrite the eigenvalue tests so that they allocate their workspace explicitly instead of using static allocations. If we don't do that now, perhaps we should add a note about it for later. It looks like @christoph-conrads also mentioned this in #335 (comment). Ah, but langou mentioned that bandaids are fine for now, never mind. |
Thanks! Does anyone know how to do it just for the target
Yes, the idea is to avoid spending time here as @langou explained. But I agree we should leave a note about it. I will do it in this PR. |
I'd just put
somewhere near the top of TESTING/EIG/CMakeLists.txt (and as noted elsewhere, this simple fix fails when you try to compile with OpenMP) |
Yes... I ran into this problem and couldn't solve it. I am starting to wonder how much work would be necessary to apply the solution from #335 (comment) |
Probably not that hard. It's just 4 files and a couple of static matrices that need to be changed to allocatable (plus the corresponding allocation and deallocation). |
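As a rough illustration of the change described above, here is a minimal free-form Fortran sketch that replaces large static arrays with allocatable ones and checks the allocation status. The names `nmax`, `ncmax`, `lwork`, `a`, and `work`, and the sizes used, are assumptions chosen for illustration; they are not taken from the PR, and the actual test drivers are fixed-form sources.

```fortran
program workspace_alloc_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Hypothetical problem sizes, for illustration only.
   integer, parameter :: nmax = 132, ncmax = 20
   integer, parameter :: lwork = nmax*(5*nmax + 20)
   ! Allocatable arrays live on the heap, so they do not inflate the
   ! static data segment or the stack the way large fixed-size arrays can.
   complex(dp), allocatable :: a(:,:), work(:)
   integer :: istat

   allocate(a(nmax*nmax, ncmax), stat=istat)
   if (istat /= 0) then
      write(*,*) 'Allocation of A failed, STAT = ', istat
      stop 1
   end if

   allocate(work(lwork), stat=istat)
   if (istat /= 0) then
      write(*,*) 'Allocation of WORK failed, STAT = ', istat
      stop 1
   end if

   ! Initialize before use so no uninitialized values leak into the tests.
   a = (0.0_dp, 0.0_dp)
   work = (0.0_dp, 0.0_dp)

   write(*,*) 'A holds ', size(a), ' elements; WORK holds ', size(work)

   ! Release the memory explicitly once the tests are done.
   deallocate(a, work)
end program workspace_alloc_sketch
```

Checking `stat=` lets the driver report a failed allocation instead of crashing, and keeping the declared shape in one place makes dimension mismatches easier to spot.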
…STING/EIG/zchkee.f; Improves CMakeLists.txt
This last commit allocates the matrices dynamically.
|
My current impression is that |
Looks like the dimensions might be wrong
If that causes an out-of-bounds access, it could produce NaN values, leading to the infinite scaling issue |
That's true! I'm checking this right now... it gets stuck at |
Yes... Sorry! My bad :| |
|
Cute. So there is another flavor of the infinite scaling issue that is not (yet?) taken care of by the evil "count to 20" hack. Maybe scale itself is already NaN, so it won't compare to anything? |
Well... Maybe, due to bad allocation, |
Thanks for the help! I think I abused the Copy&Paste method in the last commit. Now the tests pass. :) |
If someone has an opinion on how ALLOCATE() should be coded in LAPACK (syntax, etc.), now is a good time to make it known. My opinion: I do not have an opinion. (Yet!) @weslleyspereira introduces ALLOCATE() in TESTING/EIG. There already was ALLOCATE() in TESTING/LIN. The style of the two is a tad different. I am fine with different styles. We can homogenize, choose, decide later. At this point, I am glad that this 15-year-old bug is gone. Good job @weslleyspereira. I like the fix. It seems to be what we should have done from the get-go. For some reason, I was reluctant to use ALLOCATE(). (Thanks to @thijssteel for pushing us a little to use ALLOCATE().) |
The massive workspace requirements are also present in the other eigenvalue tests; it's just that DOUBLE COMPLEX takes up twice the memory of DOUBLE PRECISION / COMPLEX. I suggest changing those too for good measure. Nice fix |
I agree. I can do that. |
I also tested with flag "-fopenmp". In this case, we have:
I don't think these issues are within the scope of #492, but they are worth recording. |
Possibly related to machine precision or compiler issues? I only get 2 each for REAL and DOUBLE (from the error exit tests in ?SYEVD_2STAGE) on an i7-7500U |
Codecov Report
@@ Coverage Diff @@
## master #492 +/- ##
=======================================
Coverage 83.33% 83.33%
=======================================
Files 1820 1820
Lines 170857 170857
=======================================
Hits 142384 142384
Misses 28473 28473
|
I don't know exactly the cause, but the errors are
 9999 FORMAT( ' *** XERBLA was called from ', A, ' with INFO = ', I6,
     $      ' instead of ', I2, ' ***' )
What does it mean? Maybe concurrent threads access the same INFO, and that is why we see the problems only with |
Seems the test driver fed it a matrix that led to fewer non-converged off-diagonal elements in the output than expected. This might even be a side effect of thijs' recent work on improving convergence elsewhere in the code. |
So when you compile LAPACK with -fopenmp, chetrd_hb2st.F will use multithreading. I do not know how this interacts with the COMMON blocks of LAPACK/TESTING/EIG/XERBLA.f. I do not think this error message has anything to do with a convergence error. We purposely call CHBEVD_2STAGE with, say, a negative value of M (which is an invalid call) to check that CHBEVD_2STAGE calls XERBLA with the correct error message. So we do this to check that the error messages (the calls to XERBLA) are correct. I do not know what to do. Maybe we disable the TESTING of these error messages when we compile LAPACK with OpenMP. COMMON is only used in TESTING for this reason. I do not know whether we should expect the code to work or not. I do not know how to fix this.
In multithreaded execution, it might be that another concurrent thread changes the value of the global variable INFOT. However, I do not really understand how that could happen. |
Would forcing the number of OpenMP threads to 1 help? |
Guess I was misled by (in ssyevd_2stage.f)
In any case, I guess one of the threads may be raising an error prematurely, for the data that it sees - but I do not see the ?sb2st |
Yep, running the test with OMP_NUM_THREADS=1 seems to "help". |
How about a nice:
in those specific tests so we can still test the multithreading in the actual routine? |
Based on that, I would say we should perhaps open a new issue to deal with thread safety. I think this PR has achieved its goals. |
@@ -1846,8 +1869,16 @@ PROGRAM CCHKEE
 CALL ALAREQ( C3, NTYPES, DOTYPE, MAXTYP, NIN, NOUT )
 CALL XLAENV( 1, 1 )
 CALL XLAENV( 9, 25 )
-IF( TSTERR )
-$ CALL CERRST( 'CST', NOUT )
+IF( TSTERR ) THEN
Yes, that's it. Something like this and somewhere in the code like this. Good job @weslleyspereira. Thanks @thijssteel
Actually... @langou pushed me to try @thijssteel's idea #492 (comment), and it was easier than I thought. See 7c74f6a The non-thread-safe subroutines are Thanks @thijssteel and @langou, and also @martin-frbg who first tested it!! :) |
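The idea discussed in this thread, limiting OpenMP to one thread only around the error-exit tests so that the routines themselves can still run multithreaded, might look roughly like the sketch below. This is an illustration under stated assumptions, not the contents of 7c74f6a: `run_error_exit_tests` is a hypothetical stand-in for a call such as `CALL CERRST( 'CST', NOUT )`, and the free-form `!$` conditional-compilation sentinel is used here so that the program also builds without OpenMP.

```fortran
program error_exit_single_thread_sketch
!$ use omp_lib
   implicit none
   integer :: n_threads

   ! Default used when the program is compiled without OpenMP.
   n_threads = 1

   ! Lines starting with "!$ " are compiled only when OpenMP is enabled.
   ! Remember the current thread limit, then restrict to one thread.
!$ n_threads = omp_get_max_threads()
!$ call omp_set_num_threads(1)

   ! Hypothetical stand-in for the error-exit tests, which depend on
   ! shared state (COMMON blocks) that is not thread safe.
   call run_error_exit_tests()

   ! Restore the previous limit so later tests can run multithreaded.
!$ call omp_set_num_threads(n_threads)

   write(*,*) 'Thread limit restored to ', n_threads

contains

   subroutine run_error_exit_tests()
      write(*,*) 'Error-exit tests would run here on a single thread.'
   end subroutine run_error_exit_tests

end program error_exit_single_thread_sketch
```

Saving and restoring the thread count keeps the restriction local to the error-exit tests, unlike setting `OMP_NUM_THREADS=1` for the whole run.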
…ault-xeigtstz-335 Fix segfault xeigtstz 335
Closes #335.
This PR adds recursive flags to the default build systems and fixes the segfault in the EIG tests.