Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault xeigtstz 335 #492

Merged

Conversation

weslleyspereira
Copy link
Collaborator

@weslleyspereira weslleyspereira commented Feb 18, 2021

Closes #335.

This PR adds recursive flags to the default build systems, and fix segfault in EIG tests.

@weslleyspereira
Copy link
Collaborator Author

  • I managed to remove the recursive flags from TESTING/EIG/Makefile using Segfault in ./EIG/xeigtstz #335 (comment)
  • However, I spent some time trying to remove the recursive flags from the CMake build but couldn't make it right. This part is still missing.

@h-vetinari
Copy link
Contributor

I think there's a small regex-typo in #335 (comment). Should be

string(REGEX REPLACE "-[fM]?recursive" "" FFLAGS ${FFLAGS})

to cover all of -recursive, -frecursive, -Mrecursive.

@thijssteel
Copy link
Collaborator

thijssteel commented Feb 18, 2021

I see this as a bit of a bandaid and might cause problems later when we inevitably forget why that special flag was there.

I think we should rewrite the eigenvalue tests so that they allocate their workspace explicitly instead of using static allocations. If we don't do that now, perhaps we should add a note about it for later.

Looks like @christoph-conrads also mentioned this #335 (comment)

ah, but langou mentioned that bandaids are fine for now, nevermind.

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 18, 2021

I think there's a small regex-typo in #335 (comment). Should be

string(REGEX REPLACE "-[fM]?recursive" "" FFLAGS ${FFLAGS})

to cover all of -recursive, -frecursive, -Mrecursive.

Thanks! Does anyone know how to do it just for the target zchkee.o?

I see this as a bit of a bandaid and might cause problems later when we inevitably forget why that special flag was there.

I think we should rewrite the eigenvalue tests so that they allocate their workspace explicitly instead of using static allocations. If we don't do that now, perhaps we should add a note about it for later.

Looks like @christoph-conrads also mentioned this #335 (comment)

ah, but langou mentioned that bandaids are fine for now, nevermind.

Yes, the idea is to avoid spending time here as @langou explained. But I agree we should leave a note about it. I will do it in this PR.

@martin-frbg
Copy link
Collaborator

I'd just put

string(REGEX REPLACE "-[fM]?recursive" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")

somewhere near the top of TESTING/EIG/CMakeLists.txt (and as noted elsewhere, this simple fix fails when you try to compile with OpenMP)

@weslleyspereira
Copy link
Collaborator Author

I'd just put

string(REGEX REPLACE "-[fM]?recursive" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")

somewhere near the top of TESTING/EIG/CMakeLists.txt (and as noted elsewhere, this simple fix fails when you try to compile with OpenMP)

Yes... I ran into this problem and couldn't solve it. I am starting to think how much work would be necessary to apply the solution from #335 (comment)

@thijssteel
Copy link
Collaborator

probably not that hard, it's just 4 files and a couple of static matrices that need to be changed to allocatable (and then also the allocation and deallocation)

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 18, 2021

This last commit allocates the matrices dynamically.

  • Tests with nep.in don't generate SEG FAULT with -frecursive and without changing ulimit -s. This is valid for both Makefile and CMake build systems;
  • Tests with sep.in take forever in my computer, and I couldn't see the results yet. I am putting here to get your feedback, and to see if Travis will have problems with those.

@martin-frbg
Copy link
Collaborator

My current impression is that xeigtstz <sep.in appears to be stuck in ZLARGV, while the other xeigtst? passed reasonably quickly with the same input.

@thijssteel
Copy link
Collaborator

Looks like the dimensions might be wrong

A(NMAX,NMAX) -> A( NMAX*NMAX, NEED )

If that couses some out of bounds issue, that could cause some NAN values leading to the infinite scaling issue

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 18, 2021

That's true! I'm checking this right now... it gets stuck at
https://github.com/Reference-LAPACK/lapack/blob/master/SRC/zlargv.f#L218
with SCALE = 0

@weslleyspereira
Copy link
Collaborator Author

Yes... Sorry! My bad :|

@thijssteel
Copy link
Collaborator

SCALE = 0 get stuck ??? I don't see how that's possible, because then G should be zero.

@martin-frbg
Copy link
Collaborator

Cute. So there is another flavor of the infinite scaling issue that is not (yet ?) taken care of by the evil "count to 20" hack. Maybe scale itself is NaN already, so won't compare to anything ?

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 18, 2021

SCALE = 0 get stuck ??? I don't see how that's possible, because then G should be zero.

Well... Maybe, due to bad allocation, SCALE starts with NaN or Inf at https://github.com/Reference-LAPACK/lapack/blob/master/SRC/zlargv.f#L194
Then line 203 may do the job.

@weslleyspereira
Copy link
Collaborator Author

Thanks for the help! I think I abused the Copy&Paste method in the last commit. Now the tests pass. :)

@langou
Copy link
Contributor

langou commented Feb 18, 2021

If someone has an opinion on how ALLOCATE() should be coded in LAPACK (syntax, etc.), now is the good time to let it known.

My opinion: I do not have an opinion. (Yet!) @weslleyspereira introduces ALLOCATE() in TESTING/EIG There already was ALLOCATE() in TESTING/LIN. The style between both is a tat different. I am fine with different styles. We can homogenize, choose, decide later. At this point, I am glad that this 15-year-old bug is gone. Good job @weslleyspereira. I like the fix. Seems what we should have done from the get-go. For some reason, I was reluctant to use ALLOCATE(). (Thanks to @thijssteel to push a little on us using the ALLOCATE().)

@thijssteel
Copy link
Collaborator

The massive workspace requirements are also present in the other eigenvalue tests, its just that DOUBLE COMPLEX takes up twice the memory as DOUBLE PRECISION / COMPLEX. I suggest changing those too for good measure.

Nice fix

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 18, 2021

The massive workspace requirements are also present in the other eigenvalue tests, its just that DOUBLE COMPLEX takes up twice the memory as DOUBLE PRECISION / COMPLEX. I suggest changing those too for good measure.

I agree. I can do that.

@weslleyspereira
Copy link
Collaborator Author

I also tested with flag "-fopenmp". In this case, we have:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	1316145		0	(0.000%)	2	(0.000%)	
DOUBLE PRECISION	1316967		0	(0.000%)	2	(0.000%)	
COMPLEX          	776316		0	(0.000%)	4	(0.001%)	
COMPLEX16         	777128		0	(0.000%)	4	(0.001%)	

--> ALL PRECISIONS	4186556		0	(0.000%)	12	(0.000%)

testing_results(Makefile).txt

I don't think these issues are the scope of #492, but it is worth registering.

@weslleyspereira weslleyspereira marked this pull request as ready for review February 18, 2021 21:25
@martin-frbg
Copy link
Collaborator

Possibly related to machine precision or compiler issues ? I only get the 2 each for REAL and DOUBLE (from error exit tests in ?SYEVD_2STAGE) on i7-7500U

@codecov
Copy link

codecov bot commented Feb 18, 2021

Codecov Report

Merging #492 (7c74f6a) into master (8960228) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #492   +/-   ##
=======================================
  Coverage   83.33%   83.33%           
=======================================
  Files        1820     1820           
  Lines      170857   170857           
=======================================
  Hits       142384   142384           
  Misses      28473    28473           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8960228...7c74f6a. Read the comment docs.

@weslleyspereira
Copy link
Collaborator Author

Possibly related to machine precision or compiler issues ? I only get the 2 each for REAL and DOUBLE (from error exit tests in ?SYEVD_2STAGE) on i7-7500U

I don't know exactly the cause, but the errors are
*** XERBLA was called from SSYEVD_2STAGE with INFO = 8 instead of 10 ***
*** XERBLA was called from DSYEVD_2STAGE with INFO = 8 instead of 10 ***
*** XERBLA was called from CHEEVD_2STAGE with INFO = 8 instead of 10 ***
*** XERBLA was called from CHBEVD_2STAGE with INFO = 11 instead of 13 ***
*** XERBLA was called from CHBEVD_2STAGE with INFO = 11 instead of 15 ***
*** XERBLA was called from ZHEEVD_2STAGE with INFO = 8 instead of 10 ***
*** XERBLA was called from ZHBEVD_2STAGE with INFO = 11 instead of 13 ***
*** XERBLA was called from ZHBEVD_2STAGE with INFO = 11 instead of 15 ***
and come from https://github.com/Reference-LAPACK/lapack/blob/master/TESTING/xerbla.f#L123 :

 9999 FORMAT( ' *** XERBLA was called from ', A, ' with INFO = ', I6,
     $      ' instead of ', I2, ' ***' )

What does it mean? Maybe concurrent threads acess the same INFO and that is why we see the problems only with -fopenmp.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 19, 2021

Seems the test driver fed it a matrix that led to fewer non-converged off-diagonal elements in the output than expected. This might even be a side effect of thijs' recent work on improving convergence elsewhere in the code.
Update: I get the same failure messages after reverting your PR (but still none for the complex cases)

@langou
Copy link
Contributor

langou commented Feb 19, 2021

  1. In LAPACK/TESTING/EIG we have a special XERBLA that is used to intercept the calls to XERBLA and instead of stopping the execution (as LAPACK/SRC/XERBLA does), this LAPACK/TESTING/EIG /XERBLA checks that the error INFO value is the one that is expected, it also checks that XERBLA is called with the correct routine name. See LAPACK/TESTING/EIG/XERBLA.f

  2. This LAPACK/TESTING/EIG/XERBLA.f uses some COMMON blocks. (Global variables.) So yes, this might not be thread-safe.

  3. The CHBEVD_2STAGE collection of routines uses OPENMP in the HB 2 ST routines, for example in, chetrd_hb2st.F

So when you compile LAPACK with -fopenmp, chetrd_hb2st.F will use multithreading. I do not know how this interacts with the COMMON blocks of LAPACK/TESTING/EIG/XERBLA.f

I do not think this error message as anything to do with a convergence error. We, purposely, call CHBEVD_2STAGE with say a negative value of M (which is an invalid call) to check that CHBEVD_2STAGE calls XERBLA with the correct error message. So we do this to check that the error messages (the calls to XERBLA) are correct.

I do not know what to do. Maybe we disable the TESTING of these error messages when we compile LAPACK with OpenMP. COMMON is only used in TESTING for this reason. I do not know whether we should expect the code to work or not. I do not know how to fix this.

*** XERBLA was called from SSYEVD_2STAGE with INFO = 8 instead of 10 ***
My interpretation is that there is a global variable INFOT that we set on say -8, then we call SSYEVD_2STAGE with the 8th variable invalid, (and 1st, 2nd, etc, to 7th valid,) so that a correct behavior is that SSYEVD_2STAGE calls XERBLA with INFO=-8, then the "testing" XERBLA checks whether the global variable INFOT (which is -8) matches its input variable INFO (which should be -8). If global variable INFOT matches input variable INFO, then good. If global variable INFOT (-10) does not match input variable INFO (-8), then you see the message below and this will count as one "error" in the error message testing of LAPACK.
*** XERBLA was called from SSYEVD_2STAGE with INFO = 8 instead of 10 ***

In multithreaded execution, it might be that another concurrent thread changes the value of the global variable INFOT. However I do not really understand how that could happen.

@thijssteel
Copy link
Collaborator

would forcing the number of openmp threads to 1 help?

@martin-frbg
Copy link
Collaborator

Guess I was misled by (in ssyevd_2stage.f)

        INFO is INTEGER
*>          = 0:  successful exit
*>          < 0:  if INFO = -i, the i-th argument had an illegal value
*>          > 0:  if INFO = i and JOBZ = 'N', then the algorithm failed
*>                to converge; i off-diagonal elements of an intermediate
*>                tridiagonal form did not converge to zero;
*>                if INFO = i and JOBZ = 'V', then the algorithm failed
*>                to compute an eigenvalue while working on the submatrix
*>                lying in rows and columns INFO/(N+1) through
*>                mod(INFO,N+1).

In any case, I guess one of the threads may be raising an error prematurely, for the data that it sees - but I do not see the ?sb2st
access INFO in or after its openmp-parallel region.

@martin-frbg
Copy link
Collaborator

Yep, running the test with OMP_NUM_THREADS=1 seems to "help".

@thijssteel
Copy link
Collaborator

thijssteel commented Feb 19, 2021

How about a nice:

#if defined(_OPENMP)
    N_THREADS = OMP_GET_NUM_THREADS()
    CALL OMP_SET_NUM_THREADS(1)
#endif
....
#if defined(_OPENMP)
    CALL OMP_SET_NUM_THREADS(N_THREADS)
#endif

in those specific tests so we can still test the multithreading in the actual routine?

@weslleyspereira
Copy link
Collaborator Author

Seems the test driver fed it a matrix that led to fewer non-converged off-diagonal elements in the output than expected. This might even be a side effect of thijs' recent work on improving convergence elsewhere in the code.
Update: I get the same failure messages after reverting your PR (but still none for the complex cases)

Based on that, I would say we, maybe, open a new issue to deal with thread safety. I think this PR achieved its goals.

@@ -1846,8 +1869,16 @@ PROGRAM CCHKEE
CALL ALAREQ( C3, NTYPES, DOTYPE, MAXTYP, NIN, NOUT )
CALL XLAENV( 1, 1 )
CALL XLAENV( 9, 25 )
IF( TSTERR )
$ CALL CERRST( 'CST', NOUT )
IF( TSTERR ) THEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's it. Something like this and somewhere in the code like this. Good job @weslleyspereira. Thanks @thijssteel

@weslleyspereira
Copy link
Collaborator Author

weslleyspereira commented Feb 19, 2021

Actually... @langou pushed me forward to try @thijssteel's idea #492 (comment), and it was easier than I thought.

See 7c74f6a

The non-thread-safe subroutines are xERRST

Thanks @thijssteel and @langou, and also @martin-frbg who first test it!! :)

@langou langou merged commit 47a4a75 into Reference-LAPACK:master Feb 19, 2021
@weslleyspereira weslleyspereira mentioned this pull request Feb 20, 2021
christoph-conrads pushed a commit to christoph-conrads/lapack that referenced this pull request May 23, 2021
christoph-conrads pushed a commit to christoph-conrads/lapack that referenced this pull request May 23, 2021
…ault-xeigtstz-335

Fix segfault xeigtstz 335
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault in ./EIG/xeigtstz
5 participants