
Intel MKL gives nan on cholmod_dl_demo for nd6k.mtx as the input. #554

Closed
syby119 opened this issue Nov 30, 2023 · 39 comments
Assignees
Labels
bug Something isn't working

Comments

@syby119
Contributor

syby119 commented Nov 30, 2023

Using Intel MKL as the backend for BLAS and LAPACK, cholmod_dl_demo gives a warning when factorizing the matrix.

CHOLMOD warning: matrix not positive definite. file: C:\Users\syby119\github\syby119\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086
cholmod error: file: C:\Users\syby119\github\syby119\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086 status: 1: matrix not positive definite

And the result is wrong.

residual (|Ax-b|/(|A||x|+|b|)):      nan      nan
residual      nan (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     0.0e+00

maxresid: nan : test failure
FAIL maxresid:        nan (double) : C:\Users\syby119\Desktop\nd6k.mtx

This happens on both Windows and WSL with Ubuntu 22.04, with both CPU and GPU.

However, when I tried OpenBLAS and LAPACK as the backend, the result is correct, with both CPU and GPU.

maxresid: 1.66778e-15 : All tests passed
OK maxresid:    1.7e-15 (double) : stdin

I think this may be an external bug and we should not use MKL as a backend for now.

@syby119
Contributor Author

syby119 commented Nov 30, 2023

The CUDA version is 12.0 and OpenMP is enabled.
The Intel MKL is provided by oneAPI, version 2024.0.0.

@DrTimothyAldenDavis
Owner

This can happen if the BLAS integer size is not detected properly. What is the SUITESPARSE_BLAS_INT, and does it match the MKL library you are using?
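A quick way to check the size SuiteSparse was configured with is a tiny standalone program (a minimal sketch, assuming only that SuiteSparse_config.h is on the include path; it is not part of the demo itself):

#include <stdio.h>
#include "SuiteSparse_config.h"

int main (void)
{
    // MKL "lp64" expects 32-bit BLAS integers and "ilp64" expects 64-bit ones,
    // so this size must match the MKL variant that actually gets linked.
    printf ("sizeof (SUITESPARSE_BLAS_INT) = %d\n",
        (int) sizeof (SUITESPARSE_BLAS_INT)) ;
    return (0) ;
}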

@syby119
Contributor Author

syby119 commented Nov 30, 2023

This can happen if the BLAS integer size is not detected properly. What is the SUITESPARSE_BLAS_INT, and does it match the MKL library you are using?

On my machine, SUITESPARSE_BLAS_INT is int32_t in cholmod_dl_demo on Windows.
I'll try to find a way to check the MKL side of this on Windows later.

@syby119
Contributor Author

syby119 commented Nov 30, 2023

I used cholmod_l_print_common to print information about my system.
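For reference, a minimal call sequence looks roughly like this (a sketch, not the demo's exact code; it assumes the int64 CHOLMOD API):

#include "cholmod.h"

cholmod_common Common, *cm = &Common ;
cholmod_l_start (cm) ;
cm->print = 4 ;                          // verbose enough to include the Architecture block
cholmod_l_print_common ("cm", cm) ;      // prints the report shown below
cholmod_l_finish (cm) ;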

CHOLMOD version 5.1.0, Dec 30, 2023: message: status: OK
  Architecture:
    sizeof(int):      4
    sizeof(int64_t):  8
    sizeof(void *):   8
    sizeof(double):   8
    sizeof(Int):      8 (CHOLMOD's basic integer)
    sizeof(SUITESPARSE_BLAS_INT): 4 (integer used in the BLAS)

@syby119
Contributor Author

syby119 commented Nov 30, 2023

I think the BLAS integer type is right.

-- Looking for Intel 32-bit BLAS
-- Found Intel10_64lp 32-bit BLAS
-- OpenMP C libraries:      /usr/lib/gcc/x86_64-linux-gnu/11/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.a 
-- OpenMP C include:         
-- OpenMP C flags:          -fopenmp 
-- BLAS libraries:      /opt/intel/oneapi/mkl/2024.0/lib/libmkl_gf_lp64.so;/opt/intel/oneapi/mkl/2024.0/lib/libmkl_gnu_thread.so;/opt/intel/oneapi/mkl/2024.0/lib/libmkl_core.so;/usr/lib/gcc/x86_64-linux-gnu/11/libgomp.so;-lm;-ldl 
-- BLAS linker flags:    
-- BLAS include:         
-- ------------------------------------------------------------------------
-- SuiteSparse CMAKE report for: SuiteSparseConfig
-- ------------------------------------------------------------------------
-- inside common SuiteSparse root:  false
-- install in SuiteSparse/lib and SuiteSparse/include: OFF
-- build type:           Release
-- BUILD_SHARED_LIBS:    ON
-- BUILD_STATIC_LIBS:    ON
-- use OpenMP:           yes 
-- C compiler:           /usr/bin/cc 
-- C flags:               -fopenmp 
-- C++ compiler:         
-- C++ flags:            
-- C Flags release:      -O3 -DNDEBUG 
-- C++ Flags release:     
-- Fortran compiler:     /usr/bin/f95 
-- compile definitions:  BLAS_Intel10_64lp
-- BLAS integer:         int32_t

@DrTimothyAldenDavis
Owner

OK. But MKL has both 32-bit and 64-bit integer versions. If you have 4 bytes but MKL has 8, then that would break. OpenBLAS has 32-bit integers only.

@DrTimothyAldenDavis
Owner

Got it. Just saw that now.

That all looks ok.

I haven't tried the 2024 mkl BLAS yet on Linux. I will give it a try

@syby119
Contributor Author

syby119 commented Nov 30, 2023

Got it. Just saw that now.

That all looks ok.

I haven't tried the 2024 mkl BLAS yet on Linux. I will give it a try

Oh, I see. One should always set up the environment variables with

. /opt/intel/oneapi/setvars.sh

Everything works well now.

On Windows, the internal environment of Visual Studio somewhat conflicts with MKL. When the environment is set up in cmd, the program works well; however, in Visual Studio it reports an error. I will try to work out a way to set up the environment in the Visual Studio IDE.

@syby119 syby119 closed this as completed Nov 30, 2023
@syby119
Contributor Author

syby119 commented Nov 30, 2023

For Windows Visual Studio users:
Open cmd to set environment variables

cd C:\Program Files (x86)\Intel\oneAPI
setvars.bat

With the same cmd, compile your code

cd path/to/SuiteSparse
cmake -Bbuild .
cmake --build build --config=Release --parallel

With the same cmd, open the project with

cd build
SuiteSparse.sln

It seems that in this way MSVC inherits the environment variables set up by Intel oneAPI.

@syby119
Contributor Author

syby119 commented Nov 30, 2023

Got it. Just saw that now.

That all looks ok.

I haven't tried the 2024 mkl BLAS yet on Linux. I will give it a try

Oops, it seems that Windows with the GPU is the only configuration that cannot pass cholmod_dl_demo.exe,
whereas Windows with the CPU passes, and WSL passes the demo with both CPU and GPU.
The error message is as listed above.

Environment

CHOLMOD version 5.1.0, Dec 30, 2023: message: status: OK
  Architecture:
    sizeof(int):      4
    sizeof(int64_t):  8
    sizeof(void *):   8
    sizeof(double):   8
    sizeof(Int):      8 (CHOLMOD's basic integer)
    sizeof(SUITESPARSE_BLAS_INT): 4 (integer used in the BLAS)

BLAS

-- Looking for Intel 32-bit BLAS
-- Found Intel10_64lp 32-bit BLAS

@syby119 syby119 reopened this Nov 30, 2023
@mmuetzel
Contributor

mmuetzel commented Dec 1, 2023

I just opened #559 that adds a runner for that configuration to the CI. All ctests have passed on prior tests on my fork on GitHub.
I don't know if this particular demo runs as part of the ctests though.

@syby119
Contributor Author

syby119 commented Dec 2, 2023

I just opened #559 that adds a runner for that configuration to the CI. All ctests have passed on prior tests on my fork on GitHub. I don't know if this particular demo runs as part of the ctests though.

Thanks for developing the library; configuring the project has become super easy (with vcpkg).

I have rebuilt the newest dev branch (0b41a3e) with Intel oneAPI MKL 2024.0 + CUDA 12.0, and enabled the GPU manually:

// cholmod_sl_demo.c / cholmod_dl_demo.c
cm = &Common ;
cholmod_l_start (cm) ;
cm->print = 4 ;

// Add: Switch useGPU on
cm->useGPU = 1;

cholmod_sl_demo gets the correct result:

residual (|Ax-b|/(|A||x|+|b|)): 3.14e-07 2.81e-07
residual  8.2e-08 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     4.4e-04

maxresid: 3.1437e-07 : All tests passed
OK maxresid:    3.1e-07 (single) : C:\Users\syby119\Desktop\nd6k.mtx

However, cholmod_dl_demo gets the wrong result:

Analyze: flop 1.11904e+11 lnz 3.97732e+07
Factorizing A
CHOLMOD warning: matrix not positive definite. file: C:\Users\syby119\github\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086
cholmod error: file: C:\Users\syby119\github\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086 status: 1: matrix not positive definite

....


residual (|Ax-b|/(|A||x|+|b|)):      nan      nan
residual      nan (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     0.0e+00

I can see in the Windows Task Manager that the GPU is used. Besides, the CPU version of both demos gets the correct result. For example, cholmod_dl_demo gives:

residual (|Ax-b|/(|A||x|+|b|)): 7.26e-16 7.46e-16
residual  1.7e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04

maxresid: 7.45537e-16 : All tests passed
OK maxresid:    7.5e-16 (double) : C:\Users\syby119\Desktop\nd6k.mtx

@DrTimothyAldenDavis
Owner

The GPU is only used for the double case, so cholmod_sl_demo won't trigger it.

If you uncomment this line:

// Uncomment this line to get a summary of the time spent in the BLAS,
// for development diagnostics only:
// #define BLAS_TIMER

and recompile, it will print the # of calls in the GPU. That will verify that the GPU is being used.

If you enable the GPU but use another BLAS, does the error go away?

I don't do a lot of testing on Windows (just the CI in github), so it's possible something is not working there. This cholmod_dl_demo works fine on my Linux desktop with its GPUs. But I'm using CUDA 11.7 not 12, and an older MKL (2022 if I recall).

@syby119
Contributor Author

syby119 commented Dec 2, 2023

The GPU is only used for the double case, so cholmod_sl_demo won't trigger it.

If you uncomment this line:

// Uncomment this line to get a summary of the time spent in the BLAS,
// for development diagnostics only:
// #define BLAS_TIMER

and recompile, it will print the # of calls in the GPU. That will verify that the GPU is being used.

If you enable the GPU but use another BLAS, does the error go away?

I don't do a lot of testing on Windows (just the CI in github), so it's possible something is not working there. This cholmod_dl_demo works fine on my Linux desktop with its GPUs. But I'm using CUDA 11.7 not 12, and an older MKL (2022 if I recall).

Thank you for your kind reply. I'll try to use another BLAS/LAPACK on Windows for testing.

I'm not entirely sure that cholmod_sl_demo doesn't use the GPU at all, although the printed statistics show no GPU calls.

cm->useGPU=1

nmethods: 1
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze cputime:       13.7128                                     <----------------------------- super slow
factor  cputime:         0.4576 mflop: 244571.9
solve   cputime:         0.0184 mflop:   8630.4
overall cputime:        14.1888 mflop:   7898.0
solve   cputime:         0.0175 mflop:   9077.7 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          365 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 3.14e-07 2.81e-07
residual  8.2e-08 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     4.4e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          857 time   1.3654e-01
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          683 time   6.9120e-02
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          175 time   7.3236e-02
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          174 time   3.0214e-02
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   3.0911e-01 GPU   0.0000e+00 total:   3.0911e-01
maxresid: 3.1437e-07 : All tests passed
OK maxresid:    3.1e-07 (single) : C:\Users\syby119\Desktop\nd6k.mtx

However, with cm->useGPU = 0, the demo runs much faster than with cm->useGPU = 1, and in this case the Windows Task Manager shows no GPU utilization at all.

cm->useGPU=0

nmethods: 1
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze cputime:        0.7724                                     <----------------------------- super fast
factor  cputime:         0.4077 mflop: 274450.9
solve   cputime:         0.0187 mflop:   8512.1
overall cputime:         1.1988 mflop:  93477.2
solve   cputime:         0.0180 mflop:   8854.5 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          365 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 3.14e-07 2.81e-07
residual  8.2e-08 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     4.4e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          857 time   1.1811e-01
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          683 time   6.7922e-02
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          175 time   7.2116e-02
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          174 time   2.9572e-02
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   2.8772e-01 GPU   0.0000e+00 total:   2.8772e-01
maxresid: 3.1437e-07 : All tests passed
OK maxresid:    3.1e-07 (single) : C:\Users\syby119\Desktop\nd6k.mtx

I'm quite confused that useGPU=1 causes the demo to run much slower in the analyze step. How can that be?

@syby119
Contributor Author

syby119 commented Dec 2, 2023

The GPU is only used for the double case, so cholmod_sl_demo won't trigger it.

If you uncomment this line:

// Uncomment this line to get a summary of the time spent in the BLAS,
// for development diagnostics only:
// #define BLAS_TIMER

and recompile, it will print the # of calls in the GPU. That will verify that the GPU is being used.

If you enable the GPU but use another BLAS, does the error go away?

I don't do a lot of testing on Windows (just the CI in github), so it's possible something is not working there. This cholmod_dl_demo works fine on my Linux desktop with its GPUs. But I'm using CUDA 11.7 not 12, and an older MKL (2022 if I recall).

I've searched desperately for another BLAS and LAPACK that can be used on Windows, but failed to find one. Instead, I tried the project suitesparse-metis-for-windows.

First, I commented out the hunter package download for MKL in the top-level CMakeLists.txt (line 275) to ensure the library uses the system-wide MKL, i.e., the MKL provided with Intel oneAPI 2024.0.

elseif (WITH_MKL)
#   hunter_add_package(mkl)
  set(BLA_VENDOR "Intel10_64lp")
  find_package(BLAS REQUIRED)
  find_package(LAPACK REQUIRED)
  set(SuiteSparse_LINKER_LAPACK_BLAS_LIBS ${LAPACK_LIBRARIES} ${BLAS_LIBRARIES})
  # for suitesparse-config file set method used to find LAPACK (and BLAS)
  set(SuiteSparse_LAPACK_used_CONFIG NO)
else()

Then I built the project (suitesparse-metis-for-windows) with MKL and CUDA:

cmake -Bbuild_mkl -DWITH_CUDA=ON -DWITH_MKL=ON
cmake --build build_mkl --config Release --parallel

Finally, I used the compiled library to run cholmod_dl_demo (I changed very little code for an API compatibility issue). With useGPU=1, the demo runs well:

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          557 time   0.0000e+00
      GPU calls          224 time   0.0000e+00
GEMM  CPU calls          409 time   0.0000e+00
      GPU calls          224 time   0.0000e+00
POTRF CPU calls          156 time   0.0000e+00
      GPU calls            8 time   0.0000e+00
TRSM  CPU calls          156 time   0.0000e+00
      GPU calls            7 time   0.0000e+00
time in the BLAS: CPU   0.0000e+00 GPU   0.0000e+00 total:   0.0000e+00
assembly time   0.0000e+00    0.0000e+00
maxresid: 1.22567e-15 : All tests passed
OK maxresid:    1.2e-15 (double) : C:\Users\syby119\Desktop\nd6k.mtx

I really wonder whether the implementation differs between version 5.4.0 and version 7.4.0.

@syby119
Contributor Author

syby119 commented Dec 2, 2023

I just opened #559 that adds a runner for that configuration to the CI. All ctests have passed on prior tests on my fork on GitHub. I don't know if this particular demo runs as part of the ctests though.

I manually set cm->useGPU=1 in cholmod_dl_demo. All tests passed on my Windows machine with CUDA 12.0 + MKL 2024.0.

However, when I wanted to compare the performance of the CPU and GPU versions and used the recommended nd6k.mtx matrix, the error occurred.

@syby119
Contributor Author

syby119 commented Dec 2, 2023

Specify cm->useGPU=1 in cholmod_dl_demo and recompile the code. After that, run the RUN_TESTS target:

1>Test project C:/Users/syby119/github/SuiteSparse/build
1>      Start 29: CHOLMOD_int64_double_supernodal
1>      Start 31: CHOLMOD_int64_single_supernodal
1>      Start 54: LAGraph_SingleSourceShortestPath
1>      Start 44: LAGraph_IsEqual
1>      Start 67: LAGraph_vector
1>      Start 57: LAGraph_TriangleCount
1>      Start 46: LAGraph_MMRead
1>      Start 69: LAGraphX_AllKtruss
1>      Start 33: LAGraph_BreadthFirstSearch
1>      Start 84: LAGraphX_dnn
1>      Start 80: LAGraphX_SWrite
1>      Start 39: LAGraph_ConnectedComponents
1>      Start 75: LAGraphX_KTruss
1>      Start 55: LAGraph_Sort
1>      Start 83: LAGraphX_cdlp
1>      Start 87: LAGraphX_scc
1>      Start 41: LAGraph_DisplayGraph
1>      Start 66: LAGraph_minmax
1>      Start 56: LAGraph_SortByDegree
1>      Start 62: LAGraph_Xinit
1>      Start 64: LAGraph_export
1>      Start  1: Mongoose_IO_Test
1>      Start  4: Mongoose_Weighted_Edge_Separator_Test
1>      Start 36: LAGraph_Cached_NDiag
1>      Start  8: Mongoose_Performance_Test_2
1>      Start 37: LAGraph_Cached_SymmetricStructure
1>      Start 38: LAGraph_CheckGraph
1>      Start 50: LAGraph_New
1>      Start  3: Mongoose_Edge_Separator_Test_2
1>      Start 35: LAGraph_Cached_Degree
1>      Start 32: LAGraph_Betweenness
1>      Start  2: Mongoose_Edge_Separator_Test
1> 1/87 Test #32: LAGraph_Betweenness .....................   Passed    0.44 sec
1>      Start 60: LAGraph_Vector_Structure
1> 2/87 Test #62: LAGraph_Xinit ...........................   Passed    0.97 sec
1>      Start  7: Mongoose_Performance_Test
1> 3/87 Test #38: LAGraph_CheckGraph ......................   Passed    0.87 sec
1>      Start 48: LAGraph_Matrix_Structure
1> 4/87 Test #50: LAGraph_New .............................   Passed    0.89 sec
1>      Start  6: Mongoose_Memory_Test
1> 5/87 Test #35: LAGraph_Cached_Degree ...................   Passed    0.91 sec
1>      Start 40: LAGraph_DeleteCached
1> 6/87 Test #41: LAGraph_DisplayGraph ....................   Passed    1.39 sec
1>      Start  5: Mongoose_Target_Split_Test
1> 7/87 Test #36: LAGraph_Cached_NDiag ....................   Passed    1.24 sec
1>      Start 70: LAGraphX_BF
1> 8/87 Test #37: LAGraph_Cached_SymmetricStructure .......   Passed    1.29 sec
1>      Start 53: LAGraph_SampleDegree
1> 9/87 Test #56: LAGraph_SortByDegree ....................   Passed    1.47 sec
1>      Start 34: LAGraph_Cached_AT
1>10/87 Test  #8: Mongoose_Performance_Test_2 .............   Passed    1.49 sec
1>      Start 82: LAGraphX_TriangleCentrality
1>11/87 Test #60: LAGraph_Vector_Structure ................   Passed    0.90 sec
1>      Start 59: LAGraph_Vector_Print
1>12/87 Test #66: LAGraph_minmax ..........................   Passed    1.88 sec
1>      Start 45: LAGraph_KindName
1>13/87 Test #48: LAGraph_Matrix_Structure ................   Passed    0.84 sec
1>      Start 68: LAGraphX_AllKCore
1>14/87 Test #53: LAGraph_SampleDegree ....................   Passed    0.58 sec
1>      Start 85: LAGraphX_lcc
1>15/87 Test #34: LAGraph_Cached_AT .......................   Passed    0.69 sec
1>      Start 79: LAGraphX_SSaveSet
1>16/87 Test #64: LAGraph_export ..........................   Passed    2.26 sec
1>      Start 74: LAGraphX_KCoreDecompose
1>17/87 Test #40: LAGraph_DeleteCached ....................   Passed    1.12 sec
1>      Start 58: LAGraph_Type
1>18/87 Test #87: LAGraphX_scc ............................   Passed    2.57 sec
1>      Start 73: LAGraphX_KCore
1>19/87 Test #55: LAGraph_Sort ............................   Passed    2.67 sec
1>      Start 76: LAGraphX_MaximalIndependentSet
1>20/87 Test #59: LAGraph_Vector_Print ....................   Passed    0.91 sec
1>      Start 23: CHOLMOD_int64_single_can24
1>21/87 Test #45: LAGraph_KindName ........................   Passed    0.79 sec
1>      Start 71: LAGraphX_FastGraphletTransform
1>22/87 Test #68: LAGraphX_AllKCore .......................   Passed    0.82 sec
1>      Start 19: CHOLMOD_int64_single_lp_afiro
1>23/87 Test #74: LAGraphX_KCoreDecompose .................   Passed    0.64 sec
1>      Start 18: CHOLMOD_int32_single_lp_afiro
1>24/87 Test #82: LAGraphX_TriangleCentrality .............   Passed    1.48 sec
1>      Start 22: CHOLMOD_int32_single_can24
1>25/87 Test #85: LAGraphX_lcc ............................   Passed    1.14 sec
1>      Start 49: LAGraph_Multiply_size_t
1>26/87 Test #58: LAGraph_Type ............................   Passed    0.86 sec
1>      Start 52: LAGraph_PageRank
1>27/87 Test #70: LAGraphX_BF .............................   Passed    1.96 sec
1>      Start 86: LAGraphX_msf
1>28/87 Test #71: LAGraphX_FastGraphletTransform ..........   Passed    0.79 sec
1>      Start 28: CHOLMOD_int32_double_supernodal
1>29/87 Test #73: LAGraphX_KCore ..........................   Passed    1.03 sec
1>      Start 27: CHOLMOD_int64_single_complex
1>30/87 Test #79: LAGraphX_SSaveSet .......................   Passed    1.31 sec
1>      Start 51: LAGraph_NumThreads
1>31/87 Test #49: LAGraph_Multiply_size_t .................   Passed    0.47 sec
1>      Start 25: CHOLMOD_int64_double_complex
1>32/87 Test #52: LAGraph_PageRank ........................   Passed    0.41 sec
1>      Start 61: LAGraph_WallClockTime
1>33/87 Test #19: CHOLMOD_int64_single_lp_afiro ...........   Passed    1.07 sec
1>      Start 15: CHOLMOD_int64_single_bcsstk01
1>34/87 Test #23: CHOLMOD_int64_single_can24 ..............   Passed    1.24 sec
1>      Start 30: CHOLMOD_int32_single_supernodal
1>35/87 Test #18: CHOLMOD_int32_single_lp_afiro ...........   Passed    0.93 sec
1>      Start 16: CHOLMOD_int32_double_lp_afiro
1>36/87 Test #22: CHOLMOD_int32_single_can24 ..............   Passed    0.78 sec
1>      Start 26: CHOLMOD_int32_single_complex
1>37/87 Test #51: LAGraph_NumThreads ......................   Passed    0.36 sec
1>      Start 24: CHOLMOD_int32_double_complex
1>38/87 Test #83: LAGraphX_cdlp ...........................   Passed    4.11 sec
1>      Start 12: CHOLMOD_int32_double_bcsstk01
1>39/87 Test #76: LAGraphX_MaximalIndependentSet ..........   Passed    1.50 sec
1>      Start 72: LAGraphX_HelloWorld
1>40/87 Test  #1: Mongoose_IO_Test ........................   Passed    4.12 sec
1>      Start 78: LAGraphX_Random_Matrix
1>41/87 Test #61: LAGraph_WallClockTime ...................   Passed    0.73 sec
1>      Start 21: CHOLMOD_int64_double_can24
1>42/87 Test #28: CHOLMOD_int32_double_supernodal .........   Passed    1.00 sec
1>      Start 20: CHOLMOD_int32_double_can24
1>43/87 Test #27: CHOLMOD_int64_single_complex ............   Passed    0.92 sec
1>      Start 77: LAGraphX_Random
1>44/87 Test  #4: Mongoose_Weighted_Edge_Separator_Test ...   Passed    4.44 sec
1>      Start 47: LAGraph_Malloc
1>45/87 Test #86: LAGraphX_msf ............................   Passed    1.31 sec
1>      Start 81: LAGraphX_SquareClustering
1>46/87 Test #72: LAGraphX_HelloWorld .....................   Passed    0.61 sec
1>      Start 17: CHOLMOD_int64_double_lp_afiro
1>47/87 Test #78: LAGraphX_Random_Matrix ..................   Passed    0.52 sec
1>      Start 14: CHOLMOD_int32_single_bcsstk01
1>48/87 Test #77: LAGraphX_Random .........................   Passed    0.30 sec
1>      Start 42: LAGraph_Init
1>49/87 Test #47: LAGraph_Malloc ..........................   Passed    0.35 sec
1>      Start 13: CHOLMOD_int64_double_bcsstk01
1>50/87 Test #81: LAGraphX_SquareClustering ...............   Passed    0.32 sec
1>      Start 43: LAGraph_Init_errors
1>51/87 Test #25: CHOLMOD_int64_double_complex ............   Passed    1.37 sec
1>      Start  9: Mongoose_Unit_Test_IO
1>52/87 Test #16: CHOLMOD_int32_double_lp_afiro ...........   Passed    1.17 sec
1>      Start 11: Mongoose_Unit_Test_EdgeSep
1>53/87 Test #24: CHOLMOD_int32_double_complex ............   Passed    1.18 sec
1>      Start 65: LAGraph_fopen
1>54/87 Test #33: LAGraph_BreadthFirstSearch ..............   Passed    5.34 sec
1>      Start 63: LAGraph_acutest
1>55/87 Test #30: CHOLMOD_int32_single_supernodal .........   Passed    1.31 sec
1>      Start 10: Mongoose_Unit_Test_Graph
1>56/87 Test #42: LAGraph_Init ............................   Passed    0.49 sec
1>57/87 Test #26: CHOLMOD_int32_single_complex ............   Passed    1.35 sec
1>58/87 Test #43: LAGraph_Init_errors .....................   Passed    0.29 sec
1>59/87 Test #15: CHOLMOD_int64_single_bcsstk01 ...........   Passed    1.41 sec
1>60/87 Test #20: CHOLMOD_int32_double_can24 ..............   Passed    0.81 sec
1>61/87 Test #12: CHOLMOD_int32_double_bcsstk01 ...........   Passed    1.23 sec
1>62/87 Test #21: CHOLMOD_int64_double_can24 ..............   Passed    0.88 sec
1>63/87 Test #17: CHOLMOD_int64_double_lp_afiro ...........   Passed    0.58 sec
1>64/87 Test  #9: Mongoose_Unit_Test_IO ...................   Passed    0.25 sec
1>65/87 Test #14: CHOLMOD_int32_single_bcsstk01 ...........   Passed    0.51 sec
1>66/87 Test #11: Mongoose_Unit_Test_EdgeSep ..............   Passed    0.20 sec
1>67/87 Test  #5: Mongoose_Target_Split_Test ..............   Passed    3.91 sec
1>68/87 Test  #7: Mongoose_Performance_Test ...............   Passed    4.26 sec
1>69/87 Test  #3: Mongoose_Edge_Separator_Test_2 ..........   Passed    4.93 sec
1>70/87 Test #65: LAGraph_fopen ...........................   Passed    0.17 sec
1>71/87 Test #63: LAGraph_acutest .........................   Passed    0.14 sec
1>72/87 Test #10: Mongoose_Unit_Test_Graph ................   Passed    0.07 sec
1>73/87 Test #75: LAGraphX_KTruss .........................   Passed    5.40 sec
1>74/87 Test  #2: Mongoose_Edge_Separator_Test ............   Passed    4.90 sec
1>75/87 Test #84: LAGraphX_dnn ............................   Passed    5.56 sec
1>76/87 Test  #6: Mongoose_Memory_Test ....................   Passed    4.26 sec
1>77/87 Test #80: LAGraphX_SWrite .........................   Passed    5.71 sec
1>78/87 Test #39: LAGraph_ConnectedComponents .............   Passed    5.73 sec
1>79/87 Test #13: CHOLMOD_int64_double_bcsstk01 ...........   Passed    1.33 sec
1>80/87 Test #46: LAGraph_MMRead ..........................   Passed    7.40 sec
1>81/87 Test #69: LAGraphX_AllKtruss ......................   Passed    7.71 sec
1>82/87 Test #57: LAGraph_TriangleCount ...................   Passed    7.86 sec
1>83/87 Test #67: LAGraph_vector ..........................   Passed    8.73 sec
1>84/87 Test #44: LAGraph_IsEqual .........................   Passed    8.88 sec
1>85/87 Test #54: LAGraph_SingleSourceShortestPath ........   Passed    9.85 sec
1>86/87 Test #31: CHOLMOD_int64_single_supernodal .........   Passed   13.79 sec
1>87/87 Test #29: CHOLMOD_int64_double_supernodal .........   Passed   15.83 sec
1>
1>100% tests passed, 0 tests failed out of 87
1>
1>Total Test time (real) =  15.85 sec

@DrTimothyAldenDavis
Owner

The CMake tests for CHOLMOD are very short, and do not exercise the GPU. So the fact that they passed doesn't have any impact on how the GPU is working.

@DrTimothyAldenDavis
Owner

The analyze phase is independent of the data type (single or double), so it can be used by a subsequent single or double factorization, regardless of the type of the input matrix. Only double is done on the GPU, but the analyze phase just looks at cm->useGPU, not the type.

I can revise my rules so that the analysis for the GPU is only done if the input matrix has a dtype of CHOLMOD_DOUBLE.

The analyze phase has more work to do for the GPU factorization, but I'm not sure why it's that slow. I will double check to see what's happening.

@DrTimothyAldenDavis
Owner

Regarding the analyze time: I forgot but it's during the symbolic analysis time that I query the GPU to see how much memory it has. That requires a lot of time to "warm up" the GPU. It requires that the GPU be initialized, which takes a lot of time. If you did any subsequent calls to cholmod_analyze, you wouldn't see that run time.

For the case of doing symbolic analysis with a single-precision input matrix: I can simply turn off the GPU analysis in that case.

The remaining issue is just the nan, right?

It's not clear to me which cases give you a nan, and which ones work. Can you clarify? Is it just the case when using MSVC with Intel oneApi MKL 2024.0 + CUDA 12.0, and cm->useGPU=1 (for cholmod_dl_demo)?

And the same thing works fine with WSL, which is short for "Windows Subsystem for Linux"?

@syby119
Contributor Author

syby119 commented Dec 5, 2023

Regarding the analyze time: I forgot but it's during the symbolic analysis time that I query the GPU to see how much memory it has. That requires a lot of time to "warm up" the GPU. It requires that the GPU be initialized, which takes a lot of time. If you did any subsequent calls to cholmod_analyze, you wouldn't see that run time.

For the case of doing symbolic analysis with a single-precision input matrix: I can simply turn off the GPU analysis in that case.

The remaining issue is just the nan, right?

It's not clear to me which cases give you a nan, and which ones work. Can you clarify? Is it just the case when using MSVC with Intel oneApi MKL 2024.0 + CUDA 12.0, and cm->useGPU=1 (for cholmod_dl_demo)?

And the same thing works fine with WSL, which is short for "Windows Subsystem for Linux"?

Yes. I cloned the latest dev branch to confirm the result. My computer is Windows 11 with an Intel i5-13490F CPU and a GeForce RTX 2080 Ti GPU, and I installed WSL2 with Ubuntu 22.04 as a Linux environment. The CUDA library on my system is slightly different from the one installed when the issue was opened several days ago. Besides, I uncommented #define BLAS_TIMER to give more statistics. It seems that only Windows with cm->useGPU = 1 gives nan.

Windows
The backend libraries I used are Intel oneAPI MKL 2024.0 and CUDA 12.3. I ran cholmod_dl_demo while specifying cm->useGPU.

With cm->useGPU = 1

Factorizing A
CHOLMOD warning: matrix not positive definite. file: C:\Users\syby119\github\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086
cholmod error: file: C:\Users\syby119\github\SuiteSparse\CHOLMOD\Supernodal\t_cholmod_super_numeric_worker.c line: 1086 status: 1: matrix not positive definite

CHOLMOD factor:  L:  18000-by-18000 not positive definite (column 499)
  scalar types: int64_t, real, double
  supernodal, LL'.

analyze cputime:        8.7746
factor  cputime:         2.2276 mflop:  50235.7
solve   cputime:         0.0281 mflop:   5663.1
overall cputime:        11.0303 mflop:  10159.6
solve   cputime:         0.0244 mflop:   6533.4 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)):      nan      nan
residual      nan (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     0.0e+00


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls           36 time   1.4143e-03
      GPU calls            9 time   6.7528e-03
GEMM  CPU calls           36 time   3.5098e-03
      GPU calls            9 time   9.1391e-03
POTRF CPU calls           19 time   1.2481e-02
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls           18 time   2.0743e-03
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   1.9479e-02 GPU   1.5892e-02 total:   3.5371e-02
maxresid: nan : test failure

With cm->useGPU = 0

analyze cputime:        0.8032
factor  cputime:         0.6913 mflop: 161868.5
solve   cputime:         0.0298 mflop:   5334.0
overall cputime:         1.5244 mflop:  73515.1
solve   cputime:         0.0264 mflop:   6033.6 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 7.26e-16 7.46e-16
residual  1.7e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          857 time   2.1204e-01
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          683 time   1.0068e-01
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          175 time   1.3356e-01
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          174 time   4.8829e-02
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   4.9511e-01 GPU   0.0000e+00 total:   4.9511e-01
maxresid: 7.45537e-16 : All tests passed
OK maxresid:    7.5e-16 (double) : C:\Users\syby119\Desktop\nd6k.mtx

WSL
The backend libraries I used are Intel oneAPI MKL 2024.0 and CUDA 12.1. I ran cholmod_dl_demo while specifying cm->useGPU.

With cm->useGPU = 1

analyze cputime:        6.0416
factor  cputime:         2.1840 mflop:  51238.6
solve   cputime:         0.0206 mflop:   7711.5
overall cputime:         8.2462 mflop:  13589.7
solve   cputime:         0.0141 mflop:  11250.3 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 1.60e-15 1.80e-15
residual  1.5e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          630 time   4.5783e-02
      GPU calls          227 time   5.0226e-02
GEMM  CPU calls          471 time   1.5987e-02
      GPU calls          227 time   5.3743e-02
POTRF CPU calls          171 time   2.0983e-02
      GPU calls            4 time   2.1948e-01
TRSM  CPU calls          171 time   8.4232e-02
      GPU calls            3 time   2.2903e-02
time in the BLAS: CPU   1.6699e-01 GPU   3.4635e-01 total:   5.1333e-01
maxresid: 1.79995e-15 : All tests passed
OK maxresid:    1.8e-15 (double) : /home/yy/nd6k.mtx

With cm->useGPU = 0

analyze cputime:        0.6908
factor  cputime:         0.7115 mflop: 157283.7
solve   cputime:         0.0152 mflop:  10447.9
overall cputime:         1.4175 mflop:  79056.4
solve   cputime:         0.0132 mflop:  12045.4 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 8.27e-16 6.89e-16
residual  1.6e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          857 time   2.2203e-01
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          683 time   1.2190e-01
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          175 time   1.2969e-01
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          174 time   6.0546e-02
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   5.3417e-01 GPU   0.0000e+00 total:   5.3417e-01
maxresid: 8.27333e-16 : All tests passed
OK maxresid:    8.3e-16 (double) : /home/yy/nd6k.mtx

@DrTimothyAldenDavis
Owner

Thanks for the details. For the Windows case, do you use MSVC or MINGW?

This will be difficult for me to replicate; I don't use Windows at all, myself, so the support I can provide for Windows is very limited.

The CUDA version you have is very new (12.3). Perhaps an older version might be OK. I see you have 12.1 on Linux and 12.3 on Windows.

@syby119
Contributor Author

syby119 commented Dec 5, 2023

Thanks for the details. For the Windows case, do you use MSVC or MINGW?

This will be difficult for me to replicate; I don't use Windows at all, myself, so the support I can provide for Windows is very limited.

The CUDA version you have is very new (12.3). Perhaps an older version might be OK. I see you have 12.1 on Linux and 12.3 on Windows.

OK, I’ll give CUDA 11.7 a try.

If you want me to dump the intermediate results, I can send them to you by e-mail.

// Uncomment this line to get a long dump as a text file (blas_dump.txt), that
// records each call to the BLAS, for development diagnostics only:
// #define BLAS_DUMP

If someone would like to reproduce the issue on Windows, here are the steps to set up the environment for the project.

The following software is assumed to be installed previously.

The test matrix data is extracted to

C:\Users\syby119\Desktop\nd6k.mtx

The following shell commands are assumed to be executed in the same cmd window rather than PowerShell. The local path where the repositories are cloned is C:\Users\syby119\github

Step 1: install gmp / mpfr with vcpkg

cd C:\Users\syby119\github
git clone --depth=1 https://github.com/microsoft/vcpkg.git
cd vcpkg
bootstrap-vcpkg.bat
vcpkg install gmp mpfr

Step 2: Get SuiteSparse

cd C:\Users\syby119\github
git clone --depth=1 https://github.com/DrTimothyAldenDavis/SuiteSparse.git

Step 3: Activate oneApi environment

cd C:\Program Files (x86)\Intel\oneAPI
setvars.bat

Step 4: Configure and build the project

cd C:\Users\syby119\github\SuiteSparse
cmake -Bbuild . -DNFORTRAN=ON -DCMAKE_TOOLCHAIN_FILE="C:\Users\syby119\github\vcpkg\scripts\buildsystems\vcpkg.cmake"
cmake --build build --config Release --parallel

Step 5: Copy the dll

cd build/CHOLMOD/Release
copy ..\..\AMD\Release\amd.dll .
copy ..\..\CAMD\Release\camd.dll .
copy ..\..\CCOLAMD\Release\ccolamd.dll .
copy ..\..\COLAMD\Release\colamd.dll .
copy ..\..\SuiteSparse_config\Release\suitesparseconfig.dll .

Step 6: Run the demo

set CHOLMOD_USE_GPU=1
cholmod_dl_demo.exe C:\Users\syby119\Desktop\nd6k.mtx

@syby119
Contributor Author

syby119 commented Dec 5, 2023

I have tried CUDA 11.7, but the result is the same, i.e. cholmod_dl_demo.exe gives nan with GPU factorization.

I'm pretty satisfied with the speed of the CPU version, so I'll use the CPU version for now.

I hope some experienced Windows users can help debug the problem someday.

@DrTimothyAldenDavis
Owner

OK. I'll go ahead and release SuiteSparse without fixing this, and perhaps add a note about using the GPU on Windows.

@DrTimothyAldenDavis
Owner

Alternatively, I could just disable the GPU entirely when working on Windows with MSVC.

@DrTimothyAldenDavis
Owner

DrTimothyAldenDavis commented Dec 5, 2023

See f49b6b9
and 68d1071

I'll close this for now and leave it as a "TODO" in the code, to get CUDA working on Windows with MSVC. This isn't the first issue I've had, unfortunately. Sorry I wasn't able to get it to work.

@ttaneff

ttaneff commented Dec 20, 2023

I was struggling with the exact same issue between Intel MKL (2024.0) and CHOLMOD; however, I'm using Visual Studio 2019 (16.11.32, Win10) and CMake 3.28 with MSVC directly, without relying on the vcpkg toolchain. Reading through this, I included CUDA (12.1) in the mix and went deeper. I spent a few days debugging and I think I might have a working hypothesis, but I need some more benchmarking to be certain for my case, and maybe @syby119 can also test it on his build environment/toolchain.

At the moment my working theory is that the issue is caused by OpenMP and not by CUDA/MKL directly. As soon as I disable OpenMP in the C/C++ language extensions and override the 'NOT MSVC' clause from 68d1071, the tests pass and the GPU is utilized. I've tried both the MSVC and Intel C/C++ compilers and the behavior is the same as described by @syby119. With OpenMP disabled and CUDA enabled, the tests pass (GPU calls verified using #define BLAS_TIMER).

What I'm investigating:
I started enabling/disabling each '#pragma omp' section in t_cholmod_gpu.c and identified that, for the nested for loops, if the iterators are not explicitly declared private() in the omp directive, the factorization fails; when I explicitly declare them private in the OMP directive, the code executes correctly. I've managed to confirm this only for 'gpu_updateC', but I expect to run through all the gpu_xxx functions in t_cholmod_gpu.c in a day or two.
I'm not that familiar with the OpenMP implementations in other compilers, but it seems that Intel/MSVC treat this case differently and the iterator variables may get reassigned between threads. I even made a hard-coded test setting nthreads=1 and the tests passed. Maybe someone more knowledgeable about the internals can elaborate on this; a minimal illustration of the pattern follows below.
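A minimal standalone illustration of the suspected pattern (hypothetical code, not taken from t_cholmod_gpu.c): the index of the outer parallel loop is implicitly private, but an inner-loop index declared at function scope is shared by default, so concurrent threads clobber it unless it is declared private (or moved inside the loop body):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main (void)
{
    double x [N] ;
    int i, j ;                  // j at function scope: shared by default

    // Buggy pattern: i is implicitly private (it is the loop variable of the
    // parallel for), but j is shared, so threads race on it.
    #pragma omp parallel for
    for (i = 0 ; i < N ; i++)
    {
        x [i] = 0 ;
        for (j = 0 ; j < 10 ; j++) x [i] += 1 ;     // race on j
    }

    // Fixed pattern: declare j private in the directive (or inside the loop body).
    #pragma omp parallel for private (j)
    for (i = 0 ; i < N ; i++)
    {
        x [i] = 0 ;
        for (j = 0 ; j < 10 ; j++) x [i] += 1 ;
    }

    printf ("x [0] = %g (expected 10)\n", x [0]) ;
    return (0) ;
}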

P.S. What triggered me to stumble upon this thread initially: it all started from a strange issue I had when trying to link everything statically (including the BLAS); needless to say, mixing Intel's and Microsoft's OpenMP implementations is a recipe for disaster. As a workaround I excluded Microsoft's OpenMP implementation (/NODEFAULTLIB:vcomp.lib), which forced the application to link against symbols from libiomp5md.lib and load only libiomp5md.dll, and factorizations ran OK. Otherwise I got both vcomp140.dll and libiomp5md.dll loaded in the debugger and factorizations failed. The other workaround was to disable OpenMP in the compiler entirely and use TBB/OMP from oneAPI MKL only for the internal BLAS parallelization, de facto leaving the CHOLMOD code serial on the MSVC side.

@DrTimothyAldenDavis
Owner

Thanks for the update. It's been a difficult problem to tackle since I don't have access to a Windows machine with CUDA. I'll take a look at this, perhaps for a SuiteSparse 7.4.1 release (I'm about to release a stable SuiteSparse 7.4.0).

I'll reopen this issue.

@DrTimothyAldenDavis added the labels "bug (Something isn't working)" and "external bug (porting issue, or problem with external library or system)" on Dec 22, 2023
@ttaneff

ttaneff commented Dec 23, 2023

I can confirm that declaring the local function variables as private in t_cholmod_gpu.c produces the correct behavior with the MSVC toolchain. I used these override directives for each of the functions during testing (I hope the short names are indicative of the respective template function names where they've been applied):

#define PRIVOMP_GPU_CLEAR private(i)
#define PRIVOMP_GPU_REORDERDESC private(k)
#define PRIVOMP_GPU_UPDATEC private(irow, icol)
#define PRIVOMP_GPU_FINASSEM private (i, j, iidx)
#define PRIVOMP_GPU_TRISOLVE private(i, j, iidx)
#define PRIVOMP_GPU_COPYSUPERNODE private(i, j, iidx)

// tl;dr
I did a lot of debugging and found out that something goes wrong ("matrix is not positive definite" is raised) as soon as nthreads > 1 for a single supernode and OMP parallelization actually kicks in. The loop variables are declared at function scope (hence on the stack), so they are shared between the OpenMP regions and the local scope, which might be the reason for the observed behavior on MSVC/Intel C++. As I understand it, declaring them explicitly as private ensures that they are not modified by concurrent threads, messing up the intended algorithm. The OpenMP specification makes the iteration variable of the loop associated with the directive implicitly private, but inner-loop indices declared at function scope are shared by default, and that seems to be what bites here.
//tl;dr

Regarding @syby119's observation about the slow Analyze phase, I've added some modifications to the BLAS_TIMER sections and timed the allocations during initialization (which actually happens in the supersymbolic stage of the matrix structure analysis).

Here are my results on [email protected]/32GB & RTX 3070Ti/8GB:

  • With useGPU = 1 (GPU enabled and timing of gpuAllocation on my machine)
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze cputime:        2.6667
factor  cputime:         0.8014 mflop: 139632.6
solve   cputime:         0.0149 mflop:  10711.9
overall cputime:         3.4830 mflop:  32174.7
solve   cputime:         0.0142 mflop:  11206.9 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 1.60e-15 1.71e-15
residual  1.4e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          643 time   1.3013e-02
      GPU calls          214 time   1.7242e-02
GEMM  CPU calls          482 time   5.5722e-03
      GPU calls          214 time   2.1085e-03
POTRF CPU calls          171 time   5.2918e-03
      GPU calls            4 time   2.0663e-01
TRSM  CPU calls          171 time   1.4789e-02
      GPU calls            3 time   3.3531e-02
time in the BLAS: CPU   3.8666e-02 GPU   2.5951e-01 total:   2.9818e-01
       NOTE: Additional penalty from GPU.mallocXXX():    1.0377e+00
             ( +time spent in Analyze stage during GPU initialization )
maxresid: 1.71081e-15 : All tests passed
OK maxresid:    1.7e-15 (double) : nd6k.mtx
  • with useGPU = 0 (GPU disabled)
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze cputime:        0.6528
factor  cputime:         0.5139 mflop: 217764.1
solve   cputime:         0.0167 mflop:   9540.8
overall cputime:         1.1833 mflop:  94702.2
solve   cputime:         0.0152 mflop:  10451.2 (100 trials)
solve2  cputime:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 8.51e-16 7.37e-16
residual  1.6e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          857 time   1.4967e-01
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          683 time   7.4413e-02
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          175 time   9.6482e-02
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          174 time   3.8370e-02
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   3.5894e-01 GPU   0.0000e+00 total:   3.5894e-01
       NOTE: Additional penalty from GPU.mallocXXX():    0.0000e+00
             ( +time spent in Analyze stage during GPU initialization )
maxresid: 8.50568e-16 : All tests passed
OK maxresid:    8.5e-16 (double) : nd6k.mtx

I cannot be 100% certain whether there is some other performance hit on my machine when using the GPU, but the CPU performs much better for nd6k; using pwtk produces even less gain, as that matrix contains smaller supernodes and doesn't utilize the GPU at all. And I don't have sufficient resources to run audikw_1...

// tl;dr:
The "penalty" (as described thoroughly in the comments) comes from the allocation of host and GPU resources (gpu_allocate(), which is called on row 305 in super_symbolic2, i.e. the Analyze stage when the supernodal method is used). I don't have the time to modify the demo to perform multiple factorizations with numerical updates of the reordered matrix, but this would give a better benchmark comparison between CPU and CPU+GPU offload runs. Also, the conditions for the minimum size of a supernode as a candidate for GPU offload are a factor (hence the performance is heavily dependent on the nonzero structure). Not to mention that an RTX 3070 has roughly 0.5x the FP64 performance of an A4000 card; after the holidays I might run this on one of the office machines, which has an A4000 and Win10.
// tl;dr:

P.S. As I'm in the opposite situation from @DrTimothyAldenDavis, I cannot currently test this in a Linux environment (I have a CentOS VM, but it doesn't support GPU passthrough and CUDA). I'll see if I have the time to build a physical Linux environment, do more comparisons on the same machine, and see whether the OS/MSVC has something to do with the CUDA performance.

@DrTimothyAldenDavis
Owner

Thanks for the update!

I see where the problem is. Yes, those variables should be private. The OpenMP loops need to be rewritten so that they declare their variables internal to the loops themselves. Those loops were written a while back and they use an older style. It's clearer, I think, to rewrite them so that the private clause isn't needed at all.

For example, rather than declaring iidx at the top of the code, it's clearer to do something like

#pragma omp parallel for ...
for (j = ... )
{
    int64_t iidx = whatever ;
}

instead of

int64_t iidx ;
#pragma omp parallel for ... private (iidx)
for (j = ... )
{
     iidx = whatever ;
}

I'll revise the code, re-enable the GPU with MSVC, and post an update for you to try.

Thanks again!

@DrTimothyAldenDavis
Owner

I will also revise the demos so they factor the matrix twice, to get a different timing result.

@DrTimothyAldenDavis
Owner

Thanks for your feedback. I think I've fixed this issue in my latest commit to dev2: f7a8349

I will post an update in SuiteSparse 7.4.0.beta9 shortly, or you can try the dev2 branch.

@DrTimothyAldenDavis added the label "probably fixed now but need confirmation" and removed the label "external bug (porting issue, or problem with external library or system)" on Dec 23, 2023
@syby119
Contributor Author

syby119 commented Dec 23, 2023

Thanks for your feedback. I think I've fixed this issue in my latest commit to dev2: f7a8349

I will post an update in SuiteSparse 7.4.0.beta9 shortly, or you can try the dev2 branch.

Great, I have tried the cholmod_dl_demo on dev2 and it works!

---------------------------------- cholmod_dl_demo:
cholmod version 5.1.0
SuiteSparse version 7.4.0
AMD version 3.3.0
COLAMD version 3.3.0
CAMD version 3.3.0
CCOLAMD version 3.3.0
norm (A,inf) = 203.333
norm (A,1)   = 203.333
CHOLMOD sparse:  A:  18000-by-18000, nz 3457658, upper.  OK
CHOLMOD dense:   B:  18000-by-1,   OK
bnorm 1.99994

=== Overall Trial: 0
Analyze: flop 1.11904e+11 lnz 3.97732e+07, time: 4.68379
factor trial: 0, Factorizing A, time: 4.15804
factor trial: 1, Factorizing A, time: 0.616345
factor trial: 2, Factorizing A, time: 0.630749
CHOLMOD factor:  L:  18000-by-18000  supernodal, LL'.  nz 41235423  OK
nmethods: 1
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze time:        4.6838
factor  time:         0.6307 mflop: 177414.6
solve   time:         0.0264 mflop:   6023.3
overall time:         5.3410 mflop:  20981.9
solve   time:         0.0250 mflop:   6355.3 (100 trials)
solve2  time:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 1.67e-15 1.78e-15
residual  1.3e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          638 time   2.8018e-02
      GPU calls          219 time   3.3463e-02
GEMM  CPU calls          478 time   1.5197e-02
      GPU calls          219 time   2.8133e-03
POTRF CPU calls          171 time   6.2970e-03
      GPU calls            4 time   2.1722e-01
TRSM  CPU calls          171 time   2.6038e-02
      GPU calls            3 time   2.3337e-02
time in the BLAS: CPU   7.5550e-02 GPU   2.7684e-01 total:   3.5238e-01

=== Overall Trial: 1
Analyze: flop 1.11904e+11 lnz 3.97732e+07, time: 0.904241
factor trial: 0, Factorizing A, time: 0.692288
factor trial: 1, Factorizing A, time: 0.626583
factor trial: 2, Factorizing A, time: 0.617987
CHOLMOD factor:  L:  18000-by-18000  supernodal, LL'.  nz 41235423  OK
nmethods: 1
Ordering: AMD     fl/lnz     3911.5  lnz/anz       14.8
Ordering: METIS   fl/lnz     2813.6  lnz/anz       11.5
ints in L:          218424, reals in L:        56630095
factor flops 1.11904e+11 nnz(L)        39773188 (w/no amalgamation)
nnz(A*A'):         3457658
flops / nnz(L):    2813.6
nnz(L) / nnz(A):     11.5
analyze time:        0.9042
factor  time:         0.6180 mflop: 181078.6
solve   time:         0.0276 mflop:   5771.7
overall time:         1.5498 mflop:  72308.6
solve   time:         0.0257 mflop:   6197.5 (100 trials)
solve2  time:         0.0000 mflop:      0.0 (100 trials)
peak memory usage:          674 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 1.80e-15 1.80e-15
residual  1.6e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond     3.8e-04


CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          639 time   2.6864e-02
      GPU calls          218 time   3.3522e-02
GEMM  CPU calls          479 time   1.4980e-02
      GPU calls          218 time   2.6991e-03
POTRF CPU calls          171 time   6.2622e-03
      GPU calls            4 time   2.1913e-01
TRSM  CPU calls          171 time   2.6476e-02
      GPU calls            3 time   2.2212e-02
time in the BLAS: CPU   7.4582e-02 GPU   2.7757e-01 total:   3.5215e-01
maxresid: 1.79625e-15 : All tests passed
OK maxresid:    1.8e-15 (double) : C:\Users\syby119\Desktop\nd6k.mtx

Don't forget to enable CUDA support with MSVC in all copies of SuiteSparsePolicy.cmake:

option ( SUITESPARSE_USE_CUDA "ON (default): enable CUDA acceleration for SuiteSparse, OFF: do not use CUDA" ON )

instead of

if ( MSVC )
    option ( SUITESPARSE_USE_CUDA "ON: enable CUDA acceleration for SuiteSparse, OFF (default): do not use CUDA" OFF )
    message ( STATUS "CUDA on MSVC is under development and may not be fully functional (it is disabled by default)" )
else ( )
    option ( SUITESPARSE_USE_CUDA "ON (default): enable CUDA acceleration for SuiteSparse, OFF: do not use CUDA" ON )
endif ( )

@syby119
Contributor Author

syby119 commented Dec 23, 2023

I found that Nsight Systems is really a great profiling tool for this program. From the timeline I can see that the CUDA API call cudaMallocHost takes most of the time. It seems that the demo needs to allocate pinned memory for CPU-to-GPU data transfer once at the beginning of all the calculations.

Here are the percentages of the CUDA API calls; the kernel execution time is not fully included and can be inferred from the CUDA synchronize calls.

|Time (%)|Total Time|Num Calls|Avg       |Name                                         |
|----    |----      |----     |----      |----                                         |
|73.4%|12.340 s  |1        |12.340 s  |cudaMallocHost                               |
|8.4% |1.407 s   |396      |3.554 ms  |cudaEventSynchronize                         |
|7.9% |1.332 s   |258      |5.164 ms  |cudaStreamSynchronize                        |
|4.7% |788.269 ms|1        |788.269 ms|cudaFreeHost                                 |
|2.4% |401.617 ms|27       |14.875 ms |cudaFree                                     |
|1.4% |227.464 ms|330      |689.283 μs|cudaDeviceSynchronize                        |
|0.6% |99.367 ms |9199     |10.802 μs |cudaLaunchKernel                             |
|0.5% |86.248 ms |1        |86.248 ms |cudaMemGetInfo                               |
|0.2% |33.478 ms |1624     |20.614 μs |cudaMemcpyAsync                              |
|0.2% |38.395 ms |20       |1.920 ms  |cudaMalloc                                   |
|0.1% |12.763 ms |810      |15.757 μs |cudaMemcpy2DAsync                            |
|0.1% |14.628 ms |300      |48.758 μs |cudaMemset                                   |
|0.0% |2.413 μs  |3        |804 ns    |cuDeviceGetLuid                              |
|0.0% |3.347 μs  |3        |1.115 μs  |cuModuleGetLoadingMode                       |
|0.0% |4.917 μs  |2        |2.458 μs  |cuInit                                       |
|0.0% |5.565 μs  |2        |2.782 μs  |cudaGetDeviceProperties_v2_v12000            |
|0.0% |157.914 μs|834      |189 ns    |cuGetProcAddress_v2                          |
|0.0% |81.164 μs |48       |1.690 μs  |cudaStreamDestroy                            |
|0.0% |100.724 μs|180      |559 ns    |cudaEventDestroy                             |
|0.0% |11.390 μs |1        |11.390 μs |cuCtxSynchronize                             |
|0.0% |13.338 μs |2        |6.669 μs  |cudaGetDriverEntryPoint_v11030               |
|0.0% |3.384 ms  |1756     |1.927 μs  |cudaStreamWaitEvent                          |
|0.0% |4.631 ms  |3938     |1.176 μs  |cudaEventRecord                              |
|0.0% |872.779 μs|1448     |602 ns    |cudaOccupancyMaxActiveBlocksPerMultiprocessor|
|0.0% |167.003 μs|48       |3.479 μs  |cudaStreamCreate                             |
|0.0% |4.209 ms  |2189     |1.923 μs  |cudaEventQuery                               |
|0.0% |186.108 μs|180      |1.033 μs  |cudaEventCreateWithFlags                     |
|0.0% |1.840 ms  |4949     |371 ns    |cudaStreamGetCaptureInfo_v2_v11030           |
|0.0% |2.450 ms  |307      |7.979 μs  |cudaMemsetAsync                              |
|0.0% |1.221 ms  |6        |203.450 μs|cudaMemcpy                                   |

@syby119
Contributor Author

syby119 commented Dec 23, 2023

I would appreciate it if CHOLMOD could support single-precision computation on the GPU, since the NVIDIA GeForce series has very few FP64 cores compared to FP32 cores. Besides, single-precision data storage consumes less memory bandwidth and improves GPU caching performance in L1/L2 and shared memory.

@DrTimothyAldenDavis
Owner

Great!

For the single precision support on the GPU: yes, I agree that would be useful. I can't add it now though; it will have to wait for a future enhancement. (That would be best posted as a separate issue, since I'll close this soon, when 7.4.0 looks stable for this issue).

The cudaMallocHost is a known thing. It has to do a large malloc to pin memory on the host, and that is costly. It only needs to be done once for the entire application. It's done on the first call to cholmod_l_analyze when using the GPU.
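The one-time cost is easy to see in isolation (a hypothetical sketch, not CHOLMOD code; the 1 GB size is arbitrary):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

int main (void)
{
    void *pinned = NULL ;
    size_t bytes = (size_t) 1 << 30 ;       // 1 GB of pinned (page-locked) host memory
    clock_t t = clock () ;
    cudaError_t err = cudaMallocHost (&pinned, bytes) ;     // slow: pins host pages
    double sec = (double) (clock () - t) / CLOCKS_PER_SEC ;
    printf ("cudaMallocHost: %s, %.3f sec\n", cudaGetErrorString (err), sec) ;
    cudaFreeHost (pinned) ;                 // freeing pinned memory is also relatively slow
    return (0) ;
}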

@DrTimothyAldenDavis
Owner

CUDA will be enabled by default in SuiteSparse 7.4.0 on MSVC. I'll add a caveat to the top-level README file, however.

Thanks again for your help in tracking this down, @ttaneff and @syby119 .

@DrTimothyAldenDavis
Owner

Fixed in SuiteSparse 7.4.0
