Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

Closed
RMeli opened this issue Aug 11, 2023 · 5 comments
Closed

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

RMeli opened this issue Aug 11, 2023 · 5 comments
Assignees
Labels
Type:Bug Something isn't working

Comments

@RMeli
Copy link
Member

RMeli commented Aug 11, 2023

DLA-Future's tridiagonal eigensolver returns NaNs for some matrices obtained from CP2K. This appears to happen after the "bulkerification" of PRs #860 and #904.


Investigation

Observations in CP2K:

  • The issue persisted after heavy refactoring of CP2K/DLA-Future integration
    • Input matrix in CP2K and DLA-Future match (checked a bunch of elements)
  • For QS/H2O-32 and QS/H2O-64 results (total energy) appear to match ScaLAPACK
    • For some block sizes, some eigenvalues are not sorted in ascending order
  • For QS/H2O-128 nans are produced and CP2K crashes, for some block sizes
    • Block size of 256 and 512 crash
    • Block size of 1024 works (same total energy as ScaLAPACK)
  • The SCF step at which nans are produced appears to depend on the block size

DLA-Future investigation:

Further investigation is ongoing, but as discussed with @albestro and @msimberg it's good to keep track of this here.

Details

CP2K

H2O-32

ScaLAPACK:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00    0.4     2.82070305      -546.8547872795 -5.47E+02
     2 P_Mix/Diag. 0.40E+00    0.5     2.00626128      -548.3716222436 -1.52E+00
     3 P_Mix/Diag. 0.40E+00    0.5     1.20123309      -549.2863094720 -9.15E-01
     4 P_Mix/Diag. 0.40E+00    0.5     0.73381920      -549.8271179546 -5.41E-01
     5 P_Mix/Diag. 0.40E+00    0.5     0.44662463      -550.1492415452 -3.22E-01
     6 P_Mix/Diag. 0.40E+00    0.5     0.27176753      -550.3418499811 -1.93E-01
     7 P_Mix/Diag. 0.40E+00    0.5     0.16528526      -550.4572302348 -1.15E-01
     8 P_Mix/Diag. 0.40E+00    0.5     0.10047105      -550.5264082764 -6.92E-02
     9 P_Mix/Diag. 0.40E+00    0.5     0.06104581      -550.5679020528 -4.15E-02
    10 DIIS/Diag.  0.11E-03    1.0     0.03768831      -550.5927951629 -2.49E-02
    11 DIIS/Diag.  0.73E-05    0.5     0.00001951      -550.6301331620 -3.73E-02
    12 DIIS/Diag.  0.17E-05    0.6     0.00000433      -550.6301331629 -9.00E-10

  *** SCF run converged in    12 steps ***

DLA-Future:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00    1.1     2.82070305      -546.8547872795 -5.47E+02
     2 P_Mix/Diag. 0.40E+00    1.2     2.00626128      -548.3716222436 -1.52E+00
     3 P_Mix/Diag. 0.40E+00    1.2     1.20123309      -549.2863094720 -9.15E-01
     4 P_Mix/Diag. 0.40E+00    1.2     0.73381920      -549.8271179546 -5.41E-01
     5 P_Mix/Diag. 0.40E+00    1.2     0.44662463      -550.1492415452 -3.22E-01
     6 P_Mix/Diag. 0.40E+00    1.2     0.27176753      -550.3418499811 -1.93E-01
     7 P_Mix/Diag. 0.40E+00    1.2     0.16528526      -550.4572302348 -1.15E-01
     8 P_Mix/Diag. 0.40E+00    1.2     0.10047105      -550.5264082764 -6.92E-02
     9 P_Mix/Diag. 0.40E+00    1.2     0.06104581      -550.5679020528 -4.15E-02
    10 DIIS/Diag.  0.11E-03    1.2     0.03768831      -550.5927951629 -2.49E-02
    11 DIIS/Diag.  0.73E-05    1.2     0.00001951      -550.6301331620 -3.73E-02
    12 DIIS/Diag.  0.17E-05    1.2     0.00000433      -550.6301331629 -8.99E-10

  *** SCF run converged in    12 steps ***

Inspecting the eigenvalues shows that some are not ordered in ascending order (cyclic shift):

>>> import numpy as np
>>> import matin
>>> import matout
# Check that there are no NaNs
>>> np.isnan(matin.a).sum()
0
>>> np.isnan(matout.evecs).sum()
0
>>> np.isnan(matout.evals).sum()
0
>>> matout.evecs[:10]
array([[ 1.49482e-01,  7.15696e-02, -4.77254e-04, ...,  7.07977e-06,
        -7.22529e-04,  4.34057e-04],
       [ 5.39234e-02,  2.46975e-02,  1.82066e-03, ..., -2.36429e-01,
        -1.80969e-01,  2.02647e-02],
       [-9.58799e-02, -4.43881e-02, -3.39398e-03, ..., -2.77497e-02,
        -2.07821e-02,  2.28366e-03],
       ...,
       [-4.52588e-02, -2.20263e-02, -9.55146e-03, ..., -4.00613e-03,
        -4.45791e-03,  1.78821e-03],
       [ 4.38312e-03,  1.98188e-03,  7.23746e-04, ..., -1.39408e-03,
        -2.72728e-03, -2.21272e-04],
       [ 2.05750e-02,  9.72287e-03,  1.94750e-03, ...,  1.30918e-04,
        -3.58272e-04,  1.87918e-03]], dtype=float32)
>>> matout.evals[:10]
array([[-0.838936],
       [-0.83703 ],
       [-0.834113],
       [-0.832972],
       [-0.816725],
       [-0.815055],
       [-0.81234 ],
       [-0.810887],
       [-0.808261],
       [-0.807243]], dtype=float32)
# Check that all elements the input evecs and evals have been changed
>>> np.isclose(matout.evals, -42.0).sum()
0
>>> np.isclose(matout.evecs, -23.0).sum()
0
>>> matout.evals[matout.evals < -10]
array([], dtype=float32)
>>> matout.evecs[matout.evecs < -10]
array([], dtype=float32)
# Use NumPy eigensolver
>>> e, ee = np.linalg.eigh(matin.a)
>>> e[:10]
array([-0.8389363 , -0.83703023, -0.83411264, -0.832972  , -0.81672525,
       -0.8150545 , -0.8123403 , -0.8108873 , -0.8082608 , -0.8072428 ],
      dtype=float32)
>>> np.allclose(matout.evals.squeeze(-1), e)
False
# Check differences
>>> diff = np.abs(matout.evals.squeeze(-1) - e)
>>> matout.evals.squeeze(-1)[diff > 1e-3]
array([4.83699, 4.86062, 4.88254, 4.81451], dtype=float32)
>>> e[diff > 1e-3]
array([4.814508 , 4.8369913, 4.860616 , 4.8825364], dtype=float32)
>>> np.where(diff > 0.001)
(array([1070, 1071, 1072, 1073]),)
>>> e[1069:1075]
array([4.8128195, 4.814508 , 4.8369913, 4.860616 , 4.8825364, 4.9199357],
      dtype=float32)
>>> matout.evals[1069:1075]
array([[4.81282],
       [4.83699],
       [4.86062],
       [4.88254],
       [4.81451],
       [4.91994]], dtype=float32)

H2O-64

ScaLAPACK:

12 DIIS/Diag.	0.76E-05	3.2	0.00000704	-1101.2367103259	-8.93E-09

DLA-Future (time missing from my notes, not relevant here):

12 DIIS/Diag.	0.76E-05		0.00000704	-1101.2367103259	-8.93E-09

The behavior is different depending on the block size.

Block size of 64:

>>> import numpy as np
>>> import matin_0
>>> import matout_0
# Check that there are no NaNs
>>> np.isnan(matin_0.a).sum()
0
>>> np.isnan(matout_0.evecs).sum()
0
>>> np.isnan(matout_0.evals).sum()
0
# Solve with NumPy
>>> e, ee = np.linalg.eigh(matin_0.a)
>>> e[:10]
array([-0.8325389 , -0.83078194, -0.8301129 , -0.828938  , -0.8274322 ,
       -0.8256345 , -0.8236141 , -0.8221858 , -0.81873846, -0.8182086 ],
      dtype=float32)
>>> matout_0.evals[:10]
array([[-0.832539],
       [-0.830782],
       [-0.830113],
       [-0.828938],
       [-0.827432],
       [-0.825635],
       [-0.823614],
       [-0.822186],
       [-0.818738],
       [-0.818209]], dtype=float32)
# Check difference with NumPy
>>> diff = np.abs(matout_0.evals.squeeze(-1) - e)
>>> matout_0.evals.squeeze(-1)[diff > 1e-3]
array([], dtype=float32)
>>> matout_0.evals.squeeze(-1)[diff > 1e-4]
array([], dtype=float32)
>>> matout_0.evals.squeeze(-1)[diff > 1e-5]
array([10.5684, 12.7143, 12.784 , 12.7919, 12.8208, 12.8527, 12.8769,
       12.9043, 12.9286, 12.9424, 12.9458, 12.981 , 13.0004, 13.0243,
       13.0349, 13.0531, 13.0781, 13.0982, 13.1076, 13.1336, 13.143 ,
       13.1551, 13.1633, 13.2074, 13.213 , 13.2453, 13.2615, 13.289 ,
       13.2937, 13.298 , 13.3325, 13.3397, 13.3659, 13.3818, 13.3879,
       13.449 , 13.483 , 13.5031, 13.5312, 13.5696, 13.5928, 13.6196,
       13.6795, 13.6988, 13.727 , 13.7304, 13.7532, 13.8624, 13.9351,
       13.9837, 14.2832, 15.2062], dtype=float32)
>>> np.where(diff > 0.001)
(array([], dtype=int64),)

Differences are small. Calculations in DLA-Future are performed in double precision, but the output is in single precision.

Block size of 128:

>>> import numpy as np
>>> import matout_0 as matout
>>> import matin_0 as matin
>>> e, ee = np.linalg.eigh(matin.a)
>>> diff = np.abs(matout.evals.squeeze(-1) - e)
>>> matout.evals.squeeze(-1)[diff > 1e-3]
array([3.12042, 3.12242, 3.14158, 3.14504, 3.11545], dtype=float32)
>>> e[diff > 1e-3]
array([3.11539  , 3.1204178, 3.1224189, 3.1415825, 3.1450453],
      dtype=float32)
>>> np.isnan(matout.a).sum()
0
>>> np.isnan(matout.evecs).sum()
5603840
>>> np.isnan(matout.evals).sum()
0
>>> np.where(diff > 0.001)
(array([1671, 1672, 1673, 1674, 1675]),)
>>> e[1670:1677]
array([3.1043088, 3.11539  , 3.1204178, 3.1224189, 3.1415825, 3.1450453,
       3.1483176], dtype=float32)
>>> matout.evals[1670:1677]
array([[3.10431],
       [3.12042],
       [3.12242],
       [3.14158],
       [3.14504],
       [3.11545],
       [3.14832]], dtype=float32)

Same cyclic shift observed for H2O-32.

H2O-128

ScaLAPACK:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   24.0     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   25.7     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   25.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   24.6     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   25.7     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   24.8     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   24.8     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   25.8     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   25.0     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   26.2     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   25.9     0.00001917     -2202.5836419845 -1.63E-08
    12 DIIS/Diag.  0.20E-04   24.9     0.00001266     -2202.5836419832  1.29E-09
    13 DIIS/Diag.  0.12E-05   26.0     0.00000152     -2202.5836419863 -3.09E-09

  *** SCF run converged in    13 steps ***

DLA-Future:

Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   15.8            NaN     -2188.1751193349 -2.19E+03

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/                KS energy is an abnormal value (NaN/Inf).             *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                                     qs_ks_methods.F:889 *
 *******************************************************************************

Increasing the block size from 256 to 1024 appears to work:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   41.5     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   49.0     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   48.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   43.0     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   48.3     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   48.7     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   46.0     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   62.1     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   61.0     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   45.1     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   79.0     0.00001917     -2202.5836419845 -1.63E-08
    12 DIIS/Diag.  0.20E-04   53.5     0.00001266     -2202.5836419833  1.28E-09
    13 DIIS/Diag.  0.12E-05   48.0     0.00000152     -2202.5836419863 -3.08E-09

  *** SCF run converged in    13 steps ***

DLA-Future

Tridiagonal eigensolver miniapp

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5
HDF5 "td-evals.h5" {
GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): -nan,
      (0,1,0): -0.932222,
      (0,2,0): -0.931106,
      (0,3,0): -0.929937,
      (0,4,0): -0.929327,
      (0,5,0): -0.92884,
      (0,6,0): -0.928138,
      (0,7,0): -0.928094,
      (0,8,0): -0.927308,
      (0,9,0): -0.926688,
      (0,10,0): -0.926199,
      (0,11,0): -0.925902,
      (0,12,0): -0.925131,
      (0,13,0): -0.924653,
      (0,14,0): -0.924199,
      (0,15,0): -0.923768,
      (0,16,0): -0.923444,
      (0,17,0): -0.922861,
      (0,18,0): -0.922458,
      (0,19,0): -0.922242,
      (0,20,0): -0.921639,
      (0,21,0): -0.921327,
      (0,22,0): -0.921066,
      (0,23,0): -0.920946,
      (0,24,0): -0.92021,
      (0,25,0): -0.919626,
      (0,26,0): -0.919327,
      (0,27,0): -0.91917,
      (0,28,0): -0.918547,
      (0,29,0): -0.918441,
      (0,30,0): -0.917988,
      (0,31,0): -0.917847,
      (0,32,0): -0.917361,
      (0,33,0): -0.9172,
      (0,34,0): -0.916821,
      (0,35,0): -0.916179,
      (0,36,0): -0.915846,
      (0,37,0): -0.915483,
      (0,38,0): -nan,
      (0,39,0): -nan,
      (0,40,0): -nan,
      (0,41,0): -nan,
      (0,42,0): -nan,

Experiments with block sizes

Block size of 128:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --block-size 128
GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): nan,
      (0,1,0): nan,
      (0,2,0): nan,
      (0,3,0): nan,
      (0,4,0): nan,
      (0,5,0): nan,
      (0,6,0): nan,
      (0,7,0): nan,
      (0,8,0): nan,
      (0,9,0): nan,
      (0,10,0): nan,
      (0,11,0): nan,
      (0,12,0): nan,
      (0,13,0): nan,
      (0,14,0): nan,
      (0,15,0): nan,
      (0,16,0): nan,
      (0,17,0): nan,
      (0,18,0): nan,
      (0,19,0): nan,
      (0,20,0): nan,
      (0,21,0): nan,
      (0,22,0): nan,
      (0,23,0): nan,
      (0,24,0): nan,
      (0,25,0): nan,
      (0,26,0): nan,
      (0,27,0): nan,
      (0,28,0): nan,
      (0,29,0): nan,
      (0,30,0): nan,
      (0,31,0): nan,
      (0,32,0): nan,
      (0,33,0): nan,
      (0,34,0): nan,
      (0,35,0): nan,
      (0,36,0): nan,
      (0,37,0): nan,
      (0,38,0): nan,
      (0,39,0): nan,
      (0,40,0): nan,
      (0,41,0): nan,
      (0,42,0): nan,
      (0,43,0): nan,

Block size of 256:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --local --block-size 256
GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): nan,
      (0,1,0): -0.932222,
      (0,2,0): -0.931106,
      (0,3,0): -0.929937,
      (0,4,0): -0.929327,
      (0,5,0): -0.92884,
      (0,6,0): -0.928138,
      (0,7,0): -0.928094,
      (0,8,0): -0.927308,
      (0,9,0): -0.926688,
      (0,10,0): -0.926199,
      (0,11,0): -0.925902,
      (0,12,0): -0.925131,
      (0,13,0): -0.924653,
      (0,14,0): -0.924199,
      (0,15,0): -0.923768,
      (0,16,0): -0.923444,
      (0,17,0): -0.922861,
      (0,18,0): -0.922458,
      (0,19,0): -0.922242,
      (0,20,0): -0.921639,
      (0,21,0): -0.921327,
      (0,22,0): -0.921066,
      (0,23,0): -0.920946,
      (0,24,0): -0.92021,
      (0,25,0): -0.919626,
      (0,26,0): -0.919327,
      (0,27,0): -0.91917,
      (0,28,0): -0.918547,
      (0,29,0): -0.918441,
      (0,30,0): -0.917988,
      (0,31,0): -0.917847,
      (0,32,0): -0.917361,
      (0,33,0): -0.9172,
      (0,34,0): -0.916821,
      (0,35,0): -0.916179,
      (0,36,0): -0.915846,
      (0,37,0): -0.915483,
      (0,38,0): nan,
      (0,39,0): nan,
      (0,40,0): nan,
      (0,41,0): nan,
      (0,42,0): nan,
      (0,43,0): nan,

Block size of 1024:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --block-size 1024
GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): -0.932222,
      (0,1,0): -0.931106,
      (0,2,0): -0.929937,
      (0,3,0): -0.929327,
      (0,4,0): -0.928841,
      (0,5,0): -0.928138,
      (0,6,0): -0.928094,
      (0,7,0): -0.927308,
      (0,8,0): -0.926688,
      (0,9,0): -0.9262,
      (0,10,0): -0.925902,
      (0,11,0): -0.925131,
      (0,12,0): -0.924653,
      (0,13,0): -0.924199,
      (0,14,0): -0.923767,
      (0,15,0): -0.923444,
      (0,16,0): -0.922861,
      (0,17,0): -0.922458,
      (0,18,0): -0.922242,
      (0,19,0): -0.921639,
      (0,20,0): -0.921327,
      (0,21,0): -0.921066,
      (0,22,0): -0.920946,
      (0,23,0): -0.92021,
      (0,24,0): -0.919625,
      (0,25,0): -0.919329,
      (0,26,0): -0.91917,
      (0,27,0): -0.918547,
      (0,28,0): -0.918441,
      (0,29,0): -0.917988,
      (0,30,0): -0.917847,
      (0,31,0): -0.917361,
      (0,32,0): -0.9172,
      (0,33,0): -0.91682,
      (0,34,0): -0.916179,
      (0,35,0): -0.915846,
      (0,36,0): -0.915483,
      (0,37,0): -0.915208,
      (0,38,0): -0.915144,
      (0,39,0): -0.914922,
      (0,40,0): -0.914535,
      (0,41,0): -0.914342,
      (0,42,0): -0.91416,
      (0,43,0): -0.91385,

Debugging

  • nan values stem from std::squrt(-w[i]) in solveRank1Problem
    • Some of the w[i] are positive
      • Also happens with --pika:threads=1

z_tiles[0].ptr()[i] = std::copysign(std::sqrt(-w[i]), z_ptr[to_sizet(i)]);

With k = 2146:

  • w[50] = 0.000914332923
  • w[51] = 4.66800714e-07
  • w[165] = 5.93085525e-10
  • w[166] = 0.00505260751

With k = 2044:

  • w[1790] = 1.14765066e-10

With k = 1

  • w[0] = 1
@RMeli RMeli self-assigned this Aug 11, 2023
@RMeli RMeli added the Type:Bug Something isn't working label Aug 11, 2023
@msimberg msimberg moved this to Todo in DLA-F Planning Aug 15, 2023
@albestro
Copy link
Collaborator

With @RMeli we investigated a bit and here I'm going to summarise what we found out. We are not fully sure about it, but it might be relevant for fixing this issue.

  • sqrt(-w[i]) is the origin of NaNs (at step n), which produces NaNs because w[i] > 0 for some i
  • w[i] = f(q, d_ptr) and d_ptr
    • q contains the eigenvectors, modified by laed4
    • d_ptr is the vector with eigenvalues, which should be sorted as laed4 precondition requires

Apparently, the precondition about sorted eigenvalues is not always verified. Indeed, we added checks about d0 sorting at various step of the mergeSubproblems algorithm:

auto mergeSubProblems() {
// d0 is in "initial" state with (potentially) 4 groups
// non-deflated | deflated | non-deflated | deflated
// each pair coming from one of the two subproblems

assembleZVec(rho, e0); // -> z0 = f(e0)

// i1 is created and, from it, i2 is created for sorting all eigenvalues of both problems
CHECK_SORTED(d0, i2);          // d0-post-sort             [0,n)
BACKUP(i2);
BACKUP(d0);

applyDeflation(tol, rho, d0, z0, ...);
k = stablePartition(...); // -> i3

CHECK_BACKUP(i2);              // CHANGE i2
CHECK_BACKUP(d0);              // CHANGE d0
CHECK_SORTED(d0, i2);          // d0-post-stable-full      [0,n)
CHECK_SORTED(d0, i3);          // d0-post-stable-nd-d|k    [0,k)|[k, n)

// they are separated in non-deflated and deflated
// but each group should be sorted
sort(d0, i3); 

CHECK_SORTED(d0);              // d0-sorted|k              [0,k)|[k, n)
}

What we found out is that at some point, applyDeflation introduces values that make the eigenvalues array not fully sorted anymore.

This has been tested with the input provided by CP2K (i.e. 5120 case), with different blocksizes (128, 256).

This sorting problem appears both on pre-bulk (b44340f PR888 + patch for HDF5) and on master, but:

  • on master it results in NaNs
  • on pre-bulk it produces results, but from a quick check in CP2K, it seems that the results are not correct

I look forward to confirmation that expectations about sorting are correct as we stated in this report (@rasolca can you have a look, please?). If yes, from a naive point of view it looks like a numerical problem somewhere.

I'm not sure it will solve the NaNs that @RMeli addressed, because it might be that bulk introduces other approximations. So, for the moment I won't open a separate issue, but in the end, if we realise they are two different problems, we will open a separate issue about this sorting problem.

Details

In the output, you will find also some internal output from applyDeflation that reports just the changes done on eigenvalues, and it looks like a block like this

tol = 1.95696e-05
c = -1 s = -0.000785526
i1s = 257 i2s = 0 @ 1|2
d1 = -0.949937 d2 = -0.932222
d1 = -0.949937 d2 = -0.932222

Now, let's start with an extent of the output of the run with 256 blocksize. I took the very first part of the output and split it in parts, trying to make it easier to read.

[0]
tol = 1.95696e-05
c = -1 s = -0.000785526
i1s = 257 i2s = 0 @ 1|2
d1 = -0.949937 d2 = -0.932222
d1 = -0.949937 d2 = -0.932222
tol = 1.95696e-05
c = 0.819746 s = -0.572728
i1s = 3 i2s = 258 @ 5|6
d1 = -0.928868 d2 = -0.928827
d1 = -0.928854 d2 = -0.92884
tol = 1.95696e-05
c = -0.772187 s = -0.635396
i1s = 340 i2s = 74 @ 158|159
d1 = 1.71013 d2 = 1.71014
d1 = 1.71014 d2 = 1.71014
tol = 1.95696e-05
c = -0.00174197 s = 0.999999
i1s = 185 i2s = 186 @ 392|393
d1 = 12.7213 d2 = 12.7305
d1 = 12.7305 d2 = 12.7213
CHANGE d0 3 -0.928854 -0.928868
CHANGE d0 74 1.71014 1.71014
CHANGE d0 185 12.7305 12.7213
CHANGE d0 186 12.7213 12.7305
CHANGE d0 258 -0.92884 -0.928827
CHANGE d0 340 1.71014 1.71013
d0-post-stable-full NOT SORTED 0:512 1
393,

In the previous part it can be seen that there is an unexpected out of order value in position 393 (they should be fully sorted), and actually index 393 has been changed by applyDeflation.

tol = 1.95696e-05
c = -0.00174197 s = 0.999999
i1s = 185 i2s = 186 @ 392|393
d1 = 12.7213 d2 = 12.7305
d1 = 12.7305 d2 = 12.7213

It looks quite a big change, but I don't have deep knowledge of this algorithm. Anyway, this change does not really end up causing any issue once i3 is applied, so the precondition stands in this case.

<continue from above>
tol = 6.6807e-06
c = 1 s = 1.74006e-05
i1s = 256 i2s = 1 @ 0|2
d1 = -1.15486 d2 = -0.921493
d1 = -1.15486 d2 = -0.921493
tol = 6.6807e-06
c = 0.801806 s = -0.597584
i1s = 193 i2s = 450 @ 387|388
d1 = 6.2362 d2 = 6.23621
d1 = 6.2362 d2 = 6.2362
CHANGE d0 193 6.2362 6.2362
CHANGE d0 450 6.2362 6.23621

At this step in the output above nothing happens, there are changes, but the sorting is not altered.

<continue from above>
tol = 6.56497e-06
c = -1 s = 1.51461e-05
i1s = 0 i2s = 3 @ 0|3
d1 = -0.966395 d2 = -0.911016
d1 = -0.966395 d2 = -0.911016
tol = 6.56497e-06
c = 1 s = -1.08164e-05
i1s = 0 i2s = 5 @ 0|5
d1 = -0.966395 d2 = -0.909847
d1 = -0.966395 d2 = -0.909847
tol = 6.56497e-06
c = 1 s = -9.92931e-05
i1s = 0 i2s = 6 @ 0|6
d1 = -0.966395 d2 = -0.909007
d1 = -0.966395 d2 = -0.909007
tol = 6.56497e-06
c = 1 s = 6.15887e-05
i1s = 0 i2s = 7 @ 0|7
d1 = -0.966395 d2 = -0.908289
d1 = -0.966395 d2 = -0.908289
tol = 6.56497e-06
c = 7.6434e-06 s = 1
i1s = 17 i2s = 257 @ 18|21
d1 = -0.901174 d2 = -0.634082
d1 = -0.634082 d2 = -0.901174
CHANGE d0 17 -0.634082 -0.901174
CHANGE d0 257 -0.901174 -0.634082
d0-post-stable-full NOT SORTED 0:512 2
19, 21,
d0-post-stable-nd-d|486 NOT SORTED 486:512 1
495,
d0-sorted|486 NOT SORTED 486:512 1
495,

We can see that:

  • applyDeflation altered the original full-order given by i2 (index 19, 21 are out of order now)
  • the sorting is not ensured to the following steps of the algorithm (index 495 is out of order, but it is in the deflated part so it does not violate the precondition, at least the laed4 one)

PRE-BULK vs MASTER (with bulk)

This sorting problem exists in both versions, but apparently, just for the bulk it manifests with NaNs.

image

As can be seen from the diff output, at some point deflation has NaNs for post-bulk (right). Since it appears in c and s, it means that at that point z, which is populated from eigenvectors, is already filled with NaNs by the previous merge step. Indeed, the previous step about ordering reports

d0-post-stable-full NOT SORTED 0:2560 22
92, 149, 288, 495, 497, 542, 553, 657, 661, 759, 763, 803, 876, 1150, 1380, 1517, 1731, 1805, 1909, 1976, 2310, 2316,
d0-post-stable-nd-d|2044 NOT SORTED 0:2044 1
1791,
d0-post-stable-nd-d|2044 NOT SORTED 2044:2560 1
2136,
d0-sorted|2044 NOT SORTED 0:2044 1
1791,
d0-sorted|2044 NOT SORTED 2044:2560 1
2136,

and in particular

d0-sorted|2044 NOT SORTED 0:2044 1
1791,

says that d0-sorted has index 1791 out of order, which is in the non-deflated part (1791 < k=2044).

At this point I'm still not sure why pre-bulk and post-bulk behaves differently in terms of results, but for sure they address the same sorting problem (which looks like a problem from my naive knowledge of the algorithm).

@albestro
Copy link
Collaborator

As reported in #960, it was a problem existing in our tridiagonal solver since the very beginning. @rasolca confirmed what we @RMeli and I reported in the very last message, and proposed and implemented a fix.

Just for the records, a note about the bulkerification, since all initial evidences were pointing at it as "the suspect". In this specific case it ended up with NaN values, but apparently it was just a different way of manifesting the same root cause (probably due to different negligible numeric approximations, that are actually present in the two different variants of the algorithm).

@albestro
Copy link
Collaborator

See #960 (comment) for more details.

That PR might have fixed this issue, but we are still not sure. For sure, that PR fixes a problem in tridiagonal solver.

@RMeli
Copy link
Member Author

RMeli commented Sep 4, 2023

The discrepancy from the results reported here and the ones reported in #960 are down to EPS_DEFAULT.

-      EPS_DEFAULT 1.0E-12
+      EPS_DEFAULT 1.0E-5
+  &FM
+    FORCE_BLOCK_SIZE .TRUE.
+    NCOL_BLOCKS 256
+    NROW_BLOCKS 256
+  &END

With EPS_DEFAULT=1.0E-5 I have:

DLA-Future

 SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   16.0     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   17.0     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   16.9     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   16.9     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   16.9     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   16.9     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   16.9     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   17.0     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   16.9     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   16.9     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   17.0     0.00001917     -2202.5836419845 -1.64E-08
    12 DIIS/Diag.  0.20E-04   17.0     0.00001266     -2202.5836419833  1.28E-09
    13 DIIS/Diag.  0.12E-05   17.0     0.00000152     -2202.5836419863 -3.09E-09

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9963829224        0.0036170776
  Core density on regular grids:             1023.9974460717       -0.0025539283
  Total charge density on r-space grids:        0.0010631494
  Total charge density g-space grids:           0.0010631494

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.21912796280003
  Hartree energy:                                            2289.02439878388213
  Exchange-correlation energy:                               -533.21719010794914

  Total energy:                                             -2202.58364198634308

ScaLAPACK

SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   41.5     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   43.9     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   43.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   43.6     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   43.7     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   43.7     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   43.6     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   43.8     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   43.7     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   43.6     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   43.7     0.00001917     -2202.5836419845 -1.64E-08
    12 DIIS/Diag.  0.20E-04   43.7     0.00001266     -2202.5836419833  1.27E-09
    13 DIIS/Diag.  0.12E-05   43.7     0.00000152     -2202.5836419863 -3.08E-09

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9963829224        0.0036170776
  Core density on regular grids:             1023.9974460717       -0.0025539283
  Total charge density on r-space grids:        0.0010631494
  Total charge density g-space grids:           0.0010631494

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.21912796279980
  Hartree energy:                                            2289.02439878388213
  Exchange-correlation energy:                               -533.21719010794914

  Total energy:                                             -2202.58364198634354

@RMeli RMeli closed this as completed Sep 4, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in DLA-F Planning Sep 4, 2023
@RMeli
Copy link
Member Author

RMeli commented Sep 4, 2023

Thanks @albestro and @rasolca for the help in fixing this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type:Bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

4 participants